kundajelab / gecco-variants


Analysis approach: Train on subject variants inserted into the model rather than on reference sequence #2

Open annashcherbina opened 6 years ago

akundaje commented 6 years ago

Actually, I think a better idea is to replace the one-hot encodings with allele probabilities obtained from running variant calling on the data.

On Thu, Apr 5, 2018 at 1:37 PM, annashcherbina notifications@github.com wrote:


annashcherbina commented 6 years ago

Yes, that makes sense. We discussed an 8-channel model for phased genotypes. Do we currently have enough data to do the phasing? We only have WGS for 6 of the subjects, and they are not related. Or do we just provide the probabilities for a 4-channel model?

i.e. someone heterozygous AC would have [0.5,0.5,0,0] while someone homozygous AA would have [1,0,0,0] or homozygous CC would have [0,1,0,0].
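A minimal sketch of that 4-channel encoding, in case it helps pin down the convention. The function name `genotype_to_channels` is hypothetical, and the channel order A, C, G, T is an assumption: the examples above only fix A and C as the first two channels.

```python
# Assumed channel order: A, C, G, T. The examples in this thread only fix
# A and C as the first two channels; the positions of G and T are a guess.
BASES = "ACGT"


def genotype_to_channels(genotype):
    """Return allele fractions over BASES for an unphased genotype such as 'AC'."""
    counts = [genotype.upper().count(base) for base in BASES]
    total = sum(counts)
    # Reject empty genotypes and any symbol outside A/C/G/T (e.g. 'N').
    if total == 0 or total != len(genotype):
        raise ValueError(f"unsupported genotype: {genotype!r}")
    return [count / total for count in counts]


print(genotype_to_channels("AC"))  # [0.5, 0.5, 0.0, 0.0]
print(genotype_to_channels("AA"))  # [1.0, 0.0, 0.0, 0.0]
print(genotype_to_channels("CC"))  # [0.0, 1.0, 0.0, 0.0]
```

Because the genotype is treated as unphased, "AC" and "CA" give the same vector.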

annashcherbina commented 6 years ago

Or I guess we can get more realistic probabilities by looking at the read counts for each allele in the DNase FASTQ data? But then that gets into the realm of sequencing noise, which might not matter because we have fairly deep coverage.
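A rough sketch of the read-count version, assuming the per-base counts at a position have already been tallied from aligned reads (that step is not shown). The function name `read_fractions` and the `min_depth` cutoff of 10 are hypothetical choices for illustration, not something we have settled on.

```python
# Assumed channel order: A, C, G, T (same assumption as the genotype examples).
BASES = "ACGT"


def read_fractions(counts, min_depth=10):
    """Turn per-base read counts at one position into 4-channel fractions.

    counts: dict mapping base -> number of reads supporting that base.
    Returns None when total depth is below min_depth, so the caller can
    fall back to something else (e.g. the reference one-hot).
    """
    depth = sum(counts.get(base, 0) for base in BASES)
    if depth < min_depth:
        return None
    return [counts.get(base, 0) / depth for base in BASES]


print(read_fractions({"A": 18, "C": 22}))  # [0.45, 0.55, 0.0, 0.0]
print(read_fractions({"A": 3}))            # None (too shallow to trust)
```

Unlike the genotype-based vectors, these fractions carry allelic imbalance and sequencing noise directly into the input, which is the trade-off raised above.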

annashcherbina commented 6 years ago

Actually, we can still do phasing from LD with PLINK: http://zzz.bwh.harvard.edu/plink/haplo.shtml

Is this worth trying, or should I just look at the empirical read fractions for each base and get 4-channel probabilities from that?

akundaje commented 6 years ago

A 4-channel model with probabilities is better because it can easily be integrated with a one-hot model pretrained on the reference genome.

-A
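One way to read the integration point, as a toy sketch: a one-hot vector is a special case of a 4-channel probability vector, so the input shape does not change, and for a purely linear first-layer filter the response to a probability vector is the probability-weighted mix of the responses to the one-hot alleles. The weights below are made up, `filter_response` is a hypothetical stand-in for one tap of a first convolutional layer, and the equality does not carry through nonlinearities in later layers.

```python
import math

BASES = "ACGT"  # assumed channel order


def one_hot(base):
    """One-hot encoding of a single base over BASES."""
    return [1.0 if b == base else 0.0 for b in BASES]


def filter_response(x, weights):
    """Linear response of a single-position filter to a 4-channel input."""
    return sum(xi * wi for xi, wi in zip(x, weights))


weights = [0.5, -1.0, 2.0, 0.25]  # made-up filter weights, for illustration only
het_ac = [0.5, 0.5, 0.0, 0.0]     # heterozygous A/C as probabilities

mixed = filter_response(het_ac, weights)
averaged = (0.5 * filter_response(one_hot("A"), weights)
            + 0.5 * filter_response(one_hot("C"), weights))

# The two agree for a linear filter.
assert math.isclose(mixed, averaged)
```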

On Fri, Apr 6, 2018 at 10:02 AM, annashcherbina notifications@github.com wrote:
