ultimatesource / denovogear

A program to detect denovo-variants using next-generation sequencing data.
http://www.nature.com/nmeth/journal/v10/n10/full/nmeth.2611.html
GNU General Public License v3.0
49 stars 25 forks source link

why this is not picked up as a de-novo variant #287

Closed jielab closed 5 years ago

jielab commented 6 years ago

Hi, I use the following command to identify de-novo vairant on the following test.vcf file. dng dnm auto --ped sample.ped --bcf test.vcf

It seems to me that this is a de-novo variant, but why denovogear did not pick it up?

Thanks!

=== sample.ped === 1 07932 0 0 1 1 1 07931 0 0 2 1 1 07930 07932 07931 1 2

=== test.vcf ===

fileformat=VCFv4.2

FILTER=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

INFO=

INFO=

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 07930 07931 07932

1 175116065 . A T 41.93 PASS DP=123;MQ=60 GT:AD:DP:GQ:PL 0/1:14,9:23:67:67,0,285 0/0:49,0:49:0:0,0,517 0/0:40,0:40:0:0,0,396

reedacartwright commented 6 years ago

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 07930 07931 07932

1 175116065 . A T 41.93 PASS DP=123;MQ=60 GT:AD:DP:GQ:PL 0/1:14,9:23:67:67,0,285 0/0:49,0:49:0:0,0,517 0/0:40,0:40:0:0,0,396

@jiehuang001 I bolded the PL fields, which dng dnm uses for genotype likelihoods. Both the mother and father have PLs indicating that their 0/0 and 0/1 genotypes are equally likely and their GT qualities are 0. Therefore, when dng dnm calculates the likelihood that there is a de novo at the site, it concludes that it is low because the child being heterozygous can be reasonably explained from one parent being heterozygous, which is supported by the PL fields.

Note that the PL fields and the AD fields appear to be inconsistent on this line, probably due to manual editing and not genotype calling.

jielab commented 6 years ago

Dear Reed:

Thank you very much!

Below is the raw VCF content for this particular SNP, when I use the “tabix” command to extract.

tabix merge.vcf.gz 1:175116065-175116065

1 175116065 . A T 41.93 . AC=1;AF=0.167;AN=6;BaseQRankSum=-2.625;ClippingRankSum=0.000;DP=123;ExcessHet=4.7712;FS=13.183;MLEAC=2;MLEAF=0.333;MQ=60.00;MQRankSum=0.000;QD=1.82;ReadPosRankSum=-0.126;SOR=4.119 GT:AD:DP:GQ:PL 0/1:14,9:23:67:67,0,285 0/0:49,0:49:0:0,0,517 0/0:40,0:40:0:0,0,396

For both the second sample (mother) and the third sample (father), the PL field starts with “0,0”. If this means that the “their 0/0 and 0/1 genotypes are equally likely”, then I feel a variant like this one should be called missing and therefore discarded. So, is there a way to set a threshold for PL field for DenovoGear? This is the first time I work with PL field. Previously I have used GP field from imputed data. Usually, we call those variants with a maximum GP <0.9 to be msising.

Thank you very much & best regards,

Jie

From: Reed A. Cartwright notifications@github.com Sent: 2018年8月28日 19:16 To: denovogear/denovogear denovogear@noreply.github.com Cc: jiehuang001 jiehuang001@gmail.com; Mention mention@noreply.github.com Subject: Re: [denovogear/denovogear] why this is not picked up as a de-novo variant (#287)

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 07930 07931 07932

1 175116065 . A T 41.93 PASS DP=123;MQ=60 GT:AD:DP:GQ:PL 0/1:14,9:23:67:67,0,285 0/0:49,0:49:0:0,0,517 0/0:40,0:40:0:0,0,396

@jiehuang001 https://github.com/jiehuang001 I bolded the PL fields, which dng dnm uses for genotype likelihoods. Both the mother and father have PLs indicating that their 0/0 and 0/1 genotypes are equally likely and their GT qualities are 0. Therefore, when dng dnm calculates the likelihood that there is a de novo at the site, it concludes that it is low because the child being heterozygous can be reasonably explained from one parent being heterozygous, which is supported by the PL fields.

Note that the PL fields and the AD fields appear to be inconsistent on this line, probably due to manual editing and not genotype calling.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/denovogear/denovogear/issues/287#issuecomment-416770699 , or mute the thread https://github.com/notifications/unsubscribe-auth/AZsvf55_tyH4w45_bFcgFP8JeIwRAyRYks5uVc8hgaJpZM4WM24x . https://github.com/notifications/beacon/AZsvf-Ed882hrwo8mbSSstklGBPm-t5jks5uVc8hgaJpZM4WM24x.gif

jielab commented 6 years ago

Dear Reed:

Thank you very much!

Below is the raw VCF content for this particular SNP, when I use the “tabix” command to extract. tabix merge.vcf.gz 1:175116065-175116065 1 175116065 . A T 41.93 . AC=1;AF=0.167;AN=6;BaseQRankSum=-2.625;ClippingRankSum=0.000;DP=123;ExcessHet=4.7712;FS=13.183;MLEAC=2;MLEAF=0.333;MQ=60.00;MQRankSum=0.000;QD=1.82;ReadPosRankSum=-0.126;SOR=4.119 GT:AD:DP:GQ:PL 0/1:14,9:23:67:67,0,285 0/0:49,0:49:0:0,0,517 0/0:40,0:40:0:0,0,396

For both the second sample (mother) and the third sample (father), the PL field starts with “0,0”. If this means that the “their 0/0 and 0/1 genotypes are equally likely”, then I feel a variant like this one should be called missing and therefore discarded. So, is there a way to set a threshold for PL field for DenovoGear? This is the first time I work with PL field. Previously I have used GP field from imputed data. Usually, we call those variants with a maximum GP <0.9 to be missing.

Thank you very much & best regards,

Jie

reedacartwright commented 6 years ago

I was wrong apparently GATK is responsible for the disconnection between PL and AD fields. Either this is a bug, or there is some indel or other complexity in the region and GATK is unable to keep the PL and AD in sync.

Regardless, I do not believe that dng dnm has an option to filter on PL or related fields. It relies on the user prefiltering their VCF, perhaps piping from bcftools. I also don't know why you would need such a filter because dng dnm didn't call a mutation there anyway. Its model automatically picked up on the low quality of genotype calls and responded accordingly.