kishwarshafin / pepper

PEPPER-Margin-DeepVariant
MIT License
245 stars 42 forks source link

AD field looks wierd #156

Closed Yijun-Tian closed 2 years ago

Yijun-Tian commented 2 years ago

Hello, I ran the docker program and get some PEPPER_MARGIN_DEEPVARIANT_FINAL_OUTPUT.vcf.gz file. But the AD field doesn't look right:

user@/ONT_Open_data$ grep -v "#" FAQ32172Hap.vcf | head
1   10472   .   G   A   0.1 refCall .   GT:GQ:DP:AD:VAF:C:PS    ./.:16:3:2:0.666667:DV:.
1   10475   .   A   G   0.1 refCall .   GT:GQ:DP:AD:VAF:C:PS    ./.:16:3:2:0.666667:DV:.
1   10622   .   T   G   0.3 refCall .   GT:GQ:DP:AD:VAF:C:PS    ./.:12:3:3:1:DV:.
1   10623   .   T   C   0.3 refCall .   GT:GQ:DP:AD:VAF:C:PS    ./.:12:3:3:1:DV:.
1   10802   .   A   G   0   refCall .   GT:GQ:DP:AD:VAF:C:PS    ./.:19:3:2:0.666667:DV:.
1   10815   .   T   G   0.1 refCall .   GT:GQ:DP:AD:VAF:C:PS    ./.:19:3:2:0.666667:DV:.
1   10821   .   T   C   0.1 refCall .   GT:GQ:DP:AD:VAF:C:PS    ./.:16:3:2:0.666667:DV:.
1   10829   .   T   C   0   refCall .   GT:GQ:DP:AD:VAF:C:PS    0/0:20:3:2:0.666667:DV:.
1   10897   .   T   A   0.1 refCall .   GT:GQ:DP:AD:VAF:C:PS    ./.:18:3:3:1:DV:.
1   10927   .   A   G   0   refCall .   GT:GQ:DP:AD:VAF:C:PS    0/0:22:4:3:0.75:DV:.

Should I expect the AD to follow a pattern of "A,B"? I checked the logs and found a few error or warning messages: 1_pepper.log and 3_pepper_hp.log: /usr/local/lib/python3.8/dist-packages/torch/onnx/symbolic_opset9.py:2095: UserWarning: Exporting a model to ONNX with a batch_size other than 1, with a variable length with LSTM can cause an error when running the ONNX model with a different batch size. Make sure to save the model with a batch size of 1, or define the initial states (h0/c0) as inputs of the model. warnings.warn("Exporting a model to ONNX with a batch_size other than 1, " + 5_deepvariant.log:

WARNING:tensorflow:Using temporary folder as model directory: /tmp/tmphu89qgpk
/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/engine/base_layer_v1.py:1692: UserWarning: `layer.apply` is deprecated and will be removed in a future version. Please use `layer.__call__` method instead.
  warnings.warn('`layer.apply` is deprecated and '

Is there anything wrong?

Thanks

kishwarshafin commented 2 years ago

Hi @Yijun-Tian , The warning are trivial, you can ignore those.

What specifically looks off to you about the AD field? Please know that not all reads will be considered here, reads with low base quality at that site and with low mapping quality will be excluded from this calculation. You can look at the site on IGV and color the reads based on their mapping quality to see if that's the issue here.

Yijun-Tian commented 2 years ago

Hi Kishwar, It's great to hear that the warning messages are OK. I just reviewed the BAM coverage and indeed there are some low quality bases there. But I expect the allelic depth field should present as "ref,alt" form for heterozygote variants in the file. But I checked and found such form only appear for a small proportion of the variants, which are all with multiple alternative alleles, see belows:

(base) user@user-x11dai-n:/mnt/raid0/Yijun_Tian/ONT_Open_data$ grep -v "#" FAQ32172Hap.vcf | cut -f7 | sort | uniq -c
1377524 PASS
1457745 refCall
(base) user@user-x11dai-n:/mnt/raid0/Yijun_Tian/ONT_Open_data$ grep -v "#" FAQ32172Hap.vcf | cut -f5 | grep "," | wc -l
5249
(base) user@user-x11dai-n:/mnt/raid0/Yijun_Tian/ONT_Open_data$ grep -v "#" FAQ32172Hap.vcf | cut -f10 | grep "," | wc -l
5249
(base) user@user-x11dai-n:/mnt/raid0/Yijun_Tian/ONT_Open_data$ grep -v "#" FAQ32172Hap.vcf | grep "," | head
1   981098  .   G   T,A 53.1    PASS    .   GT:GQ:DP:AD:VAF:C:PS    2/2:3:4:2,2:0.5,0.5:DV:.
1   1019806 .   G   T,A 4.6 PASS    .   GT:GQ:DP:AD:VAF:C:PS    0/2:2:4:2,2:0.5,0.5:DV:.
1   1077827 .   C   G,T 45.8    PASS    .   GT:GQ:DP:AD:VAF:C:PS    2/2:2:5:2,2:0.4,0.4:DV:.
1   1265603 .   TTGTGTGTGTGTGTGTGTGTG   T,TTGTGTGTGTGTGTGTGTGTGTGTGTGTG 13.5    PASS    .   GT:GQ:DP:AD:VAF:C:PS    0/2:4:8:2,2:0.25,0.25:DV:.
1   1595896 .   G   GAC,C   10.6    PASS    .   GT:GQ:DP:AD:VAF:C:PS    1/1:5:5:2,2:0.4,0.4:DV:.
1   2235710 .   C   T,A 52  PASS    .   GT:GQ:DP:AD:VAF:C:PS    2/2:6:4:2,2:0.5,0.5:DV:.
1   2660634 .   G   C,A 17  PASS    .   GT:GQ:DP:AD:VAF:C:PS    1/1:17:7:5,2:0.714,0.286:P:.
1   3361002 .   TTG TTCTG,T 7.8 PASS    .   GT:GQ:DP:AD:VAF:C:PS    1/2:2:8:2,2:0.25,0.25:DV:.
1   3619116 .   G   GTC,C   8.2 PASS    .   GT:GQ:DP:AD:VAF:C:PS    0/2:2:4:2,2:0.5,0.5:DV:.
1   3815090 .   C   T,CT    7.4 PASS    .   GT:GQ:DP:AD:VAF:C:PS    0/2:2:6:3,3:0.5,0.5:DV:.
kishwarshafin commented 2 years ago

Hi @Yijun-Tian, Lets take this call as an example:

1   1019806 .   G   T,A 4.6 PASS    .   GT:GQ:DP:AD:VAF:C:PS    0/2:2:4:2,2:0.5,0.5:DV:.

The ref is G, the alt-alleles at this position are T and A.

DP: Depth of the site is 4 AD: Allelic depth of T, A is 2,2 meaning T is supported by 2 reads and A is supported by 2 reads. VAF: Variant allele frequency = AD/DP are similarly 0.5 and 0.5

Does this clarify anything?

Yijun-Tian commented 2 years ago

Hi Kishwar, I think based on your description, this field should AC instead of AD. In below specs, the AD field is for "Total read depth for each allele" But the AC field for "Allele count in genotypes, for each ALT allele, in the same order as listed" Check v4.3 page 8

https://samtools.github.io/hts-specs/VCFv4.1.pdf https://samtools.github.io/hts-specs/VCFv4.2.pdf https://samtools.github.io/hts-specs/VCFv4.3.pdf

kishwarshafin commented 2 years ago

Hi @Yijun-Tian ,

v4.3 specifications were out on April 2022, we released this in March. In version v4.1 and v4.2 you won't see any specification for AD under "Fixed fields" section.

Thank you for bringing this to our attention tough, it's very helpful.