samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
633 stars 241 forks source link

bcftools 1.20 merge (--merge both) gvcfs results in more than one variant at the same location for very complex polyallelic MNP variants #2215

Open jcm6t opened 1 week ago

jcm6t commented 1 week ago

I'm not clear if this is expected behavior or a bug. This is a test region of chr22 with a selected complex polyallelic variant SNP + indels

bcftools 1.20, file formats are VCF 4.2.

$BCFTOOLS -v
bcftools 1.20
Using htslib 1.20
Copyright (C) 2024 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Individual whole genome gvcfs were generated from DRAGEN. Attempting to merge 1379 gcvf files using the following pipeline (same samples as #1891) :

$BCFTOOLS merge --merge both  -f PASS,. --gvcf $REFGENOME -i DP:sum,MQ:avg,DQUAL:min -r chr22:1-10590020  --file-list $gvcflistfile -Ob > testmerge_both_1-20_chr22.bcf &

In the resulting bcf file I find the following two variants at the same location. There may be other other complex variants but this is an exemplar. Since there are 1379 samples, just showing the first sample (first 10 cols):

$BCFTOOLS view testmerge_both_1-20_chr22.bcf |cut -f1-10 | grep 10590013

chr22   10590013    .   AAT AAAAATAT,TAT,<NON_REF>,AAAATATAT,ATAT,AAAAAATAT,AAAAAATATAT,AAATAT,ATATAT,A,AATAT,AAAAATATAT,AAAAAAATAT,AAAATAT 51.72   PASS    MQRankSum=-2.06;ReadPosRankSum
=0.246;FractionInformativeReads=0.971;DP=11694;DQUAL=0;MQ=13.5687   GT:AD:AF:DP:F1R2:F2R1:GQ:DGQ:PL:SPL:ICNT:PRI:SB:MB:MIN_DP:DGT:PS    0/1:10,19,5,0,0,0,0,0,0,0,0,0,0,0,0:0.559,0.147,0,1.96
538e-18,1.96538e-18,1.96538e-18,1.96538e-18,1.96538e-18,1.96538e-18,1.96538e-18,1.96538e-18,1.96538e-18,1.96538e-18,1.96538e-18:34:6,9,2,0,0,0,0,0,0,0,0,0,0,0,0:4,10,3,0,0,0,0,0,0,0,0,0,0,0,0:1:13:0
,0,11,696,0,136,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,79
6,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,
796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796:255,0,255:9,0:0,19,22.01,34.77,53.77,37.78,34.77,53.77,69.54,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,3
7.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,3
7.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,3
7.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,3
7.78,37.78:3,7,13,11:5,5,14,10:.:.:.
chr22   10590013    .   AAT <NON_REF>,ATAT,A,AAAAAATAT,AAAAATAT,AAAAATATAT,ATATAT,AAAAAAAATAT,AAAATATAT,AATATATAT,AAATAT    48.82   PASS    MQRankSum=0.212;ReadPosRankSum=1.991;FractionI
nformativeReads=1;DP=3557;DQUAL=13.47;MQ=15.9321    GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT:AF:F1R2:F2R1:DGQ:PRI:SB:MB:PS:DGT    ./.:.:.:.:.:.:.:.:.:.:.:.:.:.:.:.:.

if I repeat but use --merge both,* I still see the two variants but the only diff seems to be that the alleles have been removed:

chr22   10590013    .   AAT AAAAATAT,TAT,AAAATATAT,ATAT,AAAAAATAT,AAAAAATATAT,AAATAT,ATATAT,A,AATAT,AAAAATATAT,AAAAAAATAT,AAAATAT   51.72   PASS    MQRankSum=-2.06;ReadPosRankSum=0.246;F
ractionInformativeReads=0.971;DP=11694;DQUAL=0;MQ=13.5687   GT:AD:AF:DP:F1R2:F2R1:GQ:DGQ:PL:SPL:ICNT:PRI:SB:MB:MIN_DP:DGT:PS    0/1:10,19,5,0,0,0,0,0,0,0,0,0,0,0:0.559,0.147,1.96538e-18,1.96
538e-18,1.96538e-18,1.96538e-18,1.96538e-18,1.96538e-18,1.96538e-18,1.96538e-18,1.96538e-18,1.96538e-18,1.96538e-18:34:6,9,2,0,0,0,0,0,0,0,0,0,0,0:4,10,3,0,0,0,0,0,0,0,0,0,0,0:1:13:0,0,11,696,0,136,
796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,79
6,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796,796:
255,0,255:9,0:0,19,22.01,34.77,53.77,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78
,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78
,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78,37.78
,37.78,37.78,37.78,37.78,37.78,37.78,37.78:3,7,13,11:5,5,14,10:.:.:.
chr22   10590013    .   AAT ATAT,A,AAAAAATAT,AAAAATAT,AAAAATATAT,ATATAT,AAAAAAAATAT,AAAATATAT,AATATATAT,AAATAT  48.82   PASS    MQRankSum=0.212;ReadPosRankSum=1.991;FractionInformati
veReads=1;DP=3557;DQUAL=13.47;MQ=15.9321    GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT:AF:F1R2:F2R1:DGQ:PRI:SB:MB:PS:DGT    ./.:.:.:.:.:.:.:.:.:.:.:.:.:.:.:.:.

If compare the samples that are contributing to variant 1 and variant 2 (non missing genotype/format entry) there are 37/1379 that have a non-missing genotype in both variants. The majority are non-missing in one or the other.

So question: Should I see only one variant entry using --merge both or --merge both,* (max alleles doesn't appear to be exceeded here). ?

If not, then what attributes force the variants to remain separate ?

Thanks very much,