lh3 / bgt

Flexible genotype query among 30,000+ samples whole-genome
MIT License

BGT issue with Multi Allelic Variant Sites #16

Open rick-heig opened 2 years ago

rick-heig commented 2 years ago

Hello. As a test, I ran BGT on chrX of 1KGP3, which is available from the following FTP link: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrX.phase3_shapeit2_mvncall_integrated_v1c.20130502.genotypes.vcf.gz

These are the commands I used (I first converted the vcf.gz file above to BCF with bcftools to save space):

bgt import chrX.bcf out/chrX
bgt view -b out/chrX.bgt > out/chrX.bcf

When I compared the output file with the input file using bcftools view, I found that the first variant had been unexpectedly split.

Expected (only the first sample's GT is shown because of the line size):

X       60020   .       T       TA,TAAC 100     PASS    AC=10,92;AF=0.00199681,0.0183706;AN=5008;NS=2504;DP=11848;AMR_AF=0.0029,0.0086;AFR_AF=0.0008,0.0635;EUR_AF=0,0.002;SAS_AF=0.0031,0;EAS_AF=0.004,0;VT=INDEL;MULTI_ALLELIC        GT      0|0 ...

Instead, I got two variants:

X       60020   .       T       TA,<M>   0     .      .        GT      0/0 ...
X       60020   .       T       TAAC,<M> 0     .      .        GT      0/0 ...

I understand that the multi-allelic site has been split. What is strange, however, is that both lines contain genotype allele values of 0, 1, and 2.

So how should I interpret a genotype such as 2/0 when it occurs in the first line or in the second line?
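To make the question concrete, here is a minimal Python sketch of how GT indices map onto alleles under the standard VCF convention (0 = REF, 1 = first ALT, 2 = second ALT). The helper `decode_gt` is hypothetical and written only for illustration; it is not part of BGT or bcftools. It shows that, read literally, a 2 on either split line points at the symbolic `<M>` allele.

```python
def decode_gt(ref, alt, gt):
    """Map GT allele indices to allele strings using the VCF convention:
    0 = REF, 1 = first ALT, 2 = second ALT, '.' = missing."""
    alleles = [ref] + alt.split(",")
    sep = "|" if "|" in gt else "/"  # phased vs. unphased separator
    return sep.join("." if a == "." else alleles[int(a)] for a in gt.split(sep))


# REF and ALT columns of the two split records in the BGT output.
first = ("T", "TA,<M>")
second = ("T", "TAAC,<M>")

print(decode_gt(*first, "2/0"))          # <M>/T
print(decode_gt(*second, "2/0"))         # <M>/T
print(decode_gt("T", "TA,TAAC", "0|0"))  # T|T (original multi-allelic record)
```

If `<M>` stands for "some other non-reference allele at this site" (an assumption, not something confirmed in this thread), then a 2 on the TA line would correspond to TAAC in the original record, and a 2 on the TAAC line would correspond to TA.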

Thanks. Best, Rick