statgen / savvy

Interface to various variant calling formats.
Mozilla Public License 2.0
26 stars 5 forks source link

Output test file has an allele missing #18

Closed rick-heig closed 2 years ago

rick-heig commented 2 years ago

Hello.

I have been running a small test file to see if mixed ploidy (1 and 2) and mixed phasing variant lines are supported and found out that there seems to be an edge case where an allele is lost.

Here is the input (test_file.vcf) :

##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20150218
##reference=ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
##contig=<ID=20,assembly=b37,length=63025520>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE1 SAMPLE2 SAMPLE3 SAMPLE4 SAMPLE5 SAMPLE6 SAMPLE7 SAMPLE8 SAMPLE9 SAMPLE10
20      60343   rs527639301     G       A       100     PASS    .       GT      0|0     1|0     1|.     0|0     0|0     0       1|0     0/1     0|0     0|0
20      60419   rs538242240     A       G       100     PASS    .       GT      0|0     1|0     0|0     1|1     0|0     0|0     1|1     0|0     1|0     1|0
20      60479   rs149529999     C       T       100     PASS    .       GT      0/0     0|1     0/0     0/0     0/0     0/0     0/0     0/0     0|1     0|0
20      60522   rs150241001     T       TC      100     PASS    .       GT      0|0     0|0     1|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0
20      60568   rs533509214     A       C       100     PASS    .       GT      0|0     ./0     0|0     0|0     0|0     0|0     0|0     0       0|0     0|0
20      60571   rs116145529     C       A       100     PASS    .       GT      1       0|0     0|1     0|0     0|0     1|1     0|0     1|0     0|0     0|0
20      60795   rs184056664     G       C       100     PASS    .       GT      0/1     0|0     0|0     1|1     0|0     0|0     1|0     0|0     0|0     0
20      60808   rs534548532     G       A       100     PASS    .       GT      0|0     0|0     1|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0
20      60810   rs527408846     G       GA      100     PASS    .       GT      0       1       0       0       0       0       0       0       0       0|0
20      60826   rs557778563     A       G       100     PASS    .       GT      1       0       0       0       0       0       0       0       0       0

After running :

sav import test_file.vcf --phasing partial --pbwt GT test_file.sav
sav export test_file.sav > test_file.sav.vcf

I get :

##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20220128
##reference=ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
##contig=<ID=20,assembly=b37,length=63025520>
##FORMAT=<ID=GT, Type=String, Number=1, Description="Genotype">
##phasing=partial
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE1 SAMPLE2 SAMPLE3 SAMPLE4 SAMPLE5 SAMPLE6 SAMPLE7 SAMPLE8 SAMPLE9 SAMPLE10
20      60343   rs527639301     G       A       100     PASS    .       GT      0|0     1|0     1|.     0|0     0|0     0       1|0     0/1     0|0     0|0
20      60419   rs538242240     A       G       100     PASS    .       GT      0|0     1|0     0|0     1|1     0|0     0|0     1|1     0|0     1|0     1|0
20      60479   rs149529999     C       T       100     PASS    .       GT      0/0     0|1     0/0     0/0     0/0     0/0     0/0     0/0     0|1     0|0
20      60522   rs150241001     T       TC      100     PASS    .       GT      0|0     0|0     1|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0
20      60568   rs533509214     A       C       100     PASS    .       GT      0|0     ./0     0|0     0|0     0|0     0|0     0|0     0       0|0     0|0
20      60571   rs116145529     C       A       100     PASS    .       GT      1       0|0     0|1     0|0     0|0     1|1     0|0     1|0     0|0     0|0
20      60795   rs184056664     G       C       100     PASS    .       GT      0/1     0|0     0|0     1|1     0|0     0|0     1|0     0|0     0|0     0
20      60808   rs534548532     G       A       100     PASS    .       GT      0|0     0|0     1|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0
20      60810   rs527408846     G       GA      100     PASS    .       GT      0       1       0       0       0       0       0       0       0       0
20      60826   rs557778563     A       G       100     PASS    .       GT      1       0       0       0       0       0       0       0       0       0

The two following lines differ :

20      60810   rs527408846     G       GA      100     PASS    .       GT      0       1       0       0       0       0       0       0       0       0|0
20      60810   rs527408846     G       GA      100     PASS    .       GT      0       1       0       0       0       0       0       0       0       0

The last sample is now haploid instead of diploid on that variant loci, an allele is lost (0|0 becomes 0).

Best, Rick

rick-heig commented 2 years ago

Version :

sav -v
sav v2.0.1
jonathonl commented 2 years ago

Hi Rick,

I believe this is already fixed with https://github.com/statgen/savvy/commit/a6f39a4884f3c21881a15a21735dee033e2b3d7b. I tested your example with the latest from the master branch and the genotype in question remains diploid after the round trip. Though I can reproduce the issue by using the v2.0.1 release. A new release is well overdue (my apologies).

jonathonl commented 2 years ago

Side note: you will need to manually set a --sparse-threshold that is less than 1.0 in order for --pbwt GT to have any effect. My current recommendation is --sparse-threshold 0.01. Better documentation of this is definitely needed.

rick-heig commented 2 years ago

Thank you for the comments, I get the recommendation for the threshold, I was just testing some edge cases with the very small test file and forgot to set a threshold but found this issue, so I shared it. Good to know it is already taken care of and will be fixed in the next release.

Best regards. Rick