liguowang / CrossMap

CrossMap is a python program to lift over genome coordinates from one genome version to another.
https://crossmap.readthedocs.io/en/latest/

CrossMap breaks VCF by replacing "*" genotype with a duplicate variant allele #70

Open gnxsf opened 2 months ago

gnxsf commented 2 months ago

The CrossMap.py vcf command seems to replace the "*" genotype (indicating a deletion in one or more samples) with a nucleotide sequence. Here are the #CHROM POS ID REF ALT columns of the problematic position before liftover:

ch01      331     .       AATATATATAT     AAT,AATAT,*,A,AATATAT,AATATATATATAT

Here is the same position after liftover:

ch01       16355   .       AATATATATAT     AAT,AATAT,A,A,AATATAT,AATATATATATAT

You can see that the "*" genotype has been replaced by "A", which now appears twice in the ALT column. Running gatk ValidateVariants on the lifted-over VCF then fails with the following error:

htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 63: Duplicate allele added to VariantContext: A
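For anyone hitting the same error, the offending records can be located without GATK. The following is a minimal sketch, not part of CrossMap or htsjdk: the function name `find_duplicate_alleles` is hypothetical, and it assumes the first five whitespace-separated columns are CHROM, POS, ID, REF, ALT. It reports every record in which an allele appears more than once across REF and ALT, comparing alleles case-insensitively (an assumption; adjust if your validator is case-sensitive).

```python
def find_duplicate_alleles(vcf_lines):
    """Yield (line_number, chrom, pos, duplicated_alleles) for each record
    whose REF and ALT alleles contain a repeated value.

    Hypothetical helper for illustration; line numbers are 1-based and
    count header lines too.
    """
    for line_number, line in enumerate(vcf_lines, start=1):
        # Skip header/meta lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split(maxsplit=5)
        if len(fields) < 5:
            continue  # not a complete record
        chrom, pos, _record_id, ref, alt = fields[:5]
        alleles = [ref] + alt.split(",")
        seen = set()
        duplicated = []
        for allele in alleles:
            key = allele.upper()
            if key in seen and key not in duplicated:
                duplicated.append(key)
            seen.add(key)
        if duplicated:
            yield line_number, chrom, pos, duplicated


if __name__ == "__main__":
    import sys

    # Usage (assumed): python find_dups.py lifted.vcf
    with open(sys.argv[1]) as handle:
        for hit in find_duplicate_alleles(handle):
            print(*hit, sep="\t")
```

Because REF is included in the comparison, the sketch flags both an allele repeated within ALT and an ALT allele identical to REF.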

I realize this is not a lot to go on, but the data is proprietary, so I can't share the VCFs to make this bug reproducible. I'm curious whether this is a known issue, or whether anyone has a suggestion on how to get around this problem.

liguowang commented 2 months ago

The problem is that the VCF format allows some flexibility. My understanding is that * is not a valid genotype in a VCF file (please correct me if I am wrong). Does it mean a complete deletion of the REF allele?

gnxsf commented 2 months ago

Taken from the VCFv4.2 format specification found here:

The ‘*’ allele is reserved to indicate that the allele is missing due to a upstream deletion. If there are no alternative alleles, then the missing value should be used.

Essentially, the asterisk is used at positions where some samples have that position deleted but other samples have an alternate allele.

liguowang commented 2 months ago

Did you use the most recent version (v0.7.1)? I just checked the code, and it should be able to handle the "*" allele (i.e., the asterisk will be kept as is).

gnxsf commented 2 months ago

It looks like the Docker container I had pulled from the Amazon public repository was indeed using an older version. I built a new Docker container using the most recent version, which did solve the "*" replacement issue. However, I'm still getting the same "Duplicate allele added to VariantContext: A" error when I try to validate the VCF file. I believe this is the offending line:

ch01 751408 . A *,A

Is it expected behavior for the reference allele and an alternate allele to be the same? Is there a way to know what the source position is for this variant in the original file? That would help with troubleshooting.
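One possible way to trace a lifted-over record back to its source position is to stamp the original coordinates into the ID column before running the liftover. The following is a rough sketch under stated assumptions: the function name `tag_source_positions` and the `src_` prefix are made up for illustration, the input is tab-delimited, and the approach only helps if the liftover tool carries the ID column through unchanged, which has not been confirmed in this thread.

```python
def tag_source_positions(vcf_lines):
    """Yield VCF lines with the original CHROM and POS recorded in the ID column.

    Hypothetical helper for illustration. Header lines pass through
    untouched. A missing ID (".") is replaced by the tag; an existing ID
    is kept and the tag is appended after a semicolon.
    """
    for line in vcf_lines:
        # Header/meta lines and blank lines are emitted as-is.
        if line.startswith("#") or not line.strip():
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            yield line  # not a complete record; leave it alone
            continue
        chrom, pos, record_id = fields[0], fields[1], fields[2]
        tag = f"src_{chrom}_{pos}"
        fields[2] = tag if record_id == "." else f"{record_id};{tag}"
        # Records are always written back with a trailing newline.
        yield "\t".join(fields) + "\n"


if __name__ == "__main__":
    import sys

    # Usage (assumed): python tag_ids.py original.vcf > tagged.vcf
    with open(sys.argv[1]) as handle:
        sys.stdout.writelines(tag_source_positions(handle))
```

With the tagged file as liftover input, the ID of the record at ch01 751408 would then show which original position it came from, provided the ID survives the liftover.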