Open gnxsf opened 2 months ago
The problem is that the VCF format allows some flexibility. My understanding is that * is not a valid genotype in a VCF file (please correct me if I am wrong). Does it mean a complete deletion of the REF allele?
Taken from the VCFv4.2 format specification found here:
The ‘*’ allele is reserved to indicate that the allele is missing due to a upstream deletion. If there are no alternative alleles, then the missing value should be used.
Essentially, the asterisk is used at positions where some samples have that position deleted but other samples have an alternate allele.
Did you use the most recent version (v0.7.1)? I just checked the code, and it should be able to handle the "*" allele (i.e., the asterisk will be kept as is).
It looks like the Docker container I had pulled from the Amazon public repository was indeed using an older version. I built a new Docker container using the most recent version, which did solve the "*" replacement issue. However, I'm still getting the same "Duplicate allele added to VariantContext: A" error when I try to validate the VCF file. I believe this is the offending line:
ch01 751408 . A *,A
Is it expected behavior to have the reference and alternate allele be the same? Is there a way to know what the source position is for this variant in the original file? That would help with troubleshooting.
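One possible way to keep track of the source position is to write it into the ID column before running the liftover. The sketch below is only an illustration: the function name `tag_original_position` is hypothetical, it expects tab-separated VCF lines, and it assumes the liftover step carries the ID column through unchanged, which should be checked on a small test file first.

```python
def tag_original_position(line: str) -> str:
    """Return a VCF line with its pre-liftover CHROM:POS added to the ID column.

    Hypothetical helper: assumes tab-separated fields and that the
    liftover tool leaves the ID column untouched.
    """
    if line.startswith("#"):
        return line  # header lines pass through unchanged
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid = fields[0], fields[1], fields[2]
    tag = f"{chrom}:{pos}"
    # Replace a missing ID ("."), otherwise append to the existing ID list.
    fields[2] = tag if vid == "." else f"{vid};{tag}"
    return "\t".join(fields) + "\n"
```

Applying this to every line of the input VCF before liftover means each lifted record still shows where it came from, so an offending line in the output can be traced back to the original file.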
The CrossMap.py vcf command seems to replace the "*" genotype (indicating a deletion in one or more samples) with a nucleotide sequence. Here are the #CHROM POS ID REF ALT columns of the problematic position before liftover:
Here is the same position after liftover:
You can see that the "*" genotype has been replaced by "A". When running gatk ValidateVariants to validate the VCF, this results in the following error:
I realize this is not a lot to go on, but the data is proprietary, so I can't share the VCFs to make this bug reproducible. I'm curious whether this is a known issue, or if anyone has a suggestion on how to get around this problem.
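To narrow down which records are affected without sharing the data, one option is to scan the lifted-over VCF for lines where the REF allele also appears in ALT, or where an ALT allele is listed twice. This is a minimal sketch, not part of CrossMap or GATK: the function name `duplicate_allele_records` is hypothetical, it expects tab-separated lines, and it assumes that a repeated allele is the condition behind the validation error.

```python
def duplicate_allele_records(lines):
    """Yield (chrom, pos, ref, alts) for records with a repeated allele.

    Hypothetical helper: a record is reported when REF also appears in
    ALT, or when the same ALT allele is listed more than once.
    """
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref = fields[0], fields[1], fields[3].upper()
        alts = [allele.upper() for allele in fields[4].split(",")]
        alleles = [ref] + alts
        if len(set(alleles)) != len(alleles):
            yield chrom, int(pos), ref, alts
```

It can be run over an open file handle, for example `for rec in duplicate_allele_records(open("lifted.vcf")): print(rec)`, where `lifted.vcf` is a placeholder file name.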