PVlasov93 closed this 2 months ago
The VCF specification (https://samtools.github.io/hts-specs/VCFv4.3.pdf) states that such tag names are invalid:
INFO keys must match the regular expression ^([A-Za-z_][0-9A-Za-z_.]*|1000G)$, please note that “1000G”
is allowed as a special legacy value. Duplicate keys are not allowed. Arbitrary keys are permitted, although
those listed in Table 1 and described below are reserved (albeit optional).
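To see how this rule applies to the tag in the error message, here is a minimal sketch that tests a few keys against the pattern, assuming the underscore form of the regular expression as given in the VCFv4.3 specification. The function name is_valid_info_key is a made-up helper for illustration, not something bcftools or htslib provides.

```shell
# Check a single INFO key against the VCFv4.3 pattern.
# is_valid_info_key is a hypothetical helper name, not part of bcftools.
is_valid_info_key() {
    printf '%s\n' "$1" | grep -Eq '^([A-Za-z_][0-9A-Za-z_.]*|1000G)$'
}

# Try a few keys, including the one from the error message.
for key in DP 1000G 1000gALT x1000gALT; do
    if is_valid_info_key "$key"; then
        echo "$key: valid"
    else
        echo "$key: invalid"
    fi
done
```

Only 1000gALT fails: it starts with a digit and is not the exact legacy value 1000G, whereas a renamed key such as x1000gALT passes.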
It'd be best to notify the producer of these VCFs so that they can fix their programs. What you can do on your end is rename the tags using standard Unix commands, for example:
zless broken.vcf.gz | sed 's,1000g,x1000g,g' | gzip -c > fixed.vcf.gz
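To check what that substitution does before running it on a whole file, here is a small sketch that applies the same sed expression to two made-up lines, a header definition and the INFO field of a record. The function name rename_1000g_tags and the sample lines are illustrative assumptions, not taken from the actual Denisovan VCFs.

```shell
# Same substitution as in the command above, wrapped in a function that
# filters stdin to stdout. rename_1000g_tags is a hypothetical helper name.
rename_1000g_tags() {
    sed 's,1000g,x1000g,g'
}

# Two made-up lines: a header definition and the INFO field of a record.
printf '%s\n' \
    '##INFO=<ID=1000gALT,Number=1,Type=String,Description="example">' \
    '1000gALT=A;DP=30' \
    | rename_1000g_tags
```

The header definition and the record are renamed consistently, which is what matters for parsing. Note that the substitution is a plain text replacement, so it would also change any other occurrence of the lowercase string 1000g in the file, for example inside a Description. If the fixed file is going to be indexed afterwards, it would likely need to be compressed with bgzip rather than plain gzip.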
I've been trying to use bcftools view to select sequences from a few VCF files (specifically, the Denisovan genome VCFs from the Max Planck Institute plus a few others from UCSC), and each of those files produces the following error: Invalid tag name: "1000gALT". The error message on our lab's computer had something about htslib in a set of brackets, while the same steps on Galaxy gave the same error with a different name in the brackets, so I don't think it's an issue with our computer, but I may be wrong. The message mentions 1000g, which may refer to the 1000 Genomes Project, but the 1000 Genomes files never gave me this error when I worked with them.
I spoke with somebody from a bioinformatics lab in our department who has more experience working with VCF files, and she was not sure what caused the error either, beyond the fact that the files seemed to be formatted strangely.
Is this a known error? Is there a set of steps I need to take to fix it?