Closed mashu closed 1 year ago
Thanks for reporting this issue. Do you have a link to the file so I can try to reproduce?
This standard file https://ftp.ensembl.org/pub/release-109/variation/vcf/homo_sapiens/1000GENOMES-phase_3.vcf.gz It reads fine with bcftools and also I could read records with pyvcf (in Python), so file looks correct.
This is one of the lines causing problems:
1 20086175 rs387906795 G A,T . . dbSNP_154;TSA=SNV;E_Freq;E_1000G;E_Cited;E_Phenotype_or_Disease;E_ExAC;E_TOPMed;E_gnomAD;CLIN_uncertain_significance;;MA=T;MAF=0.0002;MAC=1;AA=G;AFR=0,0;AMR=0,0;EAS=0,0;EUR=0,0;SAS=0,0.001
In particular, it has CLIN_uncertain_significance;;MA=T
, that is, a double semicolon between two keys.
I would say that this is not allowed according to the VCF specification (see https://samtools.github.io/hts-specs/ for different versions of the spec) and I'm thus reluctant to change the parser.
It's unfortunate that 1000 Genomes provide files that do not follow the spec.
Closing this issue since no change is currently planned. Feel free to reopen/comment if you want to continue the discussion.
Thank you for developing this parser. Unfortunately, I encountered a potentially related problem, while parsing the following 1k genomes dataset.
"ERROR: VariantCallFormat.Reader file format error on line 254 ~>";esv3585"
Do you happen to know an easy workaround which does not involve changing the parser?
This is another issue, this time it's about the ID column.
In the file, there are lines like this one:
1 2397655 .;esv3585028 C ...
i.e. the ID looks like this .;esv3585028
.
It's a weird ID, because .
means missing value, but then the list has one more entry and that one has a value.
The correct way to represent this would be to just keep esv3585028
.
It's a kind of a grey area when I read the VCF spec, but my interpretation is that this shouldn't be allowed.
My suggestion would be to preprocess the file (e.g. using some linux tool like awk or sed if you are familiar with those) to remove .;
from the start of IDs.
Would that work for you?
Thank you for helping a non-expert understand the issue.
I planned to use VCFTools.jl convert_gt, but circumvented this problem with the help of R, as I had no spare disk space left to pre-process these files.
Thank you for your time!