1000Genomes VCF not parsing

mashu commented 1 year ago

shell> ls
1000GENOMES-phase_3.vcf.gz  Manifest.toml  Project.toml

julia> reader = VCF.Reader(GzipDecompressorStream(open("1000GENOMES-phase_3.vcf.gz")))
VariantCallFormat.Reader(BioCore.Ragel.State{BufferedStreams.BufferedInputStream{TranscodingStreams.TranscodingStream{GzipDecompressor, IOStream}}}(BufferedStreams.BufferedInputStream{TranscodingStreams.TranscodingStream{GzipDecompressor, IOStream}}(<128.0 KiB buffer, 95% filled, data immobilized>), 1, 68, false), VariantCallFormat.Header:
  metainfo tags: fileformat fileDate source reference INFO contig
     sample IDs: )

julia> for record in reader
       end
ERROR: VariantCallFormat.Reader file format error on line 68 ~>";MA=T;MA"
Stacktrace:
 [1] error(::Type, ::String, ::Int64, ::String, ::String)
   @ Base ./error.jl:44
 [2] _read!(reader::VariantCallFormat.Reader, state::BioCore.Ragel.State{BufferedStreams.BufferedInputStream{TranscodingStreams.TranscodingStream{GzipDecompressor, IOStream}}}, record::VariantCallFormat.Record)
   @ VariantCallFormat ~/.julia/packages/BioCore/YBJvb/src/ReaderHelper.jl:164
 [3] read!(reader::VariantCallFormat.Reader, record::VariantCallFormat.Record)
   @ VariantCallFormat ~/.julia/packages/BioCore/YBJvb/src/ReaderHelper.jl:134
 [4] tryread!(reader::VariantCallFormat.Reader, output::VariantCallFormat.Record)
   @ BioCore.IO ~/.julia/packages/BioCore/YBJvb/src/IO.jl:73
 [5] iterate(reader::VariantCallFormat.Reader, nextone::VariantCallFormat.Record)
   @ BioCore.IO ~/.julia/packages/BioCore/YBJvb/src/IO.jl:84
 [6] top-level scope
   @ ./REPL[10]:2

julia>

rasmushenningsson commented 1 year ago

Thanks for reporting this issue. Do you have a link to the file so I can try to reproduce?

mashu commented 1 year ago

This standard file https://ftp.ensembl.org/pub/release-109/variation/vcf/homo_sapiens/1000GENOMES-phase_3.vcf.gz It reads fine with bcftools and also I could read records with pyvcf (in Python), so file looks correct.

rasmushenningsson commented 1 year ago

This is one of the lines causing problems:

1   20086175    rs387906795 G   A,T .   .   dbSNP_154;TSA=SNV;E_Freq;E_1000G;E_Cited;E_Phenotype_or_Disease;E_ExAC;E_TOPMed;E_gnomAD;CLIN_uncertain_significance;;MA=T;MAF=0.0002;MAC=1;AA=G;AFR=0,0;AMR=0,0;EAS=0,0;EUR=0,0;SAS=0,0.001

In particular, it has CLIN_uncertain_significance;;MA=T, that is, a double semicolon between two keys.

I would say that this is not allowed according to the VCF specification (see https://samtools.github.io/hts-specs/ for different versions of the spec) and I'm thus reluctant to change the parser.

It's unfortunate that 1000 Genomes provide files that do not follow the spec.

rasmushenningsson commented 1 year ago

Closing this issue since no change is currently planned. Feel free to reopen/comment if you want to continue the discussion.

sdall commented 1 year ago

Thank you for developing this parser. Unfortunately, I encountered a potentially related problem, while parsing the following 1k genomes dataset.

https://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

"ERROR: VariantCallFormat.Reader file format error on line 254 ~>";esv3585"

Do you happen to know an easy workaround which does not involve changing the parser?

rasmushenningsson commented 1 year ago

This is another issue, this time it's about the ID column.

In the file, there are lines like this one:

1   2397655 .;esv3585028    C   ...

i.e. the ID looks like this .;esv3585028.

It's a weird ID, because . means missing value, but then the list has one more entry and that one has a value. The correct way to represent this would be to just keep esv3585028.

It's a kind of a grey area when I read the VCF spec, but my interpretation is that this shouldn't be allowed.

My suggestion would be to preprocess the file (e.g. using some linux tool like awk or sed if you are familiar with those) to remove .; from the start of IDs. Would that work for you?

sdall commented 1 year ago

Thank you for helping a non-expert understand the issue.

I planned to use VCFTools.jl convert_gt, but circumvented this problem with the help of R, as I had no spare disk space left to pre-process these files.

Thank you for your time!

rasmushenningsson / VariantCallFormat.jl

1000Genomes VCF not parsing #10