rasmushenningsson / VariantCallFormat.jl

Read and write VCF and BCF files

Weird headers throwing error #11

Closed biona001 closed 1 year ago

biona001 commented 1 year ago

Hi,

I received some (G)VCF files in which header lines 16-19 look like this:

##GVCFBlock0-20=minGQ=0(inclusive),maxGQ=20(exclusive)
##GVCFBlock20-30=minGQ=20(inclusive),maxGQ=30(exclusive)
##GVCFBlock30-40=minGQ=30(inclusive),maxGQ=40(exclusive)
##GVCFBlock40-100=minGQ=40(inclusive),maxGQ=100(exclusive)

which causes the following error

using VariantCallFormat
file = "MH0289561.v1.1a091483-abb5-4bb4-b14a-b5e4046d0a84.rb.g.vcf"
reader = VCF.Reader(open(file))

ERROR: VariantCallFormat.Reader file format error on line 16
Stacktrace:
 [1] error(::String, ::Int64)
   @ Base ./error.jl:42
 [2] _readheader!(reader::VariantCallFormat.Reader, state::BioCore.Ragel.State{BufferedStreams.BufferedInputStream{IOStream}})
   @ VariantCallFormat ~/.julia/packages/BioCore/YBJvb/src/ReaderHelper.jl:106
 [3] readheader!(reader::VariantCallFormat.Reader)
   @ VariantCallFormat ~/.julia/packages/BioCore/YBJvb/src/ReaderHelper.jl:80
 [4] Reader
   @ ~/.julia/packages/VariantCallFormat/wT4q6/src/reader.jl:7 [inlined]
 [5] VariantCallFormat.Reader(input::IOStream)
   @ VariantCallFormat ~/.julia/packages/VariantCallFormat/wT4q6/src/reader.jl:20
 [6] top-level scope
   @ REPL[7]:1

However, the following works (after deleting the - from each key name):

##GVCFBlock020=minGQ=0(inclusive),maxGQ=20(exclusive)
##GVCFBlock2030=minGQ=20(inclusive),maxGQ=30(exclusive)
##GVCFBlock3040=minGQ=30(inclusive),maxGQ=40(exclusive)
##GVCFBlock40100=minGQ=40(inclusive),maxGQ=100(exclusive)

Could you consider changing the behavior of your package? That said, I'm not sure whether including the - in a header key violates the VCF spec (this is v4.2).

rasmushenningsson commented 1 year ago

Hi!

The specification does not explicitly state what characters are allowed for the keys. It seems reasonable to me to support arbitrary UTF8 keys (only disallowing = in the key name, since that separates the key from the value).
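As an illustration of the rule described above (the first = separates the key from the value, and any other character may appear in the key), here is a minimal Julia sketch. The function name `split_meta_line` is hypothetical, and this is not the parser VariantCallFormat.jl actually uses.

```julia
# Hypothetical helper for illustration only; not part of VariantCallFormat.jl.
# Splits a "##key=value" meta-information line at the FIRST '=' so that the
# value may itself contain further '=' characters.
function split_meta_line(line::AbstractString)
    startswith(line, "##") || throw(ArgumentError("not a meta-information line"))
    # "##" is two ASCII bytes, so index 3 is always a valid character start.
    parts = split(line[3:end], '='; limit=2)
    length(parts) == 2 || throw(ArgumentError("meta-information line has no '='"))
    return String(parts[1]), String(parts[2])
end

key, value = split_meta_line("##GVCFBlock0-20=minGQ=0(inclusive),maxGQ=20(exclusive)")
println(key)    # GVCFBlock0-20
println(value)  # minGQ=0(inclusive),maxGQ=20(exclusive)
```

Because only the first = is treated as a separator, keys containing -, + or non-ASCII characters pass through unchanged under this rule.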

For personal reasons, I don't have much time this week. But I hope to take a look at the implementation early next week.

rasmushenningsson commented 1 year ago

Hi, sorry for the slow response. I took a quick look at this, and solving it as generally as I want (full UTF8 support) is slightly trickier than I first imagined.

Probably a good step forward would be to support more characters (including -), but not aim for UTF8 yet. Would that still be useful for you?

biona001 commented 1 year ago

Hi, thanks for getting back! That'll definitely be useful!

Currently I just delete the extra - whenever I need to work with these files, but it might be a good idea to add support for at least - since I think GVCF files are rather common? Other people may encounter the same issue.
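For anyone stuck on a version without the fix, the manual workaround described above (deleting the - from the key of each header line) could be automated roughly as follows. This is a sketch under stated assumptions: `strip_key_dashes` and `sanitized_vcf` are hypothetical names, not part of VariantCallFormat.jl, and the whole file is copied into memory, which may be impractical for large gVCF files.

```julia
# Hypothetical helpers for illustration only; not part of VariantCallFormat.jl.

# Removes '-' from the key of a "##key=value" meta line. Everything after the
# first '=' is left untouched, and non-meta lines are returned unchanged.
function strip_key_dashes(line::AbstractString)
    startswith(line, "##") || return String(line)
    parts = split(line, '='; limit=2)
    length(parts) == 2 || return String(line)
    return replace(parts[1], "-" => "") * "=" * parts[2]
end

# Copies the file at `path` into an in-memory buffer with sanitized header
# keys. Assumption: the whole file fits in memory.
function sanitized_vcf(path::AbstractString)
    io = IOBuffer()
    for line in eachline(path)
        println(io, strip_key_dashes(line))
    end
    seekstart(io)
    return io
end
```

Assuming `VCF.Reader` accepts any `IO` object and not only a file stream (not verified here), the buffer could then be passed in place of the opened file, e.g. `reader = VCF.Reader(sanitized_vcf(file))`. Note that the renamed keys (such as `GVCFBlock020`) no longer match the original file.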

tecosaur commented 1 year ago

For reference, I'm currently one of these "other people" experiencing this issue. Just adding - and + would be appreciated.

rasmushenningsson commented 1 year ago

Sorry for keeping you all waiting. I have not forgotten about this. Just overwhelmed with other work. I hope I can fix it soon.

rasmushenningsson commented 1 year ago

I have added support for - and + in header tags and dict keys. A new release (v0.5.5) is currently being registered.

If you run into any problems, please reopen this issue.
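Once the release is available in the registry, picking it up from the Julia package manager should look roughly like the sketch below. It assumes the package is installed in the active environment and that no compatibility bounds hold it back at an older version.

```julia
using Pkg

# Fetch the newest registered release of the package.
Pkg.update("VariantCallFormat")

# Print the installed version to confirm it is v0.5.5 or later.
Pkg.status("VariantCallFormat")
```

After updating, the original header lines containing - should no longer need to be edited by hand.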