jamescasbon / PyVCF

A Variant Call Format reader for Python.
http://pyvcf.readthedocs.org/en/latest/index.html

FILTER line is malformed #337

Open myourshaw opened 2 years ago

myourshaw commented 2 years ago

Background: In FILTER, multiple filters should be separated by semicolons. The widely used, but not actively maintained, VarScan2 genomic variant caller uses commas instead. Moreover, VarScan2 does not add ##FILTER metadata for most of its filters. Picard FixVcfHeader can be used to fix missing FILTER metadata. A "fixed" metadata row will look like: ##FILTER=<ID="RefAvgRL,VarAvgRL",Description="Missing description: this FILTER line was added by Picard's FixVCFHeader">

Error: PyVCF fails with:

`
Traceback (most recent call last):
  File "/mnt/hdd/dnanexus/scripts_local/compare_vcfs.py", line 236, in <module>
    main()
  File "/mnt/hdd/dnanexus/scripts_local/compare_vcfs.py", line 232, in main
    run(parser.parse_args())
  File "/mnt/hdd/dnanexus/scripts_local/compare_vcfs.py", line 166, in run
    df_1 = vcf_to_dataframe(args.vcf_1)
  File "/mnt/hdd/dnanexus/scripts_local/compare_vcfs.py", line 74, in vcf_to_dataframe
    vcf_reader = vcf.Reader(open(vcf_file, "r"))
  File "/home/myourshaw/.venv/dnanexus/lib/python3.10/site-packages/vcf/parser.py", line 300, in __init__
    self._parse_metainfo()
  File "/home/myourshaw/.venv/dnanexus/lib/python3.10/site-packages/vcf/parser.py", line 326, in _parse_metainfo
    key, val = parser.read_filter(line)
  File "/home/myourshaw/.venv/dnanexus/lib/python3.10/site-packages/vcf/parser.py", line 142, in read_filter
    raise SyntaxError(
SyntaxError: One of the FILTER lines is malformed: ##FILTER=<ID="RefAvgRL,VarAvgRL",Description="Missing description: this FILTER line was added by Picard's FixVCFHeader">
`

Issue: It might be more robust for PyVCF to treat a filter with commas as just one big filter name, as does Picard FixVcfHeader. Instead of raising an exception, accept metadata with a filter ID inside double quotes and containing commas, e.g., ID="RefAvgRL,VarAvgRL". Similarly, in the data, treat a FILTER value like RefAvgRL,VarAvgRL as a single entity. I think this solution is consistent with the VCF 4.2 spec for a filter name: String, no whitespace or semicolons permitted.
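For the data rows, the behaviour proposed above could look roughly like the sketch below. `split_filter_column` is a hypothetical helper written for illustration, not part of PyVCF's API; it assumes the usual convention that `.` means "no filters applied" and `PASS` means "passed all filters".

```python
def split_filter_column(value):
    """Split a VCF FILTER column value into a list of filter names.

    Hypothetical helper (not PyVCF's API). Only semicolons separate
    filters, so a value such as "RefAvgRL,VarAvgRL" stays one name.
    """
    if value in (".", ""):
        return None          # missing value: filters were not applied
    if value == "PASS":
        return []            # passed every filter
    return value.split(";")  # commas are left untouched


if __name__ == "__main__":
    print(split_filter_column("RefAvgRL,VarAvgRL"))
    print(split_filter_column("q10;s50"))
```

With this approach a VarScan2-style value is carried through as a single filter name, which matches the quoted ID that Picard FixVcfHeader writes into the header.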

Possible pull request: This hack (changing [^,]+ to .+) worked to get me through an urgent analysis, but it may not be the best solution. At parser.py line 142:

`
self.filter_pattern = re.compile(r'''\#\#FILTER=<
    ID=(?P<id>.+),\s*
    Description="(?P<desc>[^"]*)"
    >''', re.VERBOSE)
`
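A slightly narrower alternative to `.+` might be to accept either a double-quoted ID (commas allowed) or a bare ID without commas. The sketch below is standalone and does not modify PyVCF; the names `STRICT`, `RELAXED`, and `LINE` are made up for this example, and the strict pattern is assumed to mirror the one in parser.py.

```python
import re

# Assumed to mirror the pattern in parser.py: the ID may not contain a comma.
STRICT = re.compile(r'''\#\#FILTER=<
    ID=(?P<id>[^,]+),\s*
    Description="(?P<desc>[^"]*)"
    >''', re.VERBOSE)

# Relaxed sketch: the ID is either a double-quoted string (commas allowed)
# or a bare token containing neither commas nor quotes.
RELAXED = re.compile(r'''\#\#FILTER=<
    ID=(?P<id>"[^"]*"|[^,"]+),\s*
    Description="(?P<desc>[^"]*)"
    >''', re.VERBOSE)

# The header line from the traceback in this issue.
LINE = ('##FILTER=<ID="RefAvgRL,VarAvgRL",Description="Missing description: '
        "this FILTER line was added by Picard's FixVCFHeader\">")

if __name__ == "__main__":
    print("strict matches:", STRICT.match(LINE) is not None)
    match = RELAXED.match(LINE)
    if match:
        # Strip the surrounding quotes to get the bare filter name.
        print("relaxed id:", match.group("id").strip('"'))
```

Compared with `.+`, the quoted alternative cannot run past the closing quote of the ID, so an unusual Description is less likely to be swallowed into the ID group. Whether quoted IDs should be accepted at all is a design decision for the maintainers.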

gdurif commented 2 years ago

I get the same problem. Any update on this issue?

I hoped switching to PyVCF3 (cf. #335) would solve the issue, but apparently not.

My bad, in my case the problem originated from a tag Source in a FILTER field:

##FILTER=<ID=xxx,Description="yyy",Source="zzz">

which is an INFO field tag according to https://samtools.github.io/hts-specs/, not a FILTER field tag.
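For headers like this one, a more tolerant approach would be to read every key=value pair in the structured line instead of requiring exactly ID and Description. The following is only a sketch: `parse_structured_meta` and `PAIR` are hypothetical names, not PyVCF or PyVCF3 functions, and the parser assumes values are either double-quoted or free of commas.

```python
import re

# A key followed by either a double-quoted value or a bare value without commas.
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|[^,]*)')


def parse_structured_meta(line):
    """Parse '##KEY=<k1=v1,k2="v2",...>' into (KEY, dict of fields).

    Hypothetical helper: unknown keys such as Source are kept, not rejected.
    """
    line = line.strip()
    key, _, body = line.lstrip("#").partition("=")
    if not (body.startswith("<") and body.endswith(">")):
        raise ValueError(f"not a structured meta line: {line!r}")
    # Drop the angle brackets, then collect every key=value pair.
    fields = {k: v.strip('"') for k, v in PAIR.findall(body[1:-1])}
    return key, fields


if __name__ == "__main__":
    print(parse_structured_meta('##FILTER=<ID=xxx,Description="yyy",Source="zzz">'))
```

Because quoted values are consumed as a unit, the same helper also reads an ID such as "RefAvgRL,VarAvgRL" as one value.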

dridk commented 2 years ago

Please comment on this issue at PyVCF3: https://github.com/dridk/PyVCF3/issues/1