sgkit-dev / bio2zarr

Convert bioinformatics file formats to Zarr
Apache License 2.0

Parsing fails for VCF with GT in header but not in FORMAT field #267

Closed tomwhite closed 2 weeks ago

tomwhite commented 3 weeks ago

Here is an example of a failing VCF (generated by hypothesis-vcf while testing https://github.com/sgkit-dev/hypothesis-vcf/pull/3):

##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##source=hypothesis-vcf-0.1.dev4+g1f4f9c2
##contig=<ID=0>
##INFO=<ID=A0,Type=Integer,Number=1,Description="Generated field">
##INFO=<ID=B0,Type=Integer,Number=1,Description="Generated field">
##FORMAT=<ID=GT,Type=String,Number=1,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
0   1   .   A   .   .   .   A0=0

The GT field is defined in the header, but it's not used: the file has no FORMAT column and no sample columns. Note that this is the same case tested by no_genotypes_with_gt_header.vcf in sgkit, which was fixed in https://github.com/sgkit-dev/sgkit/pull/621.
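
For illustration, the condition described above (GT declared in the header, but no FORMAT column in the `#CHROM` line) can be detected with a plain-text scan. This is only a rough sketch: the function name `gt_declared_but_unused` is hypothetical, it is not part of bio2zarr or sgkit, and it assumes the `##FORMAT` line lists `ID=GT` first, as in the example above.

```python
def gt_declared_but_unused(vcf_text: str) -> bool:
    """Return True if the header declares FORMAT/GT but the
    #CHROM line has no FORMAT column (hypothetical helper)."""
    declares_gt = False
    has_format_column = False
    for line in vcf_text.splitlines():
        if line.startswith("##FORMAT=<ID=GT,"):
            declares_gt = True
        elif line.startswith("#CHROM"):
            # Column names are whitespace-separated in the header line.
            has_format_column = "FORMAT" in line.split()
            break
    return declares_gt and not has_format_column


# A cut-down version of the failing VCF from this issue.
EXAMPLE_VCF = "\n".join(
    [
        "##fileformat=VCFv4.3",
        "##contig=<ID=0>",
        '##INFO=<ID=A0,Type=Integer,Number=1,Description="Generated field">',
        '##FORMAT=<ID=GT,Type=String,Number=1,Description="Genotype">',
        "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]),
        "\t".join(["0", "1", ".", "A", ".", ".", ".", "A0=0"]),
    ]
)

is_affected = gt_declared_but_unused(EXAMPLE_VCF)
```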

The failure is:

Traceback (most recent call last):
  File "/Users/tom/miniconda3/envs/bio2zarr-3.10/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/Users/tom/workspace/bio2zarr/bio2zarr/vcf2zarr/icf.py", line 1082, in process_partition
    tcw.append("FORMAT/GT", variant.genotype.array())
AttributeError: 'NoneType' object has no attribute 'array'
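
The traceback shows `variant.genotype` being `None` when `.array()` is called on it. A minimal sketch of guarding against that follows. This is not necessarily how the issue was fixed in bio2zarr: `genotype_array_or_none`, `FakeVariant` and `FakeGenotype` are hypothetical names, and the two classes are stand-ins that only mimic the `genotype` attribute and `array()` method seen in the traceback.

```python
class FakeGenotype:
    """Stand-in for a genotype object that exposes array()."""

    def __init__(self, calls):
        self._calls = calls

    def array(self):
        return self._calls


class FakeVariant:
    """Stand-in for a parsed VCF record; genotype is None
    when the record has no FORMAT/sample columns."""

    def __init__(self, genotype=None):
        self.genotype = genotype


def genotype_array_or_none(variant):
    """Return the genotype array, or None if the record has no genotypes."""
    genotype = variant.genotype
    if genotype is None:
        return None
    return genotype.array()


# A record without genotypes no longer raises AttributeError.
missing = genotype_array_or_none(FakeVariant())
present = genotype_array_or_none(FakeVariant(FakeGenotype([[0, 1, False]])))
```

A caller would then skip appending to the `FORMAT/GT` column (or write a fill value) when the helper returns `None`; which of the two is appropriate depends on how the converter treats a declared but absent field.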
jeromekelleher commented 3 weeks ago

Nice, good catch hypothesis