sgkit-dev / sgkit

Scalable genetics toolkit
https://sgkit-dev.github.io/sgkit

`vcf_to_zarr` raises error instead of truncating INFO fields of length `R` #1210

Open timothymillar opened 8 months ago

timothymillar commented 8 months ago

In `vcf_to_zarr`, specifying `max_alt_alleles` smaller than the actual maximum number of alternate alleles is meant to truncate the alleles dimension (with a warning). However, if there is an INFO field with length `R` (one value for each allele, including the reference) then this results in an `IndexError`. Here is a minimal example:

example.vcf with 2 alternate alleles:

```
##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20240307
##source=None
##contig=<ID=chr1,length=21898217>
##INFO=<ID=AFP,Number=R,Type=Float,Description="Posterior mean allele frequencies">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO            FORMAT  SAMPLE0
chr1    1       .       A       T,G     .       .       AFP=0.8,0.1,0.1 GT      0/1
```

convert.py:

```
from sgkit.io.vcf import vcf_to_zarr

vcf_to_zarr(
    input="example.vcf",
    output="example.zarr",
    max_alt_alleles=1,  # truncate to one alternate allele
    fields=[
        "INFO/AFP",
        "FORMAT/GT",
    ],
)
```
traceback:
```
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[8], line 1
----> 1 vcf_to_zarr(
      2     input="example.vcf",
      3     output="example.zarr",
      4     max_alt_alleles=1,
      5     fields=[
      6         "INFO/AFP",
      7         "FORMAT/GT",
      8     ]
      9 )

File /path/to/sgkit/sgkit/io/vcf/vcf_reader.py:1002, in vcf_to_zarr(input, output, target_part_size, regions, chunk_length, chunk_width, compressor, encoding, temp_chunk_length, tempdir, tempdir_storage_options, ploidy, mixed_ploidy, truncate_calls, max_alt_alleles, fields, exclude_fields, field_defs, read_chunk_length, retain_temp_files)
    966 sequential_function = functools.partial(
    967     vcf_to_zarr_sequential,
    968     output=output,
   (...)
    980     field_defs=field_defs,
    981 )
    982 parallel_function = functools.partial(
    983     vcf_to_zarr_parallel,
    984     output=output,
   (...)
   1000     retain_temp_files=retain_temp_files,
   1001 )
-> 1002 process_vcfs(
   1003     input,
   1004     sequential_function,
   1005     parallel_function,
   1006     regions=regions,
   1007     target_part_size=target_part_size,
   1008 )
   1010 # Issue a warning if max_alt_alleles caused data to be dropped
   1011 ds = zarr.open(output)

File /path/to/sgkit/sgkit/io/vcf/vcf_reader.py:1328, in process_vcfs(input, sequential_function, parallel_function, regions, target_part_size)
   1320     regions = [
   1321         partition_into_regions(input, target_part_size=target_part_size)
   1322         for input in inputs
   1323     ]
   1325 if (isinstance(input, str) or isinstance(input, Path)) and (
   1326     regions is None or isinstance(regions, str)
   1327 ):
-> 1328     return sequential_function(input=input, region=regions)
   1329 else:
   1330     return parallel_function(input=input, regions=regions)

File /path/to/sgkit/sgkit/io/vcf/vcf_reader.py:516, in vcf_to_zarr_sequential(input, output, region, chunk_length, chunk_width, compressor, encoding, ploidy, mixed_ploidy, truncate_calls, max_alt_alleles, fields, exclude_fields, field_defs, read_chunk_length)
    514         raise ValueError(f"Filter '{f}' is not defined in the header.")
    515     for field_handler in field_handlers:
--> 516         field_handler.add_variant(i, variant)
    518 # Truncate np arrays (if last chunk is smaller than read_chunk_length)
    519 if i + 1 < read_chunk_length:

File /path/to/sgkit/sgkit/io/vcf/vcf_reader.py:293, in InfoAndFormatFieldHandler.add_variant(self, i, variant)
    291 try:
    292     for j, v in enumerate(val):
--> 293         self.array[i, j] = (
    294             v if v is not None else self.missing_value
    295         )
    296 except TypeError:  # val is a scalar
    297     self.array[i, 0] = val

IndexError: index 2 is out of bounds for axis 1 with size 2
```
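
For reference, here is a minimal sketch of what a fix might look like, modelled on the handler code in the traceback above. The function name `add_variant_values` and the bare `array`/`missing_value` arguments are stand-ins for the real `InfoAndFormatFieldHandler` attributes; the idea is just to clip `val` to the allocated allele dimension so the excess values are dropped (which should presumably also feed into the existing `max_alt_alleles` truncation warning):

```
import numpy as np

def add_variant_values(array, i, val, missing_value=np.nan):
    # Sketch of a possible fix for InfoAndFormatFieldHandler.add_variant:
    # clip val to the allocated allele dimension so that values for
    # truncated alleles are dropped instead of overflowing the array.
    try:
        for j, v in enumerate(val[: array.shape[1]]):
            array[i, j] = v if v is not None else missing_value
    except TypeError:  # val is a scalar (slicing a scalar also raises TypeError)
        array[i, 0] = val

# With max_alt_alleles=1 the array has 2 allele columns (REF + 1 ALT),
# but the record above carries 3 AFP values; the third is now dropped.
arr = np.full((1, 2), np.nan)
add_variant_values(arr, 0, (0.8, 0.1, 0.1))
print(arr)  # [[0.8 0.1]]
```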
timothymillar commented 8 months ago

There is possibly a similar bug with a FORMAT field of length `R`. I'm getting some random crashes, which may be due to out-of-bounds memory access in Numba code.
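
For example, I'd expect extending the VCF above with a hypothetical length-`R` FORMAT field (e.g. `AD`-style allelic depths) to hit the same truncation problem:

```
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO            FORMAT  SAMPLE0
chr1    1       .       A       T,G     .       .       AFP=0.8,0.1,0.1 GT:AD   0/1:10,5,0
```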

jeromekelleher commented 8 months ago

> There is possibly a similar bug with a FORMAT field of length `R`. I'm getting some random crashes, which may be due to out-of-bounds memory access in Numba code.

That's odd - the VCF parsing code doesn't use any Numba AFAIK.

timothymillar commented 8 months ago

With an updated (slightly larger) dataset, including the length-`R` FORMAT field resulted in `ValueError: Codec does not support buffers of > 2147483647 bytes`. This led me to https://github.com/zarr-developers/zarr-python/issues/487, and with smaller chunk sizes I was able to convert the VCF. I'm not sure why it was crashing earlier.
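
For anyone hitting the same error: the workaround was simply smaller chunks, something along these lines (the exact values here are illustrative, not the ones I used):

```
from sgkit.io.vcf import vcf_to_zarr

# Smaller chunks keep each compressed buffer under the ~2 GiB codec limit
# discussed in zarr-developers/zarr-python#487; the values are illustrative.
vcf_to_zarr(
    input="example.vcf",
    output="example.zarr",
    chunk_length=10_000,  # variants per chunk
    chunk_width=1_000,    # samples per chunk
    fields=["INFO/AFP", "FORMAT/GT"],
)
```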