timothymillar opened this issue 8 months ago (status: Open)
Possibly a similar bug with a FORMAT field which is length R. Getting some random crashes which may be due to out-of-bounds memory access in Numba code.
That's odd - the vcf parsing code doesn't use any numba AFAIK.
With an updated dataset (slightly larger), including the length-R FORMAT field resulted in `ValueError: Codec does not support buffers of > 2147483647 bytes`. This led me to https://github.com/zarr-developers/zarr-python/issues/487, and with smaller chunk sizes I was able to convert the VCF. I'm not sure why it was crashing earlier.
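(For reference, chunk sizes are controlled by the `chunk_length` and `chunk_width` parameters visible in the `vcf_to_zarr` signature in the traceback below. A sketch of that workaround, with illustrative values, a hypothetical input file, and an assumed import path:)

```python
from sgkit.io.vcf import vcf_to_zarr  # import path assumed

# Smaller chunks keep each compressed buffer under the
# 2147483647-byte limit from zarr-developers/zarr-python#487.
vcf_to_zarr(
    input="large_dataset.vcf.gz",  # hypothetical input
    output="large_dataset.zarr",
    chunk_length=1_000,  # variants per chunk
    chunk_width=1_000,   # samples per chunk
)
```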
In `vcf_to_zarr`, specifying `max_alt_alleles` as smaller than the actual maximum is meant to truncate the alleles dimension (with a warning). However, if there is an INFO field with length R (one value for each allele) then this results in an `IndexError`. Here is a minimal example.

example.vcf with 2 alternate alleles:
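The original file isn't reproduced here; a minimal VCF along these lines (header and values are illustrative, with `AFP` declared as `Number=R`) exercises the same code path:

```
##fileformat=VCFv4.3
##contig=<ID=CHR1>
##INFO=<ID=AFP,Number=R,Type=Float,Description="Allele frequencies">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMPLE1
CHR1	10	.	A	C,T	.	.	AFP=0.7,0.2,0.1	GT	0/1
```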
convert.py:
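The script body isn't shown above, but it can be read back from the call at the top of the traceback (only the import path is assumed):

```python
from sgkit.io.vcf import vcf_to_zarr  # import path assumed

vcf_to_zarr(
    input="example.vcf",
    output="example.zarr",
    max_alt_alleles=1,  # smaller than the 2 alternate alleles in the file
    fields=[
        "INFO/AFP",
        "FORMAT/GT",
    ],
)
```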
traceback:
```
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[8], line 1
----> 1 vcf_to_zarr(
      2     input="example.vcf",
      3     output="example.zarr",
      4     max_alt_alleles=1,
      5     fields=[
      6         "INFO/AFP",
      7         "FORMAT/GT",
      8     ]
      9 )

File /path/to/sgkit/sgkit/io/vcf/vcf_reader.py:1002, in vcf_to_zarr(input, output, target_part_size, regions, chunk_length, chunk_width, compressor, encoding, temp_chunk_length, tempdir, tempdir_storage_options, ploidy, mixed_ploidy, truncate_calls, max_alt_alleles, fields, exclude_fields, field_defs, read_chunk_length, retain_temp_files)
    966 sequential_function = functools.partial(
    967     vcf_to_zarr_sequential,
    968     output=output,
   (...)
    980     field_defs=field_defs,
    981 )
    982 parallel_function = functools.partial(
    983     vcf_to_zarr_parallel,
    984     output=output,
   (...)
   1000     retain_temp_files=retain_temp_files,
   1001 )
-> 1002 process_vcfs(
   1003     input,
   1004     sequential_function,
   1005     parallel_function,
   1006     regions=regions,
   1007     target_part_size=target_part_size,
   1008 )
   1010 # Issue a warning if max_alt_alleles caused data to be dropped
   1011 ds = zarr.open(output)

File /path/to/sgkit/sgkit/io/vcf/vcf_reader.py:1328, in process_vcfs(input, sequential_function, parallel_function, regions, target_part_size)
   1320 regions = [
   1321     partition_into_regions(input, target_part_size=target_part_size)
   1322     for input in inputs
   1323 ]
   1325 if (isinstance(input, str) or isinstance(input, Path)) and (
   1326     regions is None or isinstance(regions, str)
   1327 ):
-> 1328     return sequential_function(input=input, region=regions)
   1329 else:
   1330     return parallel_function(input=input, regions=regions)

File /path/to/sgkit/sgkit/io/vcf/vcf_reader.py:516, in vcf_to_zarr_sequential(input, output, region, chunk_length, chunk_width, compressor, encoding, ploidy, mixed_ploidy, truncate_calls, max_alt_alleles, fields, exclude_fields, field_defs, read_chunk_length)
    514     raise ValueError(f"Filter '{f}' is not defined in the header.")
    515 for field_handler in field_handlers:
--> 516     field_handler.add_variant(i, variant)
    518 # Truncate np arrays (if last chunk is smaller than read_chunk_length)
    519 if i + 1 < read_chunk_length:

File /path/to/sgkit/sgkit/io/vcf/vcf_reader.py:293, in InfoAndFormatFieldHandler.add_variant(self, i, variant)
    291 try:
    292     for j, v in enumerate(val):
--> 293         self.array[i, j] = (
    294             v if v is not None else self.missing_value
    295         )
    296 except TypeError:  # val is a scalar
    297     self.array[i, 0] = val

IndexError: index 2 is out of bounds for axis 1 with size 2
```
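The final frame makes the cause visible: with `max_alt_alleles=1`, the array for a length-R field is allocated with width 2 (REF plus one ALT), but the variant still carries 3 values, so `j` reaches index 2. A minimal sketch of a guard, assuming the attribute names shown in the traceback, not the project's actual fix:

```python
# Sketch only: skip values for alleles that max_alt_alleles
# truncated, instead of indexing past the end of the array.
try:
    for j, v in enumerate(val):
        if j >= self.array.shape[1]:
            break  # allele was dropped by max_alt_alleles truncation
        self.array[i, j] = v if v is not None else self.missing_value
except TypeError:  # val is a scalar
    self.array[i, 0] = val
```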