sgkit-dev / bio2zarr

Convert bioinformatics file formats to Zarr
Apache License 2.0
26 stars 7 forks source link

error converting a vcf #217

Closed mufernando closed 4 months ago

mufernando commented 4 months ago

I have a 75Mb VCF that I tried converting using vcf2zarr CLI.

Following the instructions, I first created the intermediate ICF using

vcf2zarr explode chrY_mcc-746-kray-filtered.vcf.gz chrY_mcc-746-kray-filtered.exploded

which worked out fine.

Then, I tried converting to the final vcfzarr using encode:

vcf2zarr encode chrY_mcc-746-kray-filtered.exploded chrY_mcc-746-kray-filtered.zarr -vv

This is the verbose output I get https://gist.github.com/mufernando/805a39a636c71b21c1b25a8ff6417e49

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/users/rodrigmu/micromamba/envs/marmoset/bin/vcf2zarr", line 8, in <module>
    sys.exit(vcf2zarr_main())
             ^^^^^^^^^^^^^^^
  File "/home/users/rodrigmu/micromamba/envs/marmoset/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/rodrigmu/micromamba/envs/marmoset/lib/python3.12/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/home/users/rodrigmu/micromamba/envs/marmoset/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/rodrigmu/micromamba/envs/marmoset/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/rodrigmu/micromamba/envs/marmoset/lib/python3.12/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/rodrigmu/micromamba/envs/marmoset/lib/python3.12/site-packages/bio2zarr/cli.py", line 331, in encode
    vcf2zarr.encode(
  File "/home/users/rodrigmu/micromamba/envs/marmoset/lib/python3.12/site-packages/bio2zarr/vcf2zarr/vcz.py", line 934, in encode
    vzw.encode_all_partitions(
  File "/home/users/rodrigmu/micromamba/envs/marmoset/lib/python3.12/site-packages/bio2zarr/vcf2zarr/vcz.py", line 898, in encode_all_partitions
    with core.ParallelWorkManager(num_workers, progress_config) as pwm:
  File "/home/users/rodrigmu/micromamba/envs/marmoset/lib/python3.12/site-packages/bio2zarr/core.py", line 274, in __exit__
    wait_on_futures(self.futures)
  File "/home/users/rodrigmu/micromamba/envs/marmoset/lib/python3.12/site-packages/bio2zarr/core.py", line 97, in wait_on_futures
    raise exception
ValueError: Codec does not support buffers of > 2147483647 bytes

Ideas on what might be going on here? I'm happy to share the VCF.

jeromekelleher commented 4 months ago

Thanks for the bug report @mufernando!

I think the problem is with your call_PL field:

DEBUG Initialised <zarr.core.Array '/call_PL' (24739, 746, 66) int32>

Your PL field leads to an inner dimension of 66! :face_with_open_eyes_and_hand_over_mouth:

So, each variant chunk of this array is 10000 746 66 * 4 which is more than 2147483647. The immediate bug here then is that we're not giving a usable error message. More generally, PL fields are a major problem (#185 and linked discussion) but we know how they should be dealt with.

In the short term, I'd suggest creating a schema and dropping the call_PL field (assuming you're not using it):

vcf2zarr mkschema  chrY_mcc-746-kray-filtered.exploded > schema.json
... edit JSON to remove delete the call_PL field
vcf2zarr encode chrY_mcc-746-kray-filtered.exploded -s schema.json chrY_mcc-746-kray-filtered.zarr
mufernando commented 4 months ago

I confirmed removing the PL field it works! For future reference, you can remove it from the schema using jq:

vcf2zarr explode example.vcf.gz example.exploded
vcf2zarr mkschema example.exploded | jq 'del(.fields[] | select(.name == "call_PL"))' > example.json
vcf2zarr encode example.exploded example.zarr --schema example.json
jeromekelleher commented 4 months ago

I just merged a fix which should raise a more helpful error message @mufernando - would you mind trying it out on your data please? The message will need some links to as-yet unwritten docs about PL fields, but hopefully it's pointing people in the right direction.