Closed jeromekelleher closed 5 days ago
Fixing that problem is easy enough by setting copy=None. However, we get something more sinister later in the process:
(numpy-2-venv) jk@empire$ python3 -m bio2zarr vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/x.icf -p0
Scan: 100%|█████████████████████████████████████████████████████████████████| 1.00/1.00 [00:00<00:00, 52.7files/s]
Explode: 100%|███████████████████████████████████████████████████████████████████| 9.00/9.00 [00:00<00:00, 429vars/s]
(numpy-2-venv) jk@empire$ python3 -m bio2zarr vcf2zarr encode tmp/x.icf tmp/x.vcz -f -p0
Encode: 85%|███████████████████████████████████████████████████████████▊ | 792/927 [00:00<00:00, 7.81kB/s]Segmentation fault (core dumped)
(numpy-2-venv) jk@empire$ /usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
So, during the encode step we've got a segfault. Digging into this now.
Hmm, so setting copy=True above resolves this segfault (which appears to happen when accessing the genotype data). So, setting copy=True seems fine and dandy to me - I'm sure the perf difference is negligible.
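For anyone following along, here is a minimal sketch of how the `copy` argument behaves across NumPy versions. It is not the actual bio2zarr code (the call site isn't shown in this thread), and `to_owned_array` is a hypothetical helper name used only for illustration. It shows that `copy=True` means the same thing on NumPy 1.x and 2.x, while on NumPy 2 `copy=False` raises if a copy cannot be avoided.

```python
import numpy as np


def to_owned_array(values, dtype=None):
    """Hypothetical helper: return an array that owns its own data.

    copy=True always copies, and has the same meaning on NumPy 1.x and 2.x.
    """
    return np.array(values, dtype=dtype, copy=True)


source = np.arange(6, dtype=np.int8)
view = source[::2]            # shares memory with `source`
owned = to_owned_array(view)  # independent buffer

owned[0] = 99                 # does not touch `source`
assert source[0] == 0
assert not np.shares_memory(owned, source)

# On NumPy 2, copy=False means "never copy" and raises if a copy is required;
# copy=None means "copy only if needed". A list always needs a copy.
if int(np.__version__.split(".")[0]) >= 2:
    try:
        np.array([1, 2, 3], copy=False)
    except ValueError as err:
        print("NumPy 2 refused to copy:", err)
```

The extra copy costs one allocation per call, which fits the guess above that the performance difference should be small for this workload. That is an expectation, not something measured here.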
With #257 we should be basically set for numpy 2.0 and numpy 1.x compatibility. To close this issue we should add a CI job that explicitly installs numpy 2.x and runs the tests. Later, when numpy 2.0 becomes the default version we install (due to dependencies), we can switch this job to 1.x.
We're waiting on numpy 2.0 compatible wheels from msprime, so no point in making this CI job until they arrive.
Just waiting on numpy 2.0 wheels for msprime which should arrive in a few days, and we can then ship a numpy 2.0 compatible version.
This is blocking removal of VCF from sgkit (https://github.com/sgkit-dev/sgkit/pull/1264), since we are using vcztools there for some compatibility tests and one of the test environments runs on NumPy 2 (https://github.com/sgkit-dev/sgkit/actions/runs/11145218525).
It looks like msprime now has NumPy 2 wheels, so it should be enough to do a bio2zarr release.
I'm slightly reluctant to do a bio2zarr release before having a good look at the local alleles stuff. Can we point sgkit at the development version of bio2zarr for a while, keeping an issue tracking the fact we need to switch before next release?
> Can we point sgkit at the development version of bio2zarr for a while
I managed to do this just for the NumPy GitHub action workflow, so I think we're good here.
Some errors: