jim22k / sscdf

Binary sparse storage scheme for SuiteSparse::GraphBLAS utilizing netCDF4

zarr support #1

Open ivirshup opened 2 years ago

ivirshup commented 2 years ago

Hey @jim22k!

This is more of a comment than an actual issue or request, but just letting you know you almost have zarr support already.

The netCDF4 C library has a zarr implementation (which they are interested in splitting out). sscdf can write to it trivially. The output reads fine as a zarr store, but I get a segmentation fault if I try to read it back with netCDF4.

I did have to remove the zlib kwarg from this line to write it:

https://github.com/jim22k/sscdf/blob/14167c8b6ede9a44d2f61996709062e6e1fc14d3/sscdf.py#L218

Example:

import sscdf
import netCDF4 as nc
import graphblas as gb
import zarr
from scipy import sparse

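# Build a random 1000 x 1000 CSR matrix and convert it to a GraphBLAS Matrix
X = gb.io.from_scipy_sparse(sparse.random(1000, 1000, density=0.1, format='csr'))
# Open a netCDF dataset backed by the NCZarr file-store driver
ds = nc.Dataset("file:///Users/isaac/tmp/test_zarr.nc#mode=nczarr,file", "w")

# Write the matrix with sscdf's private writer helper
sscdf.Writer._save_tensor(ds, X)

# Re-open the same path as a plain zarr store
z_store = zarr.open("/Users/isaac/tmp/test_zarr.nc")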

z_store.tree()
/
 ├── col_indices (100000,) uint64
 ├── indptr (1001,) uint64
 ├── ncols (1,) uint64
 ├── nrows (1,) uint64
 └── values (100000,) float64
dict(z_store.attrs)
{'_NCProperties': 'version=2,netcdf=4.8.1,nczarr=2.0.0',
 'format': 'csr',
 'datatype': 'fp64',
 '_NCZARR_ATTR': {'types': {'_NCProperties': '<U1',
   'format': '<U1',
   'datatype': '<U1'}}}
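
For readers unfamiliar with the layout in the tree above: `indptr`, `col_indices` and `values` are the three standard CSR arrays, and `nrows`/`ncols` carry the matrix shape. The sketch below is only an illustration of how those arrays map back to a matrix; `csr_to_dense` is a hypothetical helper written for this note, not part of sscdf, zarr or graphblas.

```python
def csr_to_dense(nrows, ncols, indptr, col_indices, values):
    """Expand CSR arrays into a dense list-of-lists matrix.

    Row i owns the stored entries at positions indptr[i] .. indptr[i + 1] - 1
    of col_indices and values.
    """
    dense = [[0.0] * ncols for _ in range(nrows)]
    for row in range(nrows):
        for k in range(indptr[row], indptr[row + 1]):
            dense[row][col_indices[k]] = values[k]
    return dense


# Tiny 2 x 3 example with three stored entries
small = csr_to_dense(2, 3, [0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0])
```

This is why `indptr` has length `nrows + 1` (1001 here) while `col_indices` and `values` both have one entry per stored element (100000 here).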

But it segfaults if I try to read it with netcdf. I have not tried to debug this.

sscdf.read("file:///Users/isaac/tmp/test_zarr.nc#mode=nczarr,file")
[1]    33167 segmentation fault  ipython
/Users/isaac/miniconda3/envs/sscdf/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

jim22k commented 2 years ago

@ivirshup Thanks for trying this out! I replicated the segfault on my computer (Intel MacBook). The segfault happens immediately when netCDF4 tries to read the zarr file, so while the zarr file looks okay, the nczarr implementation breaks when attempting to read it.

After some experimentation, I found that the issue comes from creating Variables that are not tied to Dimensions. netCDF4 allows this use case, and I use dimensionless variables to store the array dimensions as 1-element arrays. netCDF4+HDF5 seems to handle this fine, so I'm not sure why zarr chokes.

Using dimensionless variables isn't necessary. I could easily store the dimension sizes in the metadata instead. I'll update the Writer and Reader so zarr becomes usable.

jim22k commented 2 years ago

Update

The above doesn't apply anymore. The whole library has been revamped based on the binsparse v1.0 spec discussion.

During that work, I attempted to make it work with nczarr. However, I ran into a bug that is a showstopper. Until that gets fixed, I don't think I can make any progress with zarr support.