scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Large number of dataframe columns cause hdf5 write error: Unable to create attribute (object header message is too large) #874

Open · brainfo opened this issue 1 year ago

brainfo commented 1 year ago

Minimal code sample (that we can copy&paste without having any data)

Write any AnnData object that has Pearson residuals stored in uns:

ad_all.write(filename='output/10x_h5/ad_all_2cello.h5ad')
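A self-contained sketch along these lines should hit the same error (the shapes just mirror the numbers reported below; the obs/var names and the output path are made-up placeholders):

import anndata as ad
import numpy as np
import pandas as pd

# toy AnnData mimicking the reported setup: a wide dataframe stored in .uns
n_obs, n_vars = 100, 5000
adata = ad.AnnData(X=np.zeros((n_obs, n_vars), dtype=np.float32))

# 5000-column dataframe in uns; on write, its column names all end up in a
# single hdf5 attribute on the group, which is what overflows the object header
adata.uns["pearson_residuals_df"] = pd.DataFrame(
    np.random.default_rng(0).normal(size=(n_obs, n_vars)),
    index=[f"cell_{i}" for i in range(n_obs)],
    columns=[f"gene_{j}" for j in range(n_vars)],
)

# expected to raise: Unable to create attribute (object header message is too large)
adata.write("ad_all_toy.h5ad")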

The pearson_residuals_df entry looks like this, with 38291 rows (obs) and 5000 columns (features):

{'theta': 100,
 'clip': None,
 'computed_on': 'adata.X',
 'pearson_residuals_df': gene_name                             A2M  AADACL2-AS1      AAK1     ABCA1  \
 barcode                                                                      
 GAACGTTCACACCGAC-1-placenta_81  -1.125285    -1.159130 -3.921314 -2.533474   
 TATACCTGTTAGCTAC-1-placenta_81  -1.091364     3.267127 -1.806667 -2.109586   
 CTCAAGAGTGACTGTT-1-placenta_81  -1.074943    12.272920 -1.948798 -2.735791   
 TTCATTGTCACGAACT-1-placenta_81  -1.098699    -1.131765  3.481171  4.472371   
 TATCAGGCAGCTCATA-1-placenta_81  -1.107734    -1.141064 -0.571775 -2.813671   
 ...                                   ...          ...       ...       ...   
 CACAACATCGGCGATC-1-placenta_314 -0.115585    -0.119107 -0.434686 -0.303945   
 AGCCAGCGTGCCCAGT-1-placenta_314 -0.097424    -0.100394 -0.366482 -0.256219   
 CCGGTGAGTGTTCGAT-1-placenta_314 -0.110334    -0.113696 -0.414971 -0.290148   
 AGGTCATAGCCTGACC-1-placenta_314 -0.115585    -0.119107 -0.434686 -0.303945   
 TTTATGCCAAAGGGTC-1-placenta_314 -0.112876    -0.116316 -0.424515 -0.296827 
Unable to create attribute (object header message is too large)

Above error raised while writing key 'pearson_residuals_df' of <class 'h5py._hl.group.Group'> to /

Versions

scanpy==1.9.1 anndata==0.8.0 umap==0.5.2 numpy==1.21.5 scipy==1.8.0 pandas==1.4.1 scikit-learn==1.0.2 statsmodels==0.13.2 python-igraph==0.9.9 pynndescent==0.5.6
ivirshup commented 1 year ago

A similar issue was brought up on the discourse.

An easy way to work around this is to store your data using the zarr format instead of hdf5 (write with adata.write_zarr and read back with anndata.read_zarr).
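For example (paths here are just placeholders):

import anndata as ad

# write to a zarr store instead of h5ad; zarr doesn't have hdf5's
# object-header size limit on attributes
ad_all.write_zarr("output/10x_h5/ad_all_2cello.zarr")

# read it back later
ad_all = ad.read_zarr("output/10x_h5/ad_all_2cello.zarr")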

A better solution will take some effort. Here's some prior discussion from h5py: https://github.com/h5py/h5py/issues/1053. The maximum size for metadata on an hdf5 object can be increased using the H5Pset_attr_phase_change function in the C API. h5py has wrapped this at the cython level, but has not exposed this from the main API (https://github.com/h5py/h5py/pull/1638).
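If the low-level wrapper from that PR is available, a rough and untested sketch of the idea could look like the following. Note the assumptions: set_attr_phase_change is taken to be exposed on the group-creation property list as in that PR (it is not part of h5py's public high-level API), and the file/group names are placeholders.

import h5py
from h5py import h5g, h5p

f = h5py.File("example.h5ad", "w")

# group-creation property list that switches attribute storage to "dense"
# immediately, sidestepping the compact object-header size limit
gcpl = h5p.create(h5p.GROUP_CREATE)
gcpl.set_attr_phase_change(0, 0)  # max_compact=0, min_dense=0 (assumed low-level binding)

# create the group through the low-level API so the gcpl can be passed,
# then wrap it back into a high-level h5py.Group
gid = h5g.create(f.id, b"uns", gcpl=gcpl)
uns = h5py.Group(gid)

# attributes on this group should now be able to exceed the usual header limit
uns.attrs["column-order"] = ["gene_%d" % i for i in range(5000)]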

I believe we would need to:

jan-engelmann commented 1 year ago

I have the same issue. The zarr workaround works.

Being able to store sparse matrices with a chosen chunk size would be great, though. I think that's not possible at the moment, due to this line.

Maybe this comment could help.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. Please add a comment if you want to keep the issue open. Thank you for your contributions!

ivirshup commented 10 months ago

@selmanozleyen, this is the h5py issue I was talking about. Do you think you could take a look at this?

brainfo commented 10 months ago

For the Pearson residuals, would it be feasible to store the pearson_residuals_df as a layer and the other parameter values in uns?

ivirshup commented 10 months ago

@brainfo, oh, for sure. If it's a cells x genes dataframe I think you could just put it as a numpy array into layers, then call adata.to_df(layer="pearson_residuals") whenever you need the dataframe. I believe this should be zero-copy.
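Roughly like this (the uns key is whatever holds the dict shown above, and this assumes the dataframe's rows and columns line up with ad_all.obs_names / ad_all.var_names):

# pull the wide dataframe out of the uns dict and keep it as a plain array in layers
res = ad_all.uns["pearson_residuals_normalization"]  # hypothetical key; use yours
df = res.pop("pearson_residuals_df")

# layers entries must match adata's obs x var shape and ordering
ad_all.layers["pearson_residuals"] = df.to_numpy()

# the remaining scalars (theta, clip, computed_on) can stay in uns as-is
ad_all.write("output/10x_h5/ad_all_2cello.h5ad")

# recover the dataframe whenever it's needed
df_again = ad_all.to_df(layer="pearson_residuals")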