theislab / zellkonverter

Conversion between scRNA-seq objects
https://theislab.github.io/zellkonverter/

Improved support for HDF5 transfer #29

Closed by LTLA 3 years ago

LTLA commented 3 years ago

Closes #13. When converting from AnnData to SCE, if the former has HDF5-backed matrices, these automatically cause the creation of HDF5Arrays (but only when those backed matrices are opened with mode="r"). This avoids any need to load the data into memory prior to the transfer into R.

In addition, there is some slightly better support for the reverse process. I couldn't figure out how to get the AnnData() constructor to accept an h5py.Dataset, but the HDF5 file's contents are now loaded into a NumPy array directly in Python; this avoids loading the data in R before transferring it to Python, which saves a copy.

(Ideally, though, we would have AnnData accept a backed Dataset, as this is the direct analogue of an HDF5Array.)
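
As a rough illustration of the reverse-direction shortcut described above, reading an HDF5 dataset straight into a NumPy array from Python might look like the sketch below. The file name `counts.h5`, the dataset name `counts`, and the helper `load_hdf5_matrix` are all hypothetical; this is not the code in this PR, only the general h5py pattern.

```python
import os
import tempfile

import h5py
import numpy as np


def load_hdf5_matrix(path, name):
    """Hypothetical helper: read one HDF5 dataset fully into memory."""
    with h5py.File(path, "r") as handle:
        # Indexing a dataset with [()] reads all of it into a NumPy array.
        return handle[name][()]


# Build a small throwaway file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "counts.h5")
expected = np.arange(12, dtype="float64").reshape(3, 4)
with h5py.File(path, "w") as handle:
    handle.create_dataset("counts", data=expected)

mat = load_hdf5_matrix(path, "counts")
print(type(mat).__name__, mat.shape)
```

The array is fully realized in Python memory at this point, so nothing has to pass through R first, but the HDF5 backing is lost.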

lazappi commented 3 years ago

I can't get the tests to run locally but I think that might be my setup. I assume you have checked this?

LTLA commented 3 years ago

Should have converted to a draft, actually; can you see if there's any way to stuff an H5Dataset in there?

LTLA commented 3 years ago

Well, bum. This still closes #13, but I failed at the more ambitious task of sliding HDF5Arrays into the AnnData as an h5py.Dataset. Even from within Python itself, assigning an h5py.Dataset to adobj.X, for example, will convert the former into a NumPy array. Poking around the source suggests that the AnnData class itself has some special provisions for H5AD files to support HDF5 backing (e.g., adobj.file), with the implication being that we can't just shove arbitrary HDF5-backed arrays in there. Oh well.

lazappi commented 3 years ago

My brain is dead from EuroBioc, so I'm not sure I followed all of that, but checks run locally for me so I'll merge.

LTLA commented 3 years ago

Basically, the remaining task is to figure out how to transfer an R-side HDF5Array inside a SCE to an h5py.Dataset inside an AnnData object. The problem is that my attempts to assign an h5py.Dataset to the .X member of an AnnData object have always led to the realization of the former into a NumPy array, which defeats the point of having HDF5-backed data.
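
To make the distinction concrete, here is a minimal h5py-only sketch of the difference between a backed dataset and its realized copy. The file name `backed.h5` and dataset name `X` are made up for illustration. It deliberately does not touch AnnData, whose behaviour on assignment is as described above and may differ between versions.

```python
import os
import tempfile

import h5py
import numpy as np

# Write a small throwaway file to stand in for an HDF5-backed matrix.
path = os.path.join(tempfile.mkdtemp(), "backed.h5")
with h5py.File(path, "w") as handle:
    handle.create_dataset("X", data=np.ones((5, 2)))

with h5py.File(path, "r") as handle:
    dset = handle["X"]
    # A Dataset is a handle onto the file, not an in-memory array.
    is_lazy = isinstance(dset, h5py.Dataset) and not isinstance(dset, np.ndarray)
    # Slicing reads only the requested rows from disk.
    block = dset[0:2, :]
    # Converting to NumPy copies the whole dataset into memory.
    realized = np.asarray(dset)

print(is_lazy, block.shape, type(realized).__name__)
```

The `np.asarray` step is the kind of realization we would want to avoid for large matrices.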