scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License
525 stars 150 forks source link

Backed sparse data and `scipy.sparse.sparray` #1088

Open ilan-gold opened 11 months ago

ilan-gold commented 11 months ago

Writing up a discussion with @ivirshup:

Scipy is moving away from their matrix API towards the array API (see note here). Consequently, or perhaps semi-coincidentally, the internal functions whose signature/title/behavior on which we rely is going to be even more in flux (e.g., _get_intXslice in backed mode classes).

For this reason it perhaps makes sense to move away from inheriting from scipy's internal classes and towards our own wrappers. While this may sound like more code maintenance, two things will probably put a finite limit on how much we have to do:

  1. The current method implementations are not too complicated and do not themselves really rely on internal functionality to scipy. For this reason, it should be fairly easy to abandon the inheritance structure and then just throw errors on the other methods for which scipy provides implementations that we likely use i.e., getSliceXInt on csr_matrix. People probably should not be doing this in backed mode anyway, and so we can probably safely throw an error.
  2. These classes are fundamentally internal, themselves returning xxx_matrix classes directly from scipy at the moment. Thus throwing an error or changing behavior should not be too much of an issue, with the exception of the return type.

Therefore moving forward we should probably investigate

ivirshup commented 11 months ago

Previous discussion of scipy.sparse.*_array classes:

Inheritance

I think there are other good reasons to move away from inheritance for these functions. The code in scipy is inflexible, and assumes in-memory objects. This generally does not fit our usage patterns and can lead to very poor performance (e.g. loading an entire array into memory).

Changing the return types of our current/future internal sparse matrix classes (i.e., not XXXDataset) to be array instead of matrix

The relevant changes in sparse array semantics are:

There's a question of compatibility with current support. I would suggest that most of the behavior is the same, so we could wrap the new implementation in a class that returns matrices, and implements 1d indexing.

ivirshup commented 10 months ago

One other point about returning different array types, we'd ideally also like to be able to return cupyx.scipy.sparse arrays.

Maybe we need something like zarrs meta_array here?

gjhuizing commented 7 months ago

Hi, it seems sparse arrays are deprecating .getnnz which is used in Scanpy and Muon. If moving to sparse arrays is on the roadmap for AnnData it could be worth checking it doesn't break things there! https://github.com/scverse/scanpy/issues/2773

ivirshup commented 4 months ago

As an update, I believe getnnz is being un-deprecated as there hasn't been a replacement proposed.