single-cell-data / SOMA

A flexible and extensible API for annotated 2D matrix data stored in multiple underlying formats.
MIT License

[python] Fast CSR loading for `to_anndata` #83

Closed bkmartinjr closed 1 year ago

bkmartinjr commented 1 year ago

This PR addresses the performance issues noted in single-cell-data/TileDB-SOMA#719. In particular, loading a large AnnData requires CSR sparse matrices, and converting from COO to CSR is very slow. This PR adds a fast-path converter for Python, implementing two changes:

In addition, this change uses substantially less memory for the conversion, allowing larger datasets to be loaded into CSR, and therefore into AnnData.
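For readers unfamiliar with the conversion itself, here is a minimal NumPy sketch of building CSR arrays from COO triples. It is an illustration only, not the code in this PR: the function name `coo_to_csr_arrays` and the sample data are hypothetical, and it assumes the input contains no duplicate coordinates.

```python
import numpy as np


def coo_to_csr_arrays(rows, cols, vals, n_rows):
    """Build CSR (data, indices, indptr) arrays from COO triples.

    Hypothetical helper for illustration; assumes no duplicate
    (row, col) pairs in the input.
    """
    # Count the entries in each row, then prefix-sum the counts to get
    # the row pointer array (indptr has n_rows + 1 entries).
    counts = np.bincount(rows, minlength=n_rows)
    indptr = np.zeros(n_rows + 1, dtype=np.int64)
    np.cumsum(counts, out=indptr[1:])

    # A stable sort on the row index groups each row's entries together
    # while keeping their original relative order within the row.
    order = np.argsort(rows, kind="stable")
    return vals[order], cols[order], indptr


# Small example: a 3x3 matrix with four stored values.
rows = np.array([2, 0, 1, 0])
cols = np.array([1, 2, 0, 0])
vals = np.array([4.0, 1.0, 3.0, 2.0])
data, indices, indptr = coo_to_csr_arrays(rows, cols, vals, n_rows=3)
```

The resulting triple can be passed to `scipy.sparse.csr_matrix((data, indices, indptr), shape=(3, 3))`. Column indices within a row are left unsorted in this sketch, which is valid CSR but not the canonical form.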

Before/after benchmarks show 1.5-4X speed-ups when the work fits in RAM (tested on r6i instance types with data on S3). In cases where paging occurred, speed-ups were dramatically larger (e.g., 20X).

See also the related PR single-cell-data/TileDB-SOMA#745.

bkmartinjr commented 1 year ago

@thetorpedodog - OK, I think I hit all of your (excellent) feedback.

thetorpedodog commented 1 year ago

Still looks good.

bkmartinjr commented 1 year ago

Related issue with scipy.sparse performance: scipy/scipy#11496