[python] Fast CSR loading for `to_anndata`

bkmartinjr commented 1 year ago

This PR addresses the performance issues noted in single-cell-data/TileDB-SOMA#719. In particular, loading a large AnnData requires CSR sparse matrices, and converting the COO data read from SOMA to CSR is very slow. This PR adds a fast-path converter for Python that implements two changes:

- parallelized COO-to-CSR conversion
- eager reading of the COO data from SOMA, allowing some concurrent COO-to-CSR work to overlap with the data reading
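For readers unfamiliar with the technique, the two changes can be illustrated with a minimal sketch. This is not the code in this PR: the function name `coo_chunks_to_csr`, the `(row, col, value)` chunk layout, and the `max_workers` parameter are all hypothetical, and the sketch uses only NumPy and a thread pool. It builds CSR arrays with a two-pass counting sort: per-row counts are computed on worker threads while the next chunk is still being read, and each chunk is then scattered into its own disjoint slots of the output arrays.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def coo_chunks_to_csr(chunk_iter, shape, max_workers=4):
    """Build CSR arrays (data, indices, indptr) from an iterable of COO chunks.

    Hypothetical helper for illustration. Each chunk is a (row, col, value)
    tuple of equal-length 1-D arrays; row and col must be integer arrays.
    """
    n_rows, _ = shape

    def count_rows(chunk):
        # Number of entries this chunk contributes to each row.
        return np.bincount(chunk[0], minlength=n_rows)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Pass 1: pull chunks eagerly. Counting for earlier chunks runs on
        # worker threads while the main thread fetches the next chunk.
        chunks, pending = [], []
        for chunk in chunk_iter:
            chunks.append(chunk)
            pending.append(pool.submit(count_rows, chunk))
        counts = [f.result() for f in pending]

        # Row pointer: prefix sum of the total number of entries per row.
        indptr = np.zeros(n_rows + 1, dtype=np.int64)
        if counts:
            np.cumsum(np.sum(counts, axis=0), out=indptr[1:])
        nnz = int(indptr[-1])
        value_dtype = chunks[0][2].dtype if chunks else np.float64
        indices = np.empty(nnz, dtype=np.int64)
        data = np.empty(nnz, dtype=value_dtype)

        def scatter(chunk, row_start, row_count):
            row, col, val = chunk
            # Group this chunk's entries by row, preserving input order.
            order = np.argsort(row, kind="stable")
            # Offset of each row's first entry within the sorted chunk.
            local_start = np.cumsum(row_count) - row_count
            # Destination = this chunk's first slot in the row + rank in row.
            dest = np.repeat(row_start - local_start, row_count) + np.arange(len(row))
            indices[dest] = col[order]
            data[dest] = val[order]

        # Pass 2: every chunk owns a disjoint set of output slots, so the
        # scatters can run concurrently without locking.
        jobs = []
        next_free = indptr[:-1].copy()
        for chunk, row_count in zip(chunks, counts):
            jobs.append(pool.submit(scatter, chunk, next_free.copy(), row_count))
            next_free += row_count
        for job in jobs:
            job.result()

    return data, indices, indptr
```

The returned arrays can be passed to `scipy.sparse.csr_matrix((data, indices, indptr), shape=shape)`. The sketch does not sum duplicate coordinates or sort column indices within a row, and how much the threads actually run in parallel depends on which NumPy operations release the GIL.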

In addition, this change uses substantially less memory for the conversion, allowing larger datasets to be loaded into CSR, and therefore into AnnData.

Before/after benchmarks show 1.5-4X speed-ups when the work fits in RAM (tested on r6i instance types with data on S3). In cases where paging occurred, speed-ups were dramatically larger (e.g., 20X).

See also the related PR single-cell-data/TileDB-SOMA#745.

bkmartinjr commented 1 year ago

@thetorpedodog - OK, I think I hit all of your (excellent) feedback.

thetorpedodog commented 1 year ago

Still looks good.

bkmartinjr commented 1 year ago

Related issue with scipy.sparse performance: scipy/scipy#11496

single-cell-data / SOMA

[python] Fast CSR loading for `to_anndata` #83