This PR addresses the performance issues noted in single-cell-data/TileDB-SOMA#719. In particular, loading a large AnnData requires CSR sparse matrices, and converting the COO data read from SOMA into CSR is very slow. This PR adds a fast-path converter for Python, implementing two changes:
- Parallelized COO-to-CSR conversion.
- Eager reading of the COO data from SOMA, allowing some concurrent COO-to-CSR work to overlap with the data reading.
In addition, this change uses substantially less memory for the conversion, allowing larger datasets to be loaded into CSR, and therefore into AnnData.
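For readers unfamiliar with the technique, the following is a minimal sketch of one way to parallelize a COO-to-CSR conversion. It is not the code in this PR: the function name `coo_to_csr`, the `n_workers` parameter, and the row-band partitioning are assumptions made for illustration. It uses only NumPy and the standard library, does not sum duplicate coordinates, and does not show the overlap with eager reads.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def coo_to_csr(n_rows, row, col, data, n_workers=4):
    """Convert COO triplets to CSR arrays (indptr, indices, values).

    Hypothetical helper for illustration only; not the converter in this PR.
    """
    # Pass 1: count non-zeros per row and prefix-sum them into indptr.
    counts = np.bincount(row, minlength=n_rows)
    indptr = np.zeros(n_rows + 1, dtype=np.int64)
    np.cumsum(counts, out=indptr[1:])

    indices = np.empty(len(col), dtype=col.dtype)
    values = np.empty(len(data), dtype=data.dtype)

    # Split the rows into contiguous bands. Each band owns a disjoint
    # slice of the output arrays, so workers never write to the same place.
    bounds = np.linspace(0, n_rows, n_workers + 1).astype(np.int64)

    def fill_band(lo, hi):
        mask = (row >= lo) & (row < hi)
        r, c, v = row[mask], col[mask], data[mask]
        order = np.lexsort((c, r))  # sort by row, then by column
        start, stop = indptr[lo], indptr[hi]
        indices[start:stop] = c[order]
        values[start:stop] = v[order]

    # Pass 2: fill each band in its own thread.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [
            pool.submit(fill_band, int(lo), int(hi))
            for lo, hi in zip(bounds[:-1], bounds[1:])
        ]
        for future in futures:
            future.result()  # re-raise any worker exception

    return indptr, indices, values
```

The resulting arrays can be passed to `scipy.sparse.csr_matrix((values, indices, indptr), shape=...)`. How much speed-up threads give in this sketch depends on how much of the NumPy work runs with the GIL released, so it should be benchmarked rather than assumed.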
Before/after benchmarks show 1.5-4X speed-ups when the work fits in RAM (tested on r6i instance types with data on S3). In cases where paging occurred, speed-ups were dramatically larger (e.g., 20X).
See also the related PR single-cell-data/TileDB-SOMA#745.