single-cell-data / TileDB-SOMA

Python and R SOMA APIs using TileDB’s cloud-native format. Ideal for single-cell data at any scale.
https://tiledbsoma.readthedocs.io
MIT License

[Bug][Python] Sparse array read with `result_order` is slow #2687

Open bkmartinjr opened 3 months ago

bkmartinjr commented 3 months ago

`SOMASparseNdArray.read` with `result_order="row-major"` is unexpectedly slow: it takes roughly 2x as long as calling `read()` without a result order and then sorting the result with PyArrow's `sort_by` method (about 5 minutes versus about 2 minutes 40 seconds of wall time in the runs below).

I would naively expect the TileDB-SOMA implementation to be faster, since it is multi-threaded and the Arrow sort is single-threaded, or at worst similar in speed.

Example, running on an EC2 instance in the same region as the S3 bucket:

In [10]: import tiledbsoma as soma

In [11]: E = soma.open("s3://tiledb-bruce/tmp_data/soma/ef220f25-dc26-40d9-98de-7e137d2e1803", context=soma.SOMATileDBContext(tiledb_config={'vfs.s3.region':'us-west-2'}))

In [12]: %time E.ms["RNA"].X["data"].read().tables().concat().sort_by([('soma_dim_0','ascending'),('soma_dim_1','ascending')])
CPU times: user 1min 27s, sys: 1min 40s, total: 3min 8s
Wall time: 2min 41s
Out[12]: 
pyarrow.Table
soma_dim_0: int64
soma_dim_1: int64
soma_data: float
----
soma_dim_0: [[0,0,0,0,0,...,109451,109451,109451,109451,109451]]
soma_dim_1: [[2,3,4,8,9,...,59229,59230,59231,59232,59234]]
soma_data: [[7,1,2,30,3,...,4,1,7,1,1]]

In [13]: %time E.ms["RNA"].X["data"].read().tables().concat().sort_by([('soma_dim_0','ascending'),('soma_dim_1','ascending')])
CPU times: user 1min 26s, sys: 1min 34s, total: 3min 1s
Wall time: 2min 37s
Out[13]: 
pyarrow.Table
soma_dim_0: int64
soma_dim_1: int64
soma_data: float
----
soma_dim_0: [[0,0,0,0,0,...,109451,109451,109451,109451,109451]]
soma_dim_1: [[2,3,4,8,9,...,59229,59230,59231,59232,59234]]
soma_data: [[7,1,2,30,3,...,4,1,7,1,1]]

In [14]: %time E.ms["RNA"].X["data"].read(result_order='row-major').tables().concat()
CPU times: user 9min 53s, sys: 9min 14s, total: 19min 7s
Wall time: 5min 47s
Out[14]: 
pyarrow.Table
soma_dim_0: int64
soma_dim_1: int64
soma_data: float
----
soma_dim_0: [[0,0,0,0,0,...,255,255,255,255,255],[256,256,256,256,256,...,511,511,511,511,511],...,[109312,109312,109312,109312,109312,...,109451,109451,109451,109451,109451],[]]
soma_dim_1: [[2,3,4,8,9,...,59230,59231,59232,59233,59234],[3,6,8,10,11,...,59229,59230,59231,59233,59234],...,[34,47,52,86,91,...,59229,59230,59231,59232,59234],[]]
soma_data: [[7,1,2,30,3,...,11,24,1,2,1],[1,10,9,1,3,...,14,16,22,8,3],...,[2,3,4,1,1,...,4,1,7,1,1],[]]

In [15]: %time E.ms["RNA"].X["data"].read(result_order='row-major').tables().concat()
CPU times: user 9min 59s, sys: 8min 42s, total: 18min 41s
Wall time: 5min 4s
Out[15]: 
pyarrow.Table
soma_dim_0: int64
soma_dim_1: int64
soma_data: float
----
soma_dim_0: [[0,0,0,0,0,...,255,255,255,255,255],[256,256,256,256,256,...,511,511,511,511,511],...,[109312,109312,109312,109312,109312,...,109451,109451,109451,109451,109451],[]]
soma_dim_1: [[2,3,4,8,9,...,59230,59231,59232,59233,59234],[3,6,8,10,11,...,59229,59230,59231,59233,59234],...,[34,47,52,86,91,...,59229,59230,59231,59232,59234],[]]
soma_data: [[7,1,2,30,3,...,11,24,1,2,1],[1,10,9,1,3,...,14,16,22,8,3],...,[2,3,4,1,1,...,4,1,7,1,1],[]]

Versions:

tiledbsoma.__version__              1.11.4
TileDB-Py version                   0.29.0
TileDB core version (tiledb)        2.23.0
TileDB core version (libtiledbsoma) 2.23.0
python version                      3.11.9.final.0
OS version                          Linux 6.8.0-1009-aws

johnkerl commented 1 month ago

[sc-51538]