SOMASparseNdArray.read with result_order="row-major" is unexpectedly slow: it is roughly 2x slower than calling read() with no result order and then sorting the result with PyArrow's sort_by method.
I would naively expect the TileDB-SOMA implementation to be faster, since it is multi-threaded (the Arrow sort is single-threaded), or at worst similar in speed.
Example, running on an EC2 instance in the same region as the S3 bucket:
the first two (In [12] and In [13]) are an unsorted read followed by an Arrow Table sort - approx 2:40 wall time
the latter two (In [14] and In [15]) are read(result_order='row-major') - approx 5:00 to 5:50 wall time
In [10]: import tiledbsoma as soma
In [11]: E = soma.open("s3://tiledb-bruce/tmp_data/soma/ef220f25-dc26-40d9-98de-7e137d2e1803", context=soma.SOMATileDBContext(tiledb_config={'vfs.s3.region':'us-west-2'}))
In [12]: %time E.ms["RNA"].X["data"].read().tables().concat().sort_by([('soma_dim_0','ascending'),('soma_dim_1','ascending')])
CPU times: user 1min 27s, sys: 1min 40s, total: 3min 8s
Wall time: 2min 41s
Out[12]:
pyarrow.Table
soma_dim_0: int64
soma_dim_1: int64
soma_data: float
----
soma_dim_0: [[0,0,0,0,0,...,109451,109451,109451,109451,109451]]
soma_dim_1: [[2,3,4,8,9,...,59229,59230,59231,59232,59234]]
soma_data: [[7,1,2,30,3,...,4,1,7,1,1]]
In [13]: %time E.ms["RNA"].X["data"].read().tables().concat().sort_by([('soma_dim_0','ascending'),('soma_dim_1','ascending')])
CPU times: user 1min 26s, sys: 1min 34s, total: 3min 1s
Wall time: 2min 37s
Out[13]:
pyarrow.Table
soma_dim_0: int64
soma_dim_1: int64
soma_data: float
----
soma_dim_0: [[0,0,0,0,0,...,109451,109451,109451,109451,109451]]
soma_dim_1: [[2,3,4,8,9,...,59229,59230,59231,59232,59234]]
soma_data: [[7,1,2,30,3,...,4,1,7,1,1]]
In [14]: %time E.ms["RNA"].X["data"].read(result_order='row-major').tables().concat()
CPU times: user 9min 53s, sys: 9min 14s, total: 19min 7s
Wall time: 5min 47s
Out[14]:
pyarrow.Table
soma_dim_0: int64
soma_dim_1: int64
soma_data: float
----
soma_dim_0: [[0,0,0,0,0,...,255,255,255,255,255],[256,256,256,256,256,...,511,511,511,511,511],...,[109312,109312,109312,109312,109312,...,109451,109451,109451,109451,109451],[]]
soma_dim_1: [[2,3,4,8,9,...,59230,59231,59232,59233,59234],[3,6,8,10,11,...,59229,59230,59231,59233,59234],...,[34,47,52,86,91,...,59229,59230,59231,59232,59234],[]]
soma_data: [[7,1,2,30,3,...,11,24,1,2,1],[1,10,9,1,3,...,14,16,22,8,3],...,[2,3,4,1,1,...,4,1,7,1,1],[]]
In [15]: %time E.ms["RNA"].X["data"].read(result_order='row-major').tables().concat()
CPU times: user 9min 59s, sys: 8min 42s, total: 18min 41s
Wall time: 5min 4s
Out[15]:
pyarrow.Table
soma_dim_0: int64
soma_dim_1: int64
soma_data: float
----
soma_dim_0: [[0,0,0,0,0,...,255,255,255,255,255],[256,256,256,256,256,...,511,511,511,511,511],...,[109312,109312,109312,109312,109312,...,109451,109451,109451,109451,109451],[]]
soma_dim_1: [[2,3,4,8,9,...,59230,59231,59232,59233,59234],[3,6,8,10,11,...,59229,59230,59231,59233,59234],...,[34,47,52,86,91,...,59229,59230,59231,59232,59234],[]]
soma_data: [[7,1,2,30,3,...,11,24,1,2,1],[1,10,9,1,3,...,14,16,22,8,3],...,[2,3,4,1,1,...,4,1,7,1,1],[]]
Versions (please complete the following information):
tiledbsoma.__version__ 1.11.4
TileDB-Py version 0.29.0
TileDB core version (tiledb) 2.23.0
TileDB core version (libtiledbsoma) 2.23.0
python version 3.11.9.final.0
OS version Linux 6.8.0-1009-aws