single-cell-data / TileDB-SOMA

Python and R SOMA APIs using TileDB’s cloud-native format. Ideal for single-cell data at any scale.
https://tiledbsoma.readthedocs.io
MIT License
90 stars 25 forks source link

[python] SparseNDArray read returns incorrect result when one dimension is empty-indexed #1703

Closed bkmartinjr closed 1 year ago

bkmartinjr commented 1 year ago

The SparseNDArray.read method returns incorrect query results when the coords provided are a unique combination of:

Concrete test case below. All three cases should identical results (namely, an empty pa.Table). The third case, which should be identical, instead returns all data in the SparseNDArray. Bug was found in the 1.5 pre-release (2.17), but it may exist in previous versions as well.

Bug was found via incorrect ExperimentAxisQuery.to_anndata results, but I narrowed it down to the SparseNDArray.read behavior;

import tiledbsoma as soma
import pyarrow as pa

def main():
    ctx = soma.SOMATileDBContext(
        tiledb_config={
            "vfs.s3.region": "us-west-2",
            "soma.init_buffer_bytes": 4 * 1024**3,
        }
    )

    with soma.open(
        "s3://cellxgene-data-public/cell-census/2023-09-14/soma/", context=ctx
    ) as census:
        mouse_experiment: soma.Experiment = census["census_data"]["mus_musculus"]
        X = mouse_experiment.ms["RNA"].X["raw"]

        # Case 1 - works correctly, in that it returns an empty result
        var_joinids = slice(None)
        for tbl in X.read((pa.array([], type=pa.int64()), var_joinids)).tables():
            print("Case 1:", len(tbl))

        # Case 2 - works correctly, in that it returns an empty result
        var_joinids = None
        for tbl in X.read((pa.array([], type=pa.int64()), var_joinids)).tables():
            print("Case 2:", len(tbl))

        # Case 3 - FAILS, as it returns _all_ data in the sparse array
        var_joinids = (
            mouse_experiment.ms["RNA"]
            .var.read(
                (slice(None),),
                column_names=["soma_joinid"],
            )
            .concat()
            .column("soma_joinid")
            .combine_chunks()
        )
        for tbl in X.read((pa.array([], type=pa.int64()), var_joinids)).tables():
            print("Case 3:", len(tbl))

if __name__ == "__main__":
    main()

Running it shows that the third case is incorrect (the first two return zero-length results, but the third case does not):

python tmp/query_fail.py 
Case 1: 0
Case 2: 0
Case 3: 536870912
Case 3: 536870912
<snip, for brevity>
Case 3: 536870912
Case 3: 536870912
johnkerl commented 1 year ago

[sc-34502]

johnkerl commented 1 year ago

@bkmartinjr I verified in a Docker image just now that this repros with 1.4.3, i.e. not new to 1.5.0rc0. Also, repros with 1.3.0.

johnkerl commented 1 year ago

Analysis, for the record.


For context, here's what core does (driven by tiledb-py just for ease of operation):

>>> A = tiledb.open('s/v/pbmc3k/ms/RNA/X/data')
>>> A.df[[3],[4]]
   soma_dim_0  soma_dim_1  soma_data
0           3           4   1.572679
>>> A.df[[3],[]]
      soma_dim_0  soma_dim_1  soma_data
0              3           0  -0.285241
1              3           1  -0.281735
2              3           2  -0.052227
3              3           3  -0.484929
4              3           4   1.572679
...          ...         ...        ...
1833           3        1833  -0.097402
1834           3        1834   1.631685
1835           3        1835  -0.119462
1836           3        1836  -0.179120
1837           3        1837  -0.505670
[1838 rows x 3 columns]
>>> A.df[[],[4]]
      soma_dim_0  soma_dim_1  soma_data
0              0           4  -0.544024
1              1           4   0.633951
2              2           4   1.332647
3              3           4   1.572679
4              4           4  -0.333409
...          ...         ...        ...
2633        2633           4  -0.666646
2634        2634           4   1.201865
2635        2635           4   2.193953
2636        2636           4  -0.350005
2637        2637           4  -0.457937
[2638 rows x 3 columns]
>>> A.df[[],[]]
         soma_dim_0  soma_dim_1  soma_data
0                 0           0  -0.171470
1                 0           1  -0.280812
2                 0           2  -0.046677
3                 0           3  -0.475169
4                 0           4  -0.544024
...             ...         ...        ...
4848639        2637        1833  -0.103831
4848640        2637        1834  -0.436880
4848641        2637        1835  -0.078424
4848642        2637        1836  -0.130327
4848643        2637        1837  -0.471338
[4848644 rows x 3 columns]

Note in particular: in core, empty means all.

This seemed familiar to me, & as I looked through the tiledbsoma-cpp code, I realized why: we've been here before. Back in October of last year:

and then

When the data are one-dimensional, then this suffices:

https://github.com/single-cell-data/TileDB-SOMA/blob/1.4.0/libtiledbsoma/src/soma/managed_query.cc#L154-L156

This is because https://github.com/single-cell-data/TileDB-SOMA/blob/1.4.0/libtiledbsoma/src/soma/managed_query.h#L133

Namely:

However, with two-dimensional data, this thing we did in October 2022 isn't sufficient:

So we need to track subarray_range_empty_ on a dim-by-dim basis in order to effectively work around core's empty-means-all semantics -- if any dim has empty range then override core & declare the result-set is empty. The learning now is that core's "empty means all" needs to be applied dim by dim. That's what this #1706 does.