single-cell-data / TileDB-SOMA

Python and R SOMA APIs using TileDB’s cloud-native format. Ideal for single-cell data at any scale.
https://tiledbsoma.readthedocs.io
MIT License

[python/r] DataFrame: value filter on enum/dict column generates internal error when sought value not in enumeration #1988

Closed bkmartinjr closed 6 months ago

bkmartinjr commented 10 months ago

I have an empty dataframe containing dictionary/enum attributes. When a value filter / query condition is applied to it, it triggers an internal Arrow error. It should return an empty result. All works fine for non-dictionary attributes, so it appears that value filters do not always work correctly with dict/enum attributes.

Note that the tiledb package also has questionable behavior here, raising an exception if the value filter tests for a value not in the enumeration. So the Arrow error is likely unique to the libtiledbsoma codepath, but both behaviors make combining filters and enums problematic.

What I think should happen: the value filter should behave identically (i.e., return the same results) for a column of type "T" and a column of type "enum-of-T", where T is string, int, etc. (e.g., a query against a "dict of strings" column should behave the same as a query against a string column).
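To make the expected semantics concrete, here is a minimal pure-Python sketch. It is only a model: `DictColumn`, `filter_eq_plain`, and `filter_eq_dict` are hypothetical names invented for illustration and are not part of tiledbsoma, tiledb, or pyarrow. The point is that a sought value absent from the enumeration simply matches no rows, so the dictionary column and its decoded plain equivalent give the same answer.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class DictColumn:
    """Hypothetical stand-in for a dictionary/enum column:
    a list of unique values plus one index per row."""

    values: List[str]
    indices: List[int]

    def decode(self) -> List[str]:
        # Expand the indices back into a plain column of values.
        return [self.values[i] for i in self.indices]


def filter_eq_plain(column: Sequence[str], sought: str) -> List[int]:
    """Row positions where a plain (non-dictionary) column equals `sought`."""
    return [row for row, value in enumerate(column) if value == sought]


def filter_eq_dict(column: DictColumn, sought: str) -> List[int]:
    """The same query against a dictionary column.

    A value missing from the enumeration cannot match any row,
    so the result is empty rather than an error.
    """
    if sought not in column.values:
        return []
    code = column.values.index(sought)
    return [row for row, index in enumerate(column.indices) if index == code]


dataset_id = DictColumn(values=["ds_a", "ds_b"], indices=[0, 1, 1, 0])
plain = dataset_id.decode()

# Both column representations should agree, whether or not the value exists.
for sought in ("ds_b", "foobar"):
    assert filter_eq_dict(dataset_id, sought) == filter_eq_plain(plain, sought)

present_rows = filter_eq_dict(dataset_id, "ds_b")
absent_rows = filter_eq_dict(dataset_id, "foobar")
```

In this model, `"foobar"` yields an empty list of rows for both representations, which is the behavior I would expect from `read(value_filter=...)`.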

<late edit> The empty dataframe is unrelated. It fails in exactly the same way for non-empty arrays. I'll add an example of that below. </late edit>

The schema (abbreviated for ease of reading):

In [102]: obs.schema
Out[102]: 
soma_joinid: int64
dataset_id: dictionary<values=string, indices=int8, ordered=0>
is_primary_data: bool
observation_joinid: large_string
# lots of other columns removed for brevity

Reading the entire thing works correctly (output abbreviated):

In [103]: obs.read().concat()
Out[103]: 
pyarrow.Table
soma_joinid: int64
dataset_id: dictionary<values=string, indices=int8, ordered=0>
is_primary_data: bool
observation_joinid: large_string
----
soma_joinid: [[]]
dataset_id: [  -- dictionary:
[]  -- indices:
[]]
assay: [  -- dictionary:
[]  -- indices:
[]]
...

Read with a value filter on a string attribute works fine (output abbreviated):

In [104]: obs.read(value_filter="observation_joinid == 'foobar'").concat()
Out[104]: 
pyarrow.Table
soma_joinid: int64
dataset_id: dictionary<values=string, indices=int8, ordered=0>
is_primary_data: bool
observation_joinid: large_string
----
soma_joinid: [[]]
dataset_id: [  -- dictionary:
[]  -- indices:
[]]
...

Reading with a value filter on a dict column fails with an internal Arrow validation error:

In [105]: obs.read(value_filter="""dataset_id == 'foobar'""").concat()
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[105], line 1
----> 1 obs.read(value_filter="""dataset_id == 'foobar'""").concat()

File ~/cellxgene-census/venv-builder/lib/python3.10/site-packages/tiledbsoma/_read_iters.py:72, in TableReadIter.concat(self)
     70 def concat(self) -> pa.Table:
     71     """Concatenate remainder of iterator, and return as a single `Arrow Table <https://arrow.apache.org/docs/python/generated/pyarrow.Table.html>`_"""
---> 72     return pa.concat_tables(self)

File ~/cellxgene-census/venv-builder/lib/python3.10/site-packages/pyarrow/table.pxi:5233, in pyarrow.lib.concat_tables()

File ~/cellxgene-census/venv-builder/lib/python3.10/site-packages/tiledbsoma/_read_iters.py:68, in TableReadIter.__next__(self)
     67 def __next__(self) -> pa.Table:
---> 68     return next(self._reader)

File ~/cellxgene-census/venv-builder/lib/python3.10/site-packages/tiledbsoma/_read_iters.py:454, in _arrow_table_reader(sr)
    452 def _arrow_table_reader(sr: clib.SOMAArray) -> Iterator[pa.Table]:
    453     """Private. Simple Table iterator on any Array"""
--> 454     tbl = sr.read_next()
    455     while tbl is not None:
    456         yield tbl

File ~/cellxgene-census/venv-builder/lib/python3.10/site-packages/pyarrow/table.pxi:3986, in pyarrow.lib.Table.from_arrays()

File ~/cellxgene-census/venv-builder/lib/python3.10/site-packages/pyarrow/table.pxi:3266, in pyarrow.lib.Table.validate()

File ~/cellxgene-census/venv-builder/lib/python3.10/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

ArrowInvalid: Column 1 named dataset_id expected length 2097152 but got length 16777216

The latest tiledb package behaves differently (and also arguably incorrectly):

In [108]: A = tiledb.open("tmp/census/2023-12-15/soma/census_data/mus_musculus/obs")

In [109]: A.query(use_arrow=True).df[:]
Out[109]: 
Empty DataFrame
Columns: [soma_joinid, dataset_id, assay, assay_ontology_term_id, cell_type, cell_type_ontology_term_id, development_stage, development_stage_ontology_term_id, disease, disease_ontology_term_id, donor_id, is_primary_data, observation_joinid, self_reported_ethnicity, self_reported_ethnicity_ontology_term_id, sex, sex_ontology_term_id, suspension_type, tissue, tissue_ontology_term_id, tissue_type, tissue_general, tissue_general_ontology_term_id, raw_sum, nnz, raw_mean_nnz, raw_variance_nnz, n_measured_vars]
Index: []

In [110]: A.query(cond="dataset_id == 'foobar'", use_arrow=True).df[:]
---------------------------------------------------------------------------
TileDBError                               Traceback (most recent call last)
Cell In[110], line 1
----> 1 A.query(cond="dataset_id == 'foobar'", use_arrow=True).df[:]

File ~/cellxgene-census/venv-builder/lib/python3.10/site-packages/tiledb/multirange_indexing.py:256, in _BaseIndexer.__getitem__(self, idx)
    254     self.subarray = Subarray(self.array)
    255     self._set_ranges(idx)
--> 256 return self if self.return_incomplete else self._run_query()

File ~/cellxgene-census/venv-builder/lib/python3.10/site-packages/tiledb/multirange_indexing.py:399, in DataFrameIndexer._run_query(self)
    396 import pyarrow
    398 if self.pyquery is not None:
--> 399     self.pyquery.submit()
    401 if self.pyquery is None:
    402     df = pandas.DataFrame(self._empty_results)

TileDBError: TileDB internal: Enumeration value not found for field 'dataset_id'

Package version info:

tiledbsoma.__version__        1.6.1
TileDB-Py tiledb.version()    (0, 24, 0)
TileDB core version           2.18.2
libtiledbsoma version()       libtiledb=2.18.2
python version                3.10.12.final.0
OS version                    Linux 6.2.0-1017-aws

I can make the problematic empty dataframe available if helpful.


The empty/non-empty state of the array is unrelated. Here is an example on a non-empty dataframe with the same schema, failing in the same way:

In [6]: obs.count
Out[6]: 31470

In [7]: obs.read(value_filter="""dataset_id == 'foobar'""").concat()
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[7], line 1
----> 1 obs.read(value_filter="""dataset_id == 'foobar'""").concat()

File ~/cellxgene-census/venv-builder/lib/python3.10/site-packages/tiledbsoma/_read_iters.py:72, in TableReadIter.concat(self)
     70 def concat(self) -> pa.Table:
     71     """Concatenate remainder of iterator, and return as a single `Arrow Table <https://arrow.apache.org/docs/python/generated/pyarrow.Table.html>`_"""
---> 72     return pa.concat_tables(self)

File ~/cellxgene-census/venv-builder/lib/python3.10/site-packages/pyarrow/table.pxi:5233, in pyarrow.lib.concat_tables()

File ~/cellxgene-census/venv-builder/lib/python3.10/site-packages/tiledbsoma/_read_iters.py:68, in TableReadIter.__next__(self)
     67 def __next__(self) -> pa.Table:
---> 68     return next(self._reader)

File ~/cellxgene-census/venv-builder/lib/python3.10/site-packages/tiledbsoma/_read_iters.py:454, in _arrow_table_reader(sr)
    452 def _arrow_table_reader(sr: clib.SOMAArray) -> Iterator[pa.Table]:
    453     """Private. Simple Table iterator on any Array"""
--> 454     tbl = sr.read_next()
    455     while tbl is not None:
    456         yield tbl

File ~/cellxgene-census/venv-builder/lib/python3.10/site-packages/pyarrow/table.pxi:3986, in pyarrow.lib.Table.from_arrays()

File ~/cellxgene-census/venv-builder/lib/python3.10/site-packages/pyarrow/table.pxi:3266, in pyarrow.lib.Table.validate()

File ~/cellxgene-census/venv-builder/lib/python3.10/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

ArrowInvalid: Column 1 named dataset_id expected length 2097152 but got length 16777216

bkmartinjr commented 10 months ago

Also see TileDB-Inc/TileDB-Py#1880, which is a related ease-of-use issue for our use case. For many of our dataframe columns where we want to use enums, it would be far easier if the value filter equality ops (==, in [...], etc.) worked on enums/dicts.
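Until that works, one possible caller-side guard is to drop sought values that are not in the enumeration before building the filter string. The sketch below is an assumption-laden illustration: `build_in_filter` is a hypothetical helper, the caller is assumed to already have the column's enumeration values in hand (how to fetch them is not shown in this thread), and the `in [...]` filter syntax is taken from the comment above rather than verified against a particular release.

```python
from typing import Iterable, Optional, Sequence


def build_in_filter(
    column: str, wanted: Iterable[str], enum_values: Sequence[str]
) -> Optional[str]:
    """Hypothetical helper: build an `in [...]` value filter that only
    mentions values present in the column's enumeration.

    Returns None when no sought value is in the enumeration; the caller
    should then treat the query result as empty instead of issuing a read.
    """
    known = set(enum_values)
    present = [value for value in wanted if value in known]
    if not present:
        return None
    # repr() of a list of str gives a quoted list literal; values that
    # contain quote characters may need extra care.
    return f"{column} in {present!r}"


enum_values = ["ds_a", "ds_b"]  # assumed to come from the array schema
some_filter = build_in_filter("dataset_id", ["ds_a", "foobar"], enum_values)
no_filter = build_in_filter("dataset_id", ["foobar"], enum_values)
```

When the helper returns `None`, the caller can skip the read entirely and return an empty table, which sidesteps both the Arrow error and the tiledb exception. This is a workaround only; the fix belongs in the library.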

johnkerl commented 10 months ago

Needs triaging for R as well

johnkerl commented 9 months ago

[sc-38450]

johnkerl commented 9 months ago

@eddelbuettel this needs triaging for R as well please

ryan-williams commented 7 months ago

#2299 added this test, which verifies the issue no longer exists (as of TileDB 2.21.0).

Not sure if there is independent verification that still needs to happen in R…

johnkerl commented 7 months ago

@ryan-williams there is independent verification in R. I'll do that. This PR is for Python and that's fine.

johnkerl commented 7 months ago

I am blocked on the R side. Questions in Slack.

johnkerl commented 6 months ago

See also #2311 for tracking toward 1.9

johnkerl commented 6 months ago

> I am blocked on the R side. Questions in Slack.

@mojaveazure has set me up! :)

johnkerl commented 6 months ago

Closed with #2308 and #2316