single-cell-data / TileDB-SOMA

Python and R SOMA APIs using TileDB’s cloud-native format. Ideal for single-cell data at any scale.
https://tiledbsoma.readthedocs.io
MIT License

[python] TileDB does not allow `""` as the sole string-enum value #2859

Open johnkerl opened 1 month ago

johnkerl commented 1 month ago

Split out from #2858, after initial customer report at #2822.

Here is a repro script: https://gist.github.com/johnkerl/d45f022d710842d36d1b9f29303ce466

Output:

----------------------------------------------------------------
ADATA OBJECT:

AnnData object with n_obs × n_vars = 16 × 4
    obs: 'cell_type'
    var: 'means'

----------------------------------------------------------------
INGESTING TO tiledbsoma-io-empty-string-enum:
Traceback (most recent call last):
  File "/Users/johnkerl/git/TileDB-Inc/cloud-dev-temp/debug/tiledbsoma-nullables/./tiledbsoma-io-write-empty-string-enum.py", line 122, in <module>
    tiledbsoma.io.from_anndata(suri, adata, measurement_name="RNA")
  File "/Users/johnkerl/git/single-cell-data/TileDB-SOMA/apis/python/src/tiledbsoma/io/ingest.py", line 511, in from_anndata
    with _write_dataframe(
  File "/Users/johnkerl/git/single-cell-data/TileDB-SOMA/apis/python/src/tiledbsoma/io/ingest.py", line 1162, in _write_dataframe
    return _write_dataframe_impl(
  File "/Users/johnkerl/git/single-cell-data/TileDB-SOMA/apis/python/src/tiledbsoma/io/ingest.py", line 1235, in _write_dataframe_impl
    _write_arrow_table(
  File "/Users/johnkerl/git/single-cell-data/TileDB-SOMA/apis/python/src/tiledbsoma/io/ingest.py", line 1134, in _write_arrow_table
    handle.write(arrow_table, platform_config=tiledb_write_options)
  File "/Users/johnkerl/git/single-cell-data/TileDB-SOMA/apis/python/src/tiledbsoma/_dataframe.py", line 466, in write
    clib_dataframe.write(batch, sort_coords or False)
RuntimeError: Enumeration: Unable to extend an enumeration without a data buffer.

Notes:

If the input-data column has been created this way then all is well:

"cell_type": pd.Categorical(np.array([""], dtype=str), categories=[""]),

If the input-data column has been created this way then we get the crash:

"cell_type": pd.Categorical(np.array([""], dtype=str),
johnkerl commented 4 weeks ago

This is a core defect. I've filed [sc-53027] with the TileDB Core team.
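
Until the core fix lands, it may help to find candidate columns before ingesting. Below is a minimal sketch, using only pandas, that lists categorical columns whose sole category is `""`. The helper name `columns_with_sole_empty_category` and the sample `obs` frame are hypothetical, not part of tiledbsoma. Per the notes above, not every such column necessarily triggers the error, so treat the result as a list of columns to inspect rather than a definitive diagnosis.

```python
from __future__ import annotations

import pandas as pd


def columns_with_sole_empty_category(df: pd.DataFrame) -> list[str]:
    """Return names of categorical columns whose only category is ""."""
    hits = []
    for name in df.columns:
        col = df[name]
        # Only categorical columns are written as enumerations.
        if isinstance(col.dtype, pd.CategoricalDtype):
            if list(col.cat.categories) == [""]:
                hits.append(name)
    return hits


# Hypothetical obs frame: one column matches the pattern, one does not.
obs = pd.DataFrame(
    {
        "cell_type": pd.Categorical([""] * 4, categories=[""]),
        "batch": pd.Categorical(["a", "b", "a", "b"]),
    }
)

flagged = columns_with_sole_empty_category(obs)
print(flagged)  # ['cell_type']
```

With an AnnData object, the same check could be run on `adata.obs` and `adata.var` before calling `tiledbsoma.io.from_anndata`.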