single-cell-data / TileDB-SOMA

Python and R SOMA APIs using TileDB’s cloud-native format. Ideal for single-cell data at any scale.
https://tiledbsoma.readthedocs.io
MIT License
86 stars 25 forks source link

[python/r] Nullability tracking (parent) #2858

Open johnkerl opened 1 month ago

johnkerl commented 1 month ago

Context

This is split out from #2822. #2822 had a couple questions: one was answered there conclusivel, and the other turns out to be multi-faceted. This issue tracks the second.

Also note nullability for all attribute/column types is well-supported in TileDB Core; bugs here are strictly at the TileDB-SOMA level.

Purpose

Characterize and isolate nullability-related issues within TileDB-SOMA.

Individual issues will be split out, prioritized, assigned, and scheduled.

Coverage matrix

What does "null" mean in source data:

Surfaces to check:

Who writes, and with from what source formats:

Column types:

References

mdylan2 commented 1 month ago

One question related to the case where the Arrow schema is provided. How would one specify nullable=True for an attribute?

It doesn't seem like the Python API implementation for PlatformConfig supports nullable: https://github.com/single-cell-data/TileDB-SOMA/tree/main/apis/python#platform_config-format.

Repro script:

import tiledbsoma as soma
import tiledb
import os
import shutil
import pyarrow as pa

obs_schema = pa.schema([("soma_joinid", pa.int64()), ("barcode", pa.large_string()) ])
platform_config = {
    "tiledb": {
        "create": {
            "attrs": {
                "barcode": {
                    "filters": [{"_type": "ZstdFilter", "level": 9}],
                    "nullable": True # also tried string "true"                }
            }
        }
    }
}
exp_path = "./test"
exp = soma.Experiment.create(exp_path)
exp.add_new_dataframe(
    "obs",
    schema=obs_schema,
    index_column_names=["soma_joinid"],
    platform_config=platform_config
)
exp.close()

with tiledb.open(os.path.join(exp_path, "obs")) as arr:
    print(arr.schema)

shutil.rmtree(exp_path)

prints

ArraySchema(
  domain=Domain(*[
    Dim(name='soma_joinid', domain=(0, 2147483646), tile=2048, dtype='int64', filters=FilterList([ZstdFilter(level=3), ])),
  ]),
  attrs=[
    Attr(name='barcode', dtype='<U0', var=True, nullable=False, enum_label=None, filters=FilterList([ZstdFilter(level=9), ])),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=100000,
  sparse=True,
  allows_duplicates=False,
)
johnkerl commented 1 month ago

Hi @mdylan2 ! Great question! At the moment I'm typing up issues for the Python tiledbsoma.io cases. I don't have a write-up yet for the add_new_dataframe / "bring your own Arrow schema" case, but I do know there is something extra going on there. I've verified that (in a certain case) the core schema created does have the nullability flag set, and I believe there's just an intermediate miswiring. I'll be sure to directly address your case in an upcoming child issue umbrellaed under this parent task. And if we need to surface nullability at the PlatformConfig level, we'll have docs on that as well. And also we need clearer guidance to users on setting nullability at the Arrow-schema level.

johnkerl commented 1 month ago

A bit more info @mdylan2 : re https://gist.github.com/johnkerl/3a7473dc24974bcc47f7b8257a19bbdb

So this is a bug of ours -- I'll isolate it -- thanks again for the repro script!

johnkerl commented 1 month ago

@mdylan2 found it -- this will be a quick fix -- more tomorrow!

johnkerl commented 1 month ago

@mdylan2 the issue is #2869 with PR #2868. This fix will go out with TileDB-SOMA 1.13.1 (if we do one) or else 1.14.0.

I've now established that the workaround for now is to set metadata like

pa.schema(
    [
        pa.field("x", pa.int32()),
    ],
    metadata={
        "x": "nullable",
    }
)

Please let me know if this resolves everything for you in your add_new_dataframe use-case.

mdylan2 commented 1 month ago

That worked, thank you @johnkerl!

johnkerl commented 1 month ago

@mdylan2 -- update at https://github.com/single-cell-data/TileDB-SOMA/issues/2857#issuecomment-2288980855 -- regarding how to set up nullable booleans (same goes for ints/floats too I believe, & will test explicitly) -- at the point in time when you set up your obs/var Pandas dataframes for using tiledbsoma.io.. This is because the Pandas DataFrame objects we get as input data need to have been constructed following Pandas nullability conventions, and NumPy array-of-int has its own non-masked and masked versions.

For the path using tiledbsoma.Experiment.add_new_dataframe, where you bring your own Arrow schema and Arrow table, there of course the input data we get follows Arrow's nullability conventions.

Also, there's more work to do here -- which I'll track on this current issue -- even for things which are not bugs, involving: