[python/r] Nullability tracking (parent)

johnkerl commented 1 month ago

Context

This is split out from #2822. #2822 had a couple questions: one was answered there conclusivel, and the other turns out to be multi-faceted. This issue tracks the second.

Also note nullability for all attribute/column types is well-supported in TileDB Core; bugs here are strictly at the TileDB-SOMA level.

Purpose

Characterize and isolate nullability-related issues within TileDB-SOMA.

Individual issues will be split out, prioritized, assigned, and scheduled.

Coverage matrix

What does "null" mean in source data:

Python None, pd.NA, math.nan
R NA
Also there is a one-off bug with "" -- this is not "null" in any sense, but, I'll track it here: #2859. That's labeled a Python PR but the issue may express itself at the R API as well; this needs to be validated.

Surfaces to check:

Are we creating TileDB core schemas with attribute-level nullable=True in all cases where we should
Are we doing the right thing with writing data values

Who writes, and with from what source formats:

Python tiledbsoma.Experiment.add_new_dataframe
- Arrow schema is provided by the caller
- Input data ia Arrow Table
- Issue #2869
- PR #2868
Python tiledbsoma.io.from_anndata/from_h5ad
- Arrow schema computed by TileDB-SOMA
- Input data is Pandas DataFrame
R SOMACollection$add_new_dataframe
- Arrow schema is provided by the caller
- Input data ia Arrow Table
- As far as I can tell so far we're good here when the input Arrow schema is set for nullability -- we carry that across:
- R writer: https://gist.github.com/johnkerl/21eee1ef55a9ad1ecac8a4aa3049e5f2
- R reader: https://gist.github.com/johnkerl/782b7fe2281ddc20a97b5132c0c1545a
- If the Arrow schema is not set to nullable, it looks like we convert NA to 0 on writes -- my hunch is this should throw on the write but we can discuss this -- in particular, on discussion with R users to find what the cultural expectation is in the R community.
R from_seurat
- Arrow schema computed by TileDB-SOMA
- Input data is R lists, vectors, factors, etc. (nominally within Seurat / SingleCellExperiment)

Column types:

String (not enumeration/categorical/factor)
- tiledbsoma.io #2861
- Arrow input (e.g. tiledbsoma.Experiment.add_new_dataframe) -- needs a separate issue
String enumeration/categorical/factor -- including the "" case
- with explicit levels provided in the input data -- see #2860, all appears to be well
- with levels inferred from the input data -- #2860
Boolean
- tiledbsoma.io.from_anndata/from_h5ad: #2857 -- this is handled correctly as described in #2858
- We could definitely use some notebook/example material here for users -- in the tiledbsoma.io.from_anndata case it's crucial that the user's AnnData object has nullable booleans expressed in the right way for Pandas before they hand it to us
Ints
- I'll verify this is as in #2858 but I'll check
Floats
- math.nan should probably stay math.nan (it is a floating-point value) -- although I believe TileDB Core uses math.nan for null-fill (I need to check) so this would be a moot point
- R's NA should probably map to TileDB Core floating-point null

References

Some notes on pyarrow, pandas, and R handling of nulls in categoricals/factors: https://gist.github.com/johnkerl/3e8552ddfc437a52b584ac024313474a
Python readback: https://gist.github.com/johnkerl/8c8020271bec82ff341772e1a26c1606
R readback: https://gist.github.com/johnkerl/782b7fe2281ddc20a97b5132c0c1545a

mdylan2 commented 1 month ago

One question related to the case where the Arrow schema is provided. How would one specify nullable=True for an attribute?

It doesn't seem like the Python API implementation for PlatformConfig supports nullable: https://github.com/single-cell-data/TileDB-SOMA/tree/main/apis/python#platform_config-format.

Repro script:

import tiledbsoma as soma
import tiledb
import os
import shutil
import pyarrow as pa

obs_schema = pa.schema([("soma_joinid", pa.int64()), ("barcode", pa.large_string()) ])
platform_config = {
    "tiledb": {
        "create": {
            "attrs": {
                "barcode": {
                    "filters": [{"_type": "ZstdFilter", "level": 9}],
                    "nullable": True # also tried string "true"                }
            }
        }
    }
}
exp_path = "./test"
exp = soma.Experiment.create(exp_path)
exp.add_new_dataframe(
    "obs",
    schema=obs_schema,
    index_column_names=["soma_joinid"],
    platform_config=platform_config
)
exp.close()

with tiledb.open(os.path.join(exp_path, "obs")) as arr:
    print(arr.schema)

shutil.rmtree(exp_path)

prints

ArraySchema(
  domain=Domain(*[
    Dim(name='soma_joinid', domain=(0, 2147483646), tile=2048, dtype='int64', filters=FilterList([ZstdFilter(level=3), ])),
  ]),
  attrs=[
    Attr(name='barcode', dtype='<U0', var=True, nullable=False, enum_label=None, filters=FilterList([ZstdFilter(level=9), ])),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=100000,
  sparse=True,
  allows_duplicates=False,
)

johnkerl commented 1 month ago

Hi @mdylan2 ! Great question! At the moment I'm typing up issues for the Python tiledbsoma.io cases. I don't have a write-up yet for the add_new_dataframe / "bring your own Arrow schema" case, but I do know there is something extra going on there. I've verified that (in a certain case) the core schema created does have the nullability flag set, and I believe there's just an intermediate miswiring. I'll be sure to directly address your case in an upcoming child issue umbrellaed under this parent task. And if we need to surface nullability at the PlatformConfig level, we'll have docs on that as well. And also we need clearer guidance to users on setting nullability at the Arrow-schema level.

johnkerl commented 1 month ago

A bit more info @mdylan2 : re https://gist.github.com/johnkerl/3a7473dc24974bcc47f7b8257a19bbdb

This is in Python
Although I haven't gisted it, I've verified the same in R using R's arrow package
It turns out in Python & R, fields are nullable by default -- you have to explicitly say nullable=False (and ☝️ you haven't)

So this is a bug of ours -- I'll isolate it -- thanks again for the repro script!

johnkerl commented 1 month ago

@mdylan2 found it -- this will be a quick fix -- more tomorrow!

johnkerl commented 1 month ago

@mdylan2 the issue is #2869 with PR #2868. This fix will go out with TileDB-SOMA 1.13.1 (if we do one) or else 1.14.0.

I've now established that the workaround for now is to set metadata like

pa.schema(
    [
        pa.field("x", pa.int32()),
    ],
    metadata={
        "x": "nullable",
    }
)

Please let me know if this resolves everything for you in your add_new_dataframe use-case.

mdylan2 commented 1 month ago

That worked, thank you @johnkerl!

johnkerl commented 1 month ago

@mdylan2 -- update at https://github.com/single-cell-data/TileDB-SOMA/issues/2857#issuecomment-2288980855 -- regarding how to set up nullable booleans (same goes for ints/floats too I believe, & will test explicitly) -- at the point in time when you set up your obs/var Pandas dataframes for using tiledbsoma.io.. This is because the Pandas DataFrame objects we get as input data need to have been constructed following Pandas nullability conventions, and NumPy array-of-int has its own non-masked and masked versions.

For the path using tiledbsoma.Experiment.add_new_dataframe, where you bring your own Arrow schema and Arrow table, there of course the input data we get follows Arrow's nullability conventions.

Also, there's more work to do here -- which I'll track on this current issue -- even for things which are not bugs, involving:

More unit-test coverage for things which we are doing correctly
More document/notebook example material around nullability-handling for TileDB-SOMA's Python and R APIs

single-cell-data / TileDB-SOMA