single-cell-data / TileDB-SOMA

Python and R SOMA APIs using TileDB’s cloud-native format. Ideal for single-cell data at any scale.
https://tiledbsoma.readthedocs.io
MIT License

[python] Utilize Arrow schema `pa.field` nullabilities in `DataFrame.create` #2869

Closed johnkerl closed 2 months ago

johnkerl commented 2 months ago

Context

Split out from #2858.

This issue is for TileDB-SOMA Python only. The R situation will be triaged separately, and tasked out (from #2858) if necessary.

Tracking

In tiledbsoma.DataFrame.create, and likewise tiledbsoma.Experiment.add_new_dataframe, the user brings their own Arrow schema. Our task is to respect that as much as possible, and translate that into a TileDB core schema. One of the things to be mapped across is attribute-level nullability.

Unfortunately, there are two different ways for attributes to be marked nullable:

(1) Flags on the attribute
(2) Metadata for the attribute

Python example:

import pyarrow as pa

schema1 = pa.schema(
    [
        pa.field("a", pa.int32()),
        pa.field("b", pa.int32(), nullable=False),
        pa.field("c", pa.int32(), nullable=True),
    ]
)
print("SCHEMA1")
print(schema1)

schema2 = pa.schema(
    [
        pa.field("d", pa.int32()),
        pa.field("e", pa.int32(), nullable=False),
        pa.field("f", pa.int32(), nullable=True),
    ],
    metadata={"d": "nullable", "e": "nullable", "f": "nullable"}
)
print()
print("SCHEMA2")
print(schema2)

Output:

$ python arrow-schema-examples.py
SCHEMA1
a: int32
b: int32 not null
c: int32

SCHEMA2
d: int32
e: int32 not null
f: int32
-- schema metadata --
d: 'nullable'
e: 'nullable'
f: 'nullable'

Note that pa.field defaults to nullable: here, fields a and c are both nullable; only b is not. This is indicated by b: int32 not null.

Bug

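In our current implementation we make attributes nullable only if the metadata option is set. The nullable flag on pa.field is not consulted, so a field declared with pa.field's default nullability and no metadata entry ends up non-nullable.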

Fix