Open johnkerl opened 1 month ago
One question related to the case where the Arrow schema is provided: how would one specify nullable=True for an attribute?
It doesn't seem like the Python API implementation for PlatformConfig supports nullable: https://github.com/single-cell-data/TileDB-SOMA/tree/main/apis/python#platform_config-format.
Repro script:
import os
import shutil

import pyarrow as pa
import tiledb
import tiledbsoma as soma

obs_schema = pa.schema([("soma_joinid", pa.int64()), ("barcode", pa.large_string())])

platform_config = {
    "tiledb": {
        "create": {
            "attrs": {
                "barcode": {
                    "filters": [{"_type": "ZstdFilter", "level": 9}],
                    "nullable": True,  # also tried string "true"
                }
            }
        }
    }
}

exp_path = "./test"
exp = soma.Experiment.create(exp_path)
exp.add_new_dataframe(
    "obs",
    schema=obs_schema,
    index_column_names=["soma_joinid"],
    platform_config=platform_config,
)
exp.close()

# Inspect the TileDB array schema that was actually created.
with tiledb.open(os.path.join(exp_path, "obs")) as arr:
    print(arr.schema)

shutil.rmtree(exp_path)
prints
ArraySchema(
  domain=Domain(*[
    Dim(name='soma_joinid', domain=(0, 2147483646), tile=2048, dtype='int64', filters=FilterList([ZstdFilter(level=3), ])),
  ]),
  attrs=[
    Attr(name='barcode', dtype='<U0', var=True, nullable=False, enum_label=None, filters=FilterList([ZstdFilter(level=9), ])),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=100000,
  sparse=True,
  allows_duplicates=False,
)
Hi @mdylan2 ! Great question! At the moment I'm typing up issues for the Python tiledbsoma.io cases. I don't have a write-up yet for the add_new_dataframe / "bring your own Arrow schema" case, but I do know there is something extra going on there. I've verified that (in a certain case) the core schema created does have the nullability flag set, and I believe there's just an intermediate miswiring. I'll be sure to directly address your case in an upcoming child issue umbrellaed under this parent task. If we need to surface nullability at the PlatformConfig level, we'll have docs on that as well. We also need clearer guidance to users on setting nullability at the Arrow-schema level.
A bit more info @mdylan2 : re https://gist.github.com/johnkerl/3a7473dc24974bcc47f7b8257a19bbdb
In the arrow package, fields are nullable by default unless you explicitly set nullable=False (and ☝️ you haven't).
So this is a bug of ours -- I'll isolate it -- thanks again for the repro script!
@mdylan2 found it -- this will be a quick fix -- more tomorrow!
@mdylan2 the issue is #2869 with PR #2868. This fix will go out with TileDB-SOMA 1.13.1 (if we do one) or else 1.14.0.
I've now established that the workaround for now is to set metadata like
pa.schema(
    [
        pa.field("x", pa.int32()),
    ],
    metadata={
        "x": "nullable",
    },
)
Please let me know if this resolves everything for you in your add_new_dataframe use-case.
That worked, thank you @johnkerl!
@mdylan2 -- update at https://github.com/single-cell-data/TileDB-SOMA/issues/2857#issuecomment-2288980855 -- regarding how to set up nullable booleans (same goes for ints/floats too I believe, & will test explicitly) -- at the point in time when you set up your obs/var Pandas dataframes for using tiledbsoma.io. This is because the Pandas DataFrame objects we get as input data need to have been constructed following Pandas nullability conventions, and NumPy array-of-int has its own non-masked and masked versions.
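As a rough illustration of what "Pandas nullability conventions" means here, the sketch below contrasts default dtype inference with Pandas' nullable extension dtypes. It uses only pandas and says nothing about how tiledbsoma.io maps these dtypes; the linked comment is the reference for that. The variable names are made up for this example.

```python
import pandas as pd

# Without an explicit dtype, a missing value changes what Pandas infers:
# booleans fall back to object dtype, and ints are upcast to float64 with NaN.
object_bools = pd.Series([True, None, False])
float_fallback = pd.Series([1, None, 3])

# With the nullable extension dtypes, the missing entry is stored as pd.NA
# and the column keeps its logical type.
nullable_bools = pd.Series([True, None, False], dtype="boolean")
nullable_ints = pd.Series([1, None, 3], dtype="Int64")

dtypes = {
    "object_bools": str(object_bools.dtype),
    "float_fallback": str(float_fallback.dtype),
    "nullable_bools": str(nullable_bools.dtype),
    "nullable_ints": str(nullable_ints.dtype),
}
print(dtypes)
print(nullable_bools.isna().tolist())
```

In an obs or var dataframe, a column such as nullable_bools is the kind of "nullable boolean expressed the Pandas way" the comment refers to.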
For the path using tiledbsoma.Experiment.add_new_dataframe, where you bring your own Arrow schema and Arrow table, the input data we get of course follows Arrow's nullability conventions.
Also, there's more work to do here -- which I'll track on this current issue -- even for things which are not bugs, involving:
Context
This is split out from #2822. #2822 had a couple of questions: one was answered there conclusively, and the other turns out to be multi-faceted. This issue tracks the second.
Also note nullability for all attribute/column types is well-supported in TileDB Core; bugs here are strictly at the TileDB-SOMA level.
Purpose
Characterize and isolate nullability-related issues within TileDB-SOMA.
Individual issues will be split out, prioritized, assigned, and scheduled.
Coverage matrix
What does "null" mean in source data:
- None, pd.NA, math.nan
- NA
- "" -- this is not "null" in any sense, but I'll track it here: #2859. That's labeled a Python PR but the issue may express itself at the R API as well; this needs to be validated.
Surfaces to check:
- nullable=True in all cases where we should
Who writes, and from what source formats:
- tiledbsoma.Experiment.add_new_dataframe
- tiledbsoma.io.from_anndata/from_h5ad
- SOMACollection$add_new_dataframe
- NA to 0 on writes -- my hunch is this should throw on the write, but we can discuss this -- in particular, with R users, to find what the cultural expectation is in the R community.
- from_seurat
Column types:
- tiledbsoma.io #2861
- tiledbsoma.Experiment.add_new_dataframe) -- needs a separate issue
- "" case
- tiledbsoma.io.from_anndata/from_h5ad: #2857 -- this is handled correctly as described in #2858
- tiledbsoma.io.from_anndata case it's crucial that the user's AnnData object has nullable booleans expressed in the right way for Pandas before they hand it to us
- math.nan should probably stay math.nan (it is a floating-point value) -- although I believe TileDB Core uses math.nan for null-fill (I need to check) so this would be a moot point
- NA should probably map to TileDB Core floating-point null
References