single-cell-data / TileDB-SOMA

Python and R SOMA APIs using TileDB’s cloud-native format. Ideal for single-cell data at any scale.
https://tiledbsoma.readthedocs.io
MIT License

[c++] libtiledbsoma may need a filtering layer #2709

Closed eddelbuettel closed 2 months ago

eddelbuettel commented 3 months ago

Is your feature request related to a problem? Please describe.

We needed to adjust one of the interop tests yesterday because the new C++-based schema creation and writing can miss an 'automagic' cast we get otherwise.

This is because schema creation and writes can be separate steps. The schema clearly defines the layout, but the write can be more ad hoc, as it was in this test. A data.frame was created, and integer values were passed, as is commonly done, via an expression such as c(10, 20, 30, 42). But to R these are numeric, aka double, types. They are commonly cast internally, but in this case the column was (per the schema) an int one, yet the values, obtained via arrow::as_table(dataframeobject), came with a new schema inferred ad hoc (just for this data.frame-to-arrow conversion) from the payload. So that column became double.

We could request that users do what we did in the test: as.integer(c(10, 20, 30, 42)). But that may not be realistic. R users just don't expect to have to do this. Our signature just says 'arrow table' so it can well be an ad-hoc conversion.

Describe the solution you'd like

The C++ layer may need to inject a casting step.
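
To make the idea concrete, here is a minimal sketch of what such a casting step could look like for the case above: a column that arrives as double while the schema declares int32. The function name `cast_double_to_int32` is hypothetical and not part of libtiledbsoma. It uses only the standard library and plain vectors, whereas a real implementation would presumably operate on the Arrow buffers handed to the writer.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>
#include <optional>
#include <vector>

// Hypothetical helper (an assumption for illustration, not libtiledbsoma API):
// convert a column that arrived as double into the int32 type declared by
// the schema. Returns std::nullopt if any value cannot be represented
// exactly, so the caller can raise an error instead of writing altered data.
std::optional<std::vector<int32_t>> cast_double_to_int32(
    const std::vector<double>& values) {
    constexpr double lo = std::numeric_limits<int32_t>::min();
    constexpr double hi = std::numeric_limits<int32_t>::max();

    std::vector<int32_t> out;
    out.reserve(values.size());
    for (double v : values) {
        // Reject NaN/Inf, fractional values, and values outside int32 range.
        if (!std::isfinite(v) || std::trunc(v) != v || v < lo || v > hi) {
            return std::nullopt;
        }
        out.push_back(static_cast<int32_t>(v));
    }
    // Postcondition: exactly one output element per input element.
    assert(out.size() == values.size());
    return out;
}
```

With this sketch, the values from c(10, 20, 30, 42) would convert cleanly, while a payload containing 1.5 or 3e9 would be rejected rather than silently truncated. Whether a lossy payload should be an error or a warning is a design choice this sketch does not settle.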

Describe alternatives you've considered

Forcing users to be more explicit. Doable ... but maybe not realistic / user-friendly?

Additional context

See this commit.