single-cell-data / TileDB-SOMA

Python and R SOMA APIs using TileDB’s cloud-native format. Ideal for single-cell data at any scale.
MIT License

R and Python create groups with incompatible SOMA metadata #2698

Open bkmartinjr opened 1 month ago

bkmartinjr commented 1 month ago

Objects created by R or Python bindings should have identical metadata, but currently the R and Python packages tag SOMA objects with different and incompatible metadata tags for dataset_type, soma_encoding_version and soma_object_type.

Using TileDB-Py to inspect two arrays.

When the array is created by Python (array info):

Out[25]: {'dataset_type': 'soma', 'soma_encoding_version': '1', 'soma_object_type': 'SOMAExperiment'}

When the array is created by R (array info):

Out[23]: {'dataset_type': b'soma', 'soma_encoding_version': b'1', 'soma_object_type': b'SOMAExperiment'}

I also checked directly reading from S3, i.e., not using the tiledb:// URI, and the result is the same.
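
To make the practical impact concrete, here is a minimal sketch of what the difference means for code that reads these tags. It uses plain dict literals copied from the two outputs above rather than any TileDB call, and the names `python_written` and `r_written` are only illustrative.

```python
# Metadata values copied from the two outputs above.
# Plain dicts stand in for the real array metadata (an assumption for illustration).
python_written = {
    "dataset_type": "soma",
    "soma_encoding_version": "1",
    "soma_object_type": "SOMAExperiment",
}
r_written = {
    "dataset_type": b"soma",
    "soma_encoding_version": b"1",
    "soma_object_type": b"SOMAExperiment",
}

# In Python 3, str and bytes never compare equal, so every tag mismatches
# even though the underlying text is the same.
mismatched = sorted(k for k in python_written if python_written[k] != r_written[k])
print(mismatched)

# A naive type check written against Python-created data fails on R-created data.
naive_check = r_written["soma_object_type"] == "SOMAExperiment"
print(naive_check)
```

Any reader that compares a tag against a string literal will therefore treat an R-written object as something other than a SOMA object unless it decodes first.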

Whether a "string" or "byte array" is right, I think it is reasonably clear that there is a bug here - the mandatory metadata tags should be identical no matter which ingestion system is used, and which package is used to read it back.

Side note: the current Python package seems to have a work-around for this, as it detects byte array metadata and decodes it as utf-8. This is nice, but doesn't seem like the right answer, as it requires any other user of that metadata (e.g., end-user code) to do the same decoding step for any/all metadata values.
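
For reference, the kind of decoding step that end-user code would have to carry might look roughly like the sketch below. This is not the tiledbsoma implementation; `normalize_metadata` is a hypothetical helper name, and it assumes byte values are valid UTF-8.

```python
from typing import Any, Dict, Mapping


def normalize_metadata(meta: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a copy of `meta` with bytes values decoded as UTF-8.

    Hypothetical helper for illustration only; non-bytes values
    (str, int, float, ...) are passed through unchanged.
    """
    return {
        key: value.decode("utf-8") if isinstance(value, bytes) else value
        for key, value in meta.items()
    }


# Metadata as reported for the R-created array.
r_written = {
    "dataset_type": b"soma",
    "soma_encoding_version": b"1",
    "soma_object_type": b"SOMAExperiment",
}

normalized = normalize_metadata(r_written)
print(normalized)
```

Every consumer of the metadata would need an equivalent of this wrapper, which is the burden described above.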

In my opinion, we should be using utf-8 everywhere (and document that in the SOMA spec), but at a minimum, we should have common behavior across all reader/writer code.

tiledbsoma.__version__              1.11.4
TileDB-Py version                   0.29.0
TileDB core version (tiledb)        2.23.0
TileDB core version (libtiledbsoma) 2.23.0
python version                      3.11.9.final.0
OS version                          Linux 6.8.0-76060800daily20240311-generic