Objects created by the R or Python bindings should have identical metadata, but currently the R and Python packages write different and incompatible values for the dataset_type, soma_encoding_version and soma_object_type metadata tags on SOMA objects.
The Python package creates objects with Unicode strings (e.g., "dataset_type": "soma").
The R package creates objects with byte arrays (e.g., "dataset_type": b"soma").
I also checked directly reading from S3, i.e., not using the tiledb:// URI, and the result is the same.
Whether a "string" or a "byte array" is right, I think it is reasonably clear that there is a bug here - the mandatory metadata tags should be identical no matter which ingestion system is used, and which package is used to read them back.
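To make the mismatch concrete, here is a minimal sketch that compares the mandatory tags of two metadata mappings. Plain dictionaries stand in for the metadata read back from each array, and the helper name `mismatched_tags` is hypothetical (it is not part of tiledbsoma or TileDB-Py). Only the `dataset_type` values come from the report above.

```python
# Mandatory SOMA metadata tags named in this report.
MANDATORY_TAGS = ("dataset_type", "soma_encoding_version", "soma_object_type")


def mismatched_tags(meta_a, meta_b, tags=MANDATORY_TAGS):
    """Return {tag: (value_a, value_b)} for tags that differ in type or value."""
    mismatches = {}
    for tag in tags:
        a, b = meta_a.get(tag), meta_b.get(tag)
        # Check the type first: "soma" (str) and b"soma" (bytes) are never equal.
        if type(a) is not type(b) or a != b:
            mismatches[tag] = (a, b)
    return mismatches


# Stand-ins for the metadata of a Python-written and an R-written array.
python_written = {"dataset_type": "soma"}
r_written = {"dataset_type": b"soma"}

print(mismatched_tags(python_written, r_written, tags=("dataset_type",)))
# {'dataset_type': ('soma', b'soma')}
```

A check along these lines could serve as a cross-language regression test once the expected encoding is settled.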
Side note: the current Python package seems to have a work-around for this, as it detects byte array metadata and converts it to utf-8. This is nice, but doesn't seem like the right answer, as it requires any other user of that metadata (e.g., end-user code) to do the same encoding/decoding step for any/all metadata values.
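For illustration, this is roughly the decoding step every consumer of the metadata would have to repeat. It is a sketch only, not the actual tiledbsoma work-around; the function names are hypothetical, and it assumes the byte values are valid UTF-8.

```python
def normalize_metadata_value(value):
    """Decode bytes-like metadata values to str; leave other types untouched."""
    if isinstance(value, (bytes, bytearray)):
        # Assumption: byte-array metadata holds UTF-8 encoded text.
        return bytes(value).decode("utf-8")
    return value


def normalize_metadata(meta):
    """Return a copy of a metadata mapping with bytes values decoded to str."""
    return {key: normalize_metadata_value(value) for key, value in meta.items()}


# An R-written tag becomes comparable to a Python-written one after decoding.
print(normalize_metadata({"dataset_type": b"soma"}))
# {'dataset_type': 'soma'}
```

Having to call something like this at every read site is exactly the burden that writing UTF-8 strings consistently would remove.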
In my opinion, we should be using utf-8 everywhere (and document that in the SOMA spec), but at a minimum, we should have common behavior across all reader/writer code.
tiledbsoma.__version__ 1.11.4
TileDB-Py version 0.29.0
TileDB core version (tiledb) 2.23.0
TileDB core version (libtiledbsoma) 2.23.0
python version 3.11.9.final.0
OS version Linux 6.8.0-76060800daily20240311-generic
Using TileDB-Py to inspect two arrays.
When the array was created by Python (array info):
When the array was created by R (array info):