single-cell-data / TileDB-SOMA

Python and R SOMA APIs using TileDB’s cloud-native format. Ideal for single-cell data at any scale.
https://tiledbsoma.readthedocs.io
MIT License
92 stars 25 forks source link

[python] `SparseNDArray` bounding box inconsistent -- shape vs. used coordinates #1971

Open bkmartinjr opened 11 months ago

bkmartinjr commented 11 months ago

The current python "soma bounding box" for sparse nd array is inconsistent in its semantics in the write path:

This will have different results when the tensor is sparse (empty) in the max coordinates. Code that will otherwise work fine breaks the bounding box -- for example, these two writes produce entirely different results:

m = sparse.coo_matrix(([1,2], ([0,1],[0,1])), shape=(10,10))

# writes bounding box of (0,10),(0,10)
A.write(pa.SparseCOOTensor.from_scipy(m))

# writes bounding box of (0,1),(0,1)
A.write(pa.Table.from_pydict({'soma_data': m.data, 'soma_dim_0': m.row, 'soma_dim_1': m.col})

It is not entirely clear which is correct, and depends on downstream uses. For example if the expectation is that a sparse matrix, with the last few rows set to zero, can be saved and reconstituted with the same shape, this difference will be an issue.

If the semantics are "used coordinates", as implemented in the Table code branch, we would be better off using the underlying nonempty_domain implementation (resolving bug #1969). If the semantics intend shape, then the Table code path needs some means of determining the caller's intended shape.

CC: @mojaveazure @johnkerl

johnkerl commented 11 months ago

@bkmartinjr thank for filing this.

The semantics of bounding-box we hammered out in several spec meetings and a design doc was definitely mean for intended shape. There were three:

So if we want to keep bounding box as intended shape we need to provide a way to get those numbers like 100 and 200 actually wired all the way from the user data to the on-disk bounding-box metadata.

And if we want to jettison the bounding box entirely then we just need to use non-empty domain for the performance issue raised in #1933.

Personally I can go either way. But the latter wins in a tie-breaker, for two reasons:

And I do mean "personally" -- I am not a compbio person and will happily defer to whatever a consensus of the compbio people wants.

johnkerl commented 7 months ago

Pending #2407