[python] `SparseNDArray` bounding box inconsistent -- shape vs. used coordinates

single-cell-data / TileDB-SOMA

Python and R SOMA APIs using TileDB’s cloud-native format. Ideal for single-cell data at any scale.

MIT License

92 stars 25 forks source link

The current python "soma bounding box" for sparse nd array is inconsistent in its semantics in the write path:

for Tensor objects, it utilizes the shape of the tensor to compute the new bounding box. From the code:
```
nr, nc = values.shape
bounding_box = self._compute_bounding_box_metadata([nr - 1, nc - 1])
```
for Table objects, it scans for the max coordinate, and writes the 'used' bounding box. From the code:
```
maxes.append(pacomp.max(coords).as_py())
bounding_box = self._compute_bounding_box_metadata(maxes)
```

This will have different results when the tensor is sparse (empty) in the max coordinates. Code that will otherwise work fine breaks the bounding box -- for example, these two writes produce entirely different results:

m = sparse.coo_matrix(([1,2], ([0,1],[0,1])), shape=(10,10))

# writes bounding box of (0,10),(0,10)
A.write(pa.SparseCOOTensor.from_scipy(m))

# writes bounding box of (0,1),(0,1)
A.write(pa.Table.from_pydict({'soma_data': m.data, 'soma_dim_0': m.row, 'soma_dim_1': m.col})

It is not entirely clear which is correct, and depends on downstream uses. For example if the expectation is that a sparse matrix, with the last few rows set to zero, can be saved and reconstituted with the same shape, this difference will be an issue.

If the semantics are "used coordinates", as implemented in the Table code branch, we would be better off using the underlying nonempty_domain implementation (resolving bug #1969). If the semantics intend shape, then the Table code path needs some means of determining the caller's intended shape.

CC: @mojaveazure @johnkerl

@bkmartinjr thank for filing this.

The semantics of bounding-box we hammered out in several spec meetings and a design doc was definitely mean for intended shape. There were three:

non-empty domain which is the actually used -- e.g. 100x200 but rows 0-3 and 98-99 are completely unoccupied;
bounding box would track the numbers 100 and 200 regardless of whether first/last rows/columns of a sparse array were entirely unoccupied or not;
"shape" as available domain with room for growth later -- maybe billions in each dimension

So if we want to keep bounding box as intended shape we need to provide a way to get those numbers like 100 and 200 actually wired all the way from the user data to the on-disk bounding-box metadata.

And if we want to jettison the bounding box entirely then we just need to use non-empty domain for the performance issue raised in #1933.

Personally I can go either way. But the latter wins in a tie-breaker, for two reasons:

Parsimony -- all else being equal, I'd rather have two shapish accessors than three
Correctness -- the bounding box has correctness issues with simultaneous open-open-write-write-close-close as tracked in #1969

And I do mean "personally" -- I am not a compbio person and will happily defer to whatever a consensus of the compbio people wants.

single-cell-data / TileDB-SOMA

[python] `SparseNDArray` bounding box inconsistent -- shape vs. used coordinates #1971