single-cell-data / TileDB-SOMA

Python and R SOMA APIs using TileDB’s cloud-native format. Ideal for single-cell data at any scale.
https://tiledbsoma.readthedocs.io
MIT License
90 stars 25 forks source link

[python] Add public API for new reindexer #2067

Closed bkmartinjr closed 7 months ago

bkmartinjr commented 8 months ago

The new indexer should be exposed via a public API, allowing downstream consumers (e.g., cellxgene-census package) to migrate from the Pandas indexer. Suggest that this be a tiledbsoma specific API, not a generic part of SOMA.

johnkerl commented 8 months ago

@bkmartinjr do you have thoughts on the right venue for API proposals? As a starting point I'd suggest you offer a sketch of cellxgene-census needs, and we can iterate from there ... either here, or in a GDoc, or maybe a call next week ...

bkmartinjr commented 8 months ago

All we need is public wrappers for a drop-in equivalent of pandas.Index() and get_indexer() (i.e., indexer factory and the get_indexer method). So I propose something super simple like (caution: typed in without running/testing):

from _index_util import build_index

class SOMATileDBIndexer:
    def __init__(keys: np.typing.NDArray[np.int64], context: SOMATileDBContext):
        self._index = build_index(keys, context)

    def get_indexer(target: npt.NDArray[np.int64]) -> Any:
        return self._index.get_indexer(target)

We could have this subclass somacore.query.types.IndexLike if we want to expose that protocol,but I don't think it is entirely necessary (I defer to @thetorpedodog )

thetorpedodog commented 8 months ago

I don’t think there’s a need to make this its own class—build_index works as a factory for an IndexLike object. Exporting that function would suffice (which I think was one of the reasons it got put in a semi–public-facing namespace called utils.py originally).

bkmartinjr commented 8 months ago

one of the reasons it got put in a semi–public-facing namespace

Ah...I completely misunderstood the intent! if we go with this approach, the only thing that would be helpful is access to the IndexLike type, but that is just a nice-to-have.

Any suggestions on naming? Are we OK with the fairly generic build_index?

thetorpedodog commented 8 months ago

Naming I’m not sure about but I think specifying the return as IndexLike (rather than clib.IntIndexer) would be helpful in terms of just making the intent obvious. There’s nothing that has to change on the somacore side, since IndexLike is already in a public namespace and doesn’t really need a top-level export (it’s not something end users will ever likely explicitly use directly; only maybe see as a parameter or return type).

bkmartinjr commented 8 months ago

Re: IndexLike - sounds fine.

Naming: propose tiledbsoma.build_index as I can't think of anything better :-)

@beroy - suggest we just get this into a PR, add to the top-level namespace, and we can hash out the actual name in the PR disucssion.

beroy commented 8 months ago

Here's the PR based here: https://github.com/single-cell-data/TileDB-SOMA/pull/2080 The PR uses the factory based on the latest suggestion by @bkmartinjr. The only difference is that I added two factories at the top in the API: one takes a tiledb context and one takes thread count instead of tile db context (as I thought some users want to directly set the thread counts). LMK what you think and I can go with one if it makes more sense. And also I have no real preference between factory model vs class model.

beroy commented 7 months ago

https://github.com/single-cell-data/TileDB-SOMA/pull/2080