Closed bkmartinjr closed 7 months ago
@bkmartinjr do you have thoughts on the right venue for API proposals? As a starting point I'd suggest you offer a sketch of cellxgene-census needs, and we can iterate from there ... either here, or in a GDoc, or maybe a call next week ...
All we need is public wrappers for a drop-in equivalent of pandas.Index()
and get_indexer()
(i.e., indexer factory and the get_indexer method). So I propose something super simple like (caution: typed in without running/testing):
from _index_util import build_index
class SOMATileDBIndexer:
def __init__(keys: np.typing.NDArray[np.int64], context: SOMATileDBContext):
self._index = build_index(keys, context)
def get_indexer(target: npt.NDArray[np.int64]) -> Any:
return self._index.get_indexer(target)
We could have this subclass somacore.query.types.IndexLike
if we want to expose that protocol,but I don't think it is entirely necessary (I defer to @thetorpedodog )
I don’t think there’s a need to make this its own class—build_index
works as a factory for an IndexLike
object. Exporting that function would suffice (which I think was one of the reasons it got put in a semi–public-facing namespace called utils.py
originally).
one of the reasons it got put in a semi–public-facing namespace
Ah...I completely misunderstood the intent! if we go with this approach, the only thing that would be helpful is access to the IndexLike type, but that is just a nice-to-have.
Any suggestions on naming? Are we OK with the fairly generic build_index
?
Naming I’m not sure about but I think specifying the return as IndexLike
(rather than clib.IntIndexer
) would be helpful in terms of just making the intent obvious. There’s nothing that has to change on the somacore side, since IndexLike
is already in a public namespace and doesn’t really need a top-level export (it’s not something end users will ever likely explicitly use directly; only maybe see as a parameter or return type).
Re: IndexLike - sounds fine.
Naming: propose tiledbsoma.build_index
as I can't think of anything better :-)
@beroy - suggest we just get this into a PR, add to the top-level namespace, and we can hash out the actual name in the PR disucssion.
Here's the PR based here: https://github.com/single-cell-data/TileDB-SOMA/pull/2080 The PR uses the factory based on the latest suggestion by @bkmartinjr. The only difference is that I added two factories at the top in the API: one takes a tiledb context and one takes thread count instead of tile db context (as I thought some users want to directly set the thread counts). LMK what you think and I can go with one if it makes more sense. And also I have no real preference between factory model vs class model.
The new indexer should be exposed via a public API, allowing downstream consumers (e.g., cellxgene-census package) to migrate from the Pandas indexer. Suggest that this be a
tiledbsoma
specific API, not a generic part of SOMA.