single-cell-data / TileDB-SOMA

Python and R SOMA APIs using TileDB’s cloud-native format. Ideal for single-cell data at any scale.
https://tiledbsoma.readthedocs.io
MIT License

Tracking issue for adversarial-stride queries #1743

Open johnkerl opened 11 months ago

johnkerl commented 11 months ago

Use-case:

Tracks [sc-34843]

mlin commented 11 months ago

Thanks @johnkerl

Just providing another concrete example, this one on the var axis:

https://github.com/chanzuckerberg/cellxgene-census/blob/1ccf088bb40670ee42f67dfa9539b04436d6a544/api/python/cellxgene_census/src/cellxgene_census/experimental/ml/huggingface/geneformer_tokenizer.py#L81-L88

(This class inherits from ExperimentAxisQuery.) It's interested in approximately 20K of the 60K genes, specifically protein-coding and miRNA genes, selected by soma_joinid. We found the query is about 25% faster with the var_query coords= commented out, that is, when retrieving the full expression vectors for all genes and then selecting from them. (That is to say: appreciably faster, but not by an order of magnitude or anything like that.)
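To make the "read everything, then select" workaround concrete, here is a minimal pure-Python sketch. It does not call tiledbsoma at all: `full_read` is a hypothetical stand-in for the sparse (obs, var, value) triples an unrestricted read of X would return, and `select_genes_after_read` is an illustrative helper name, not part of any API.

```python
def select_genes_after_read(triples, wanted_var_ids):
    """Keep only (obs_id, var_id, value) triples whose var_id is wanted.

    This mimics dropping the var-axis coords= from the query and
    filtering client-side after the full read.
    """
    wanted = set(wanted_var_ids)  # set membership keeps the filter O(1) per triple
    return [t for t in triples if t[1] in wanted]


# Hypothetical toy data: 3 cells x 6 genes, stored as sparse triples.
full_read = [
    (0, 0, 1.0), (0, 3, 2.0),
    (1, 1, 5.0), (1, 4, 1.0),
    (2, 3, 7.0), (2, 5, 3.0),
]

# Select genes 0, 3 and 4 after the fact instead of in the query.
selected = select_genes_after_read(full_read, wanted_var_ids=[0, 3, 4])
print(selected)
```

The trade-off is that the full read moves more data, so whether it wins presumably depends on how much per-coordinate overhead the restricted query carries.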

Re "adversarial stride" -- I'm not sure how we set up the var soma_joinids and whether there's any id locality in the set of genes selected. There's probably some locality but OTOH I wouldn't be surprised if there's effectively at least one hit in each tile -- so it might be adversarial with an innocent biological basis =) Anyway, we can easily understand why the query would not be faster with the coords=, but it's curious that it's appreciably slower.
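A rough way to check the "at least one hit in each tile" hunch is to count how many tiles a set of joinids touches. The sketch below is pure Python and makes loud assumptions: the tile extent of 2,048 is a made-up number rather than the array's real setting, and the two id sets are synthetic (an evenly strided pick versus a contiguous block), not the actual Geneformer gene list.

```python
def fraction_of_tiles_hit(joinids, n_ids, tile_extent):
    """Fraction of tiles along one dimension that contain at least one requested id."""
    n_tiles = -(-n_ids // tile_extent)  # ceiling division
    tiles_hit = {j // tile_extent for j in joinids}
    return len(tiles_hit) / n_tiles


N_GENES = 60_000      # roughly the var-axis size mentioned above
TILE_EXTENT = 2_048   # ASSUMPTION: hypothetical tile extent, not the real config

strided = range(0, N_GENES, 3)   # 20,000 ids spread evenly: the adversarial case
contiguous = range(0, 20_000)    # 20,000 ids packed together: the friendly case

strided_fraction = fraction_of_tiles_hit(strided, N_GENES, TILE_EXTENT)
contiguous_fraction = fraction_of_tiles_hit(contiguous, N_GENES, TILE_EXTENT)

print(f"strided:    {strided_fraction:.2f} of tiles touched")
print(f"contiguous: {contiguous_fraction:.2f} of tiles touched")
```

Under these assumptions the strided selection touches every tile, so restricting by coords= would save no tile reads, while the contiguous one touches about a third of them. Running the same count on the real soma_joinids and the real tile extent would show which regime the query is in.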

cc @pablo-gar @bkmartinjr