single-cell-data / TileDB-SOMA

Python and R SOMA APIs using TileDB’s cloud-native format. Ideal for single-cell data at any scale.
MIT License

[Feature request] `ExperimentAxisQuery.to_anndata()` should drop unused categories on axis data frames #2765

Open pablo-gar opened 1 week ago

pablo-gar commented 1 week ago

Is your feature request related to a problem? Please describe. Some analytical pipelines, especially those that produce visualizations, rely on the categories of pandas.Categorical columns. With large SOMAExperiments, a query often leaves unused categories in potentially many columns of obs or var, so the user always has to iterate over all columns and call cat.remove_unused_categories() on each one.

For example, see this reproducible example:

import cellxgene_census
import scanpy as sc
import tiledbsoma  # needed for tiledbsoma.AxisQuery below

census = cellxgene_census.open_soma(census_version="2024-05-20")

human = census["census_data"]["homo_sapiens"]
# Select only primary-data cells from a single tissue.
query = human.axis_query(
    measurement_name="RNA",
    obs_query=tiledbsoma.AxisQuery(
        value_filter="tissue == 'tongue' and is_primary_data == True"
    ),
)

# Request only the "tissue" column of obs.
adata = query.to_anndata(column_names={"obs": ["tissue"]}, X_name="normalized")
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color="tissue")

Only one tissue was selected, yet all of the hundreds of tissue categories still show up in the UMAP plot.

image
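The manual workaround mentioned above (iterating over every column and calling cat.remove_unused_categories()) can be sketched roughly as follows. This is only an illustration using pandas: the helper name drop_unused_categories and the toy obs frame are assumptions made for this example and are not part of tiledbsoma or cellxgene_census.

```python
import pandas as pd


def drop_unused_categories(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with unused categories removed from every categorical column.

    Hypothetical helper for illustration; not part of any SOMA API.
    """
    out = df.copy()
    for name in out.columns:
        col = out[name]
        # Only categorical columns carry a category list to prune.
        if isinstance(col.dtype, pd.CategoricalDtype):
            out[name] = col.cat.remove_unused_categories()
    return out


# Toy stand-in for adata.obs after a query that kept only 'tongue' cells:
# the categorical still remembers tissues that no longer occur in the data.
obs = pd.DataFrame(
    {
        "tissue": pd.Categorical(
            ["tongue", "tongue"], categories=["blood", "lung", "tongue"]
        ),
        "n_genes": [1200, 950],
    }
)

cleaned = drop_unused_categories(obs)
print(list(cleaned["tissue"].cat.categories))  # ['tongue']
```

In the reproducible example, the same helper would presumably be applied as adata.obs = drop_unused_categories(adata.obs) (and likewise for adata.var) before plotting, which is the per-query step this request asks to have done automatically.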

Describe the solution you'd like. ExperimentAxisQuery.to_anndata() should return an AnnData object with unused categories already removed from the axis data frames.