Idea: `__dataframe__` interchange protocol for anndata

ivirshup commented 1 year ago


https://data-apis.org/dataframe-protocol/latest/index.html

It could be nice if AnnData supported the `__dataframe__` interchange protocol, especially for libraries that select columns through the `select_columns_by_name` and `get_column_by_name` interfaces.
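
To make the column-selection path concrete, here is a minimal sketch of how a consuming library could pull only the columns it needs from any object that exposes `__dataframe__`. A plain pandas DataFrame stands in for AnnData, and pandas 1.5 or later is assumed. `fetch_columns` is a hypothetical helper for illustration, not part of any library.

```python
import pandas as pd


def fetch_columns(data, names):
    """Pull only the named columns from any object exposing __dataframe__.

    Hypothetical helper: a consumer written like this never touches the
    columns it was not asked for.
    """
    xchg = data.__dataframe__()
    subset = xchg.select_columns_by_name(list(names))
    return pd.api.interchange.from_dataframe(subset)


# A pandas DataFrame used as a stand-in producer (AnnData does not
# implement the protocol today).
df = pd.DataFrame(
    {
        "log1p_total_counts": [7.1, 8.3, 6.9],
        "pct_counts_mito": [2.0, 5.5, 1.2],
        "n_genes": [1200, 2300, 900],
    }
)

picked = fetch_columns(df, ["log1p_total_counts", "pct_counts_mito"])
```

If the producer were backed by AnnData, only the two requested columns would need to be materialized, which is the behaviour the proposal relies on.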

Use-case: plotting

The biggest use case is plotting. Both seaborn (https://github.com/mwaskom/seaborn/pull/3369) and altair (https://github.com/altair-viz/altair/pull/2888) now support inputs in the dataframe protocol.

In scanpy we typically use the `sc.get.obs_df` function to create a dataframe for plotting. A major pain point in analysis code is that the user has to provide the keys they want to plot twice: once to create the dataframe, and again to the plotting interface. Instead of having to do:

sns.jointplot(
    data=sc.get.obs_df(adata, ["log1p_total_counts", "pct_counts_mito", "batch"]),
    x="log1p_total_counts",
    y="pct_counts_mito",
    hue="batch",
)

It could eventually be:

sns.jointplot(
    data=adata,  # Likely something more like `DFInterface(adata, dim="obs", layer=...)` for now
    x="log1p_total_counts",
    y="pct_counts_mito",
    hue="batch",
)

This should also work for plots of gene expression values, especially if the underlying plotting library selects columns through the dataframe interface and the matrix was stored as CSC or dense.

This could even be a nice interface to on-disk data, especially when `X` or a layer is stored as CSC.
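
As a rough illustration of why CSC storage suits per-column access, the sketch below reads a single column straight from the CSC arrays: all of a column's non-zero entries sit in one contiguous slice, so nothing else has to be scanned. It uses an in-memory SciPy matrix as a stand-in, and `csc_column` is a hypothetical helper.

```python
import numpy as np
from scipy import sparse

# Toy expression matrix: 3 observations x 3 variables, stored as CSC.
X = sparse.csc_matrix(
    np.array(
        [
            [0.0, 2.0, 0.0],
            [1.0, 0.0, 0.0],
            [0.0, 3.0, 4.0],
        ]
    )
)


def csc_column(matrix, j):
    """Densify column j by reading only its own slice of the CSC arrays."""
    out = np.zeros(matrix.shape[0], dtype=matrix.dtype)
    start, stop = matrix.indptr[j], matrix.indptr[j + 1]
    # indices[start:stop] holds the row positions of column j's non-zeros.
    out[matrix.indices[start:stop]] = matrix.data[start:stop]
    return out


col1 = csc_column(X, 1)
```

With CSR storage the same request would need to visit every row, which is why the layout matters for a column-oriented interface.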

Some more detail

- For a dataframe interface over observations, the available columns are the union of `.obs.columns`, `var_names`, and keys like `obsm/pca/0`.
- We should be able to pick an alias for `var_names`.
- We should be able to choose which layer is being accessed.
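
The column namespace described above could be sketched roughly as follows. This uses plain pandas and NumPy objects as stand-ins for `adata.obs`, `adata.var_names`, `adata.obsm` and the chosen layer. The helper names are hypothetical, the `obsm/<key>/<i>` naming follows the example in this issue, and name clashes or `var_names` aliases are not handled.

```python
import numpy as np
import pandas as pd


def obs_column_names(obs, var_names, obsm):
    """List every column the observation-wise interface would expose."""
    names = list(obs.columns) + list(var_names)
    for key, arr in obsm.items():
        # One column per embedding dimension, e.g. "obsm/pca/0".
        names.extend(f"obsm/{key}/{i}" for i in range(arr.shape[1]))
    return names


def resolve_obs_column(name, obs, var_names, matrix, obsm):
    """Return one column as a 1-D array; `matrix` is X or the chosen layer."""
    if name in obs.columns:
        return obs[name].to_numpy()
    if name.startswith("obsm/"):
        # Assumes the obsm key itself contains no "/".
        _, key, idx = name.split("/")
        return obsm[key][:, int(idx)]
    # Otherwise treat the name as a variable (gene) name.
    return matrix[:, list(var_names).index(name)]


# Toy stand-ins for adata.obs, adata.var_names, adata.X and adata.obsm.
obs = pd.DataFrame({"batch": ["a", "b", "a"]}, index=["c1", "c2", "c3"])
var_names = ["GeneA", "GeneB"]
X = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]])
obsm = {"pca": np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])}

names = obs_column_names(obs, var_names, obsm)
gene_b = resolve_obs_column("GeneB", obs, var_names, X, obsm)
pc0 = resolve_obs_column("obsm/pca/0", obs, var_names, X, obsm)
```

Choosing a layer then amounts to passing a different matrix to the resolver, which keeps the naming logic independent of where the values come from.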

Implementation

I think it would make sense for this to start out as a proof of concept outside of the main implementation. It may require pyarrow as a dependency to work. In theory, pyarrow will become a dependency of pandas v3 early next year, so that may not be an issue.

cc: @ilan-gold

ivirshup commented 1 year ago

Very rough proof of concept:

```python
import anndata as ad
import pandas as pd
import scanpy as sc
from pandas.core.interchange.column import PandasColumn
from pandas.core.interchange.dataframe import PandasDataFrameXchg
from pandas.core.interchange.dataframe_protocol import DataFrame as DataFrameXchg


class ObsDF(DataFrameXchg):
    """Observation-wise view of an AnnData object via the interchange protocol."""

    def __init__(self, adata: ad.AnnData, layer: str | None = None, allow_copy: bool = True):
        self.adata = adata
        self.layer = layer
        self.allow_copy = allow_copy

    def __dataframe__(self, nan_as_null: bool = False, allow_copy: bool = True):
        return ObsDF(self.adata, self.layer, allow_copy=allow_copy)

    @property
    def metadata(self) -> dict[str, pd.Index]:
        # `index` isn't a regular column, and the protocol doesn't support row
        # labels - so we export it as Pandas-specific metadata here.
        return {"pandas.index": self.adata.obs_names}

    def get_chunks(self, n_chunks=None):
        if n_chunks and n_chunks > 1:
            size = self.adata.n_obs
            step = size // n_chunks
            if size % n_chunks != 0:
                step += 1
            # Each chunk is a row-wise slice of the AnnData object.
            for start in range(0, step * n_chunks, step):
                yield ObsDF(
                    self.adata[start : start + step, :],
                    layer=self.layer,
                    allow_copy=self.allow_copy,
                )
        else:
            yield self

    def get_columns(self):
        raise NotImplementedError()

    def column_names(self):
        # Columns are the obs annotations followed by every variable name.
        return list(self.adata.obs.columns) + list(self.adata.var_names)

    def num_chunks(self):
        return 1

    def get_column_by_name(self, name: str):
        # `obs_vector` resolves both obs columns and variable names.
        values = self.adata.obs_vector(name, layer=self.layer)
        return PandasColumn(pd.Series(values, index=self.adata.obs_names))

    def get_column(self, i: int):
        return self.get_column_by_name(self.column_names()[i])

    def num_columns(self) -> int:
        return len(self.column_names())

    def num_rows(self) -> int:
        return self.adata.n_obs

    def select_columns_by_name(self, names: list[str]):
        # Only the requested columns are materialized as a pandas DataFrame.
        return PandasDataFrameXchg(sc.get.obs_df(self.adata, names, layer=self.layer))

    def select_columns(self, indices):
        all_names = self.column_names()
        return self.select_columns_by_name([all_names[i] for i in indices])
```

It looks like Altair/VegaFusion currently doesn't support the protocol well enough for us to be able to use it.

https://github.com/hex-inc/vegafusion/issues/386

ivirshup commented 1 year ago

Sadly, it looks like the same is true for seaborn. It just uses the interchange protocol to convert whatever type you pass into a pandas dataframe.
