Open ivirshup opened 1 year ago
Very rough proof of concept:
Looks like altair/ data fusion currently don't support the protocol well enough for us to be able to use them.
Sadly, looks like the same for seaborn. It just uses the interchange protocol to convert whatever type you pass into a pandas dataframe.
https://data-apis.org/dataframe-protocol/latest/index.html
It could be nice if AnnData supported the `__dataframe__` interchange protocol, especially when used by libraries which will use the `select_columns_by_name`, `get_column_by_name` interfaces.
Use-case: plotting
The biggest use case is plotting. Both seaborn (https://github.com/mwaskom/seaborn/pull/3369) and altair (https://github.com/altair-viz/altair/pull/2888) now support inputs in the dataframe protocol.
In `scanpy` we typically use the `sc.get.obs_df` function to create a dataframe for plotting. A major pain point for this in analysis code is that the user has to provide the keys they want to plot multiple times: once when creating the dataframe, and again when calling the plotting interface. Instead of building the dataframe first and then plotting from it, it could eventually be possible to pass the AnnData object directly to the plotting function.
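As a rough illustration of how a plotting library could pull only the columns it needs through the protocol, here is a minimal, stdlib-only sketch. `ToyAnnData` and `ToyInterchangeFrame` are hypothetical stand-ins, not AnnData or pandas classes, and the frame is heavily simplified: a real interchange frame returns column objects backed by buffers rather than plain lists.

```python
class ToyInterchangeFrame:
    """Hypothetical, simplified interchange frame over a ToyAnnData."""

    def __init__(self, adata, names):
        self._adata = adata
        self._names = list(names)
        self.materialized = []  # records which columns were actually read

    def column_names(self):
        return list(self._names)

    def num_rows(self):
        return len(self._adata.X)

    def select_columns_by_name(self, names):
        # Narrow the view without reading any data.
        missing = [n for n in names if n not in self._names]
        if missing:
            raise KeyError(f"unknown columns: {missing}")
        return ToyInterchangeFrame(self._adata, names)

    def get_column_by_name(self, name):
        # Data is only read here, one column at a time.
        if name not in self._names:
            raise KeyError(name)
        self.materialized.append(name)
        if name in self._adata.obs:
            return list(self._adata.obs[name])
        j = self._adata.var_names.index(name)
        return [row[j] for row in self._adata.X]


class ToyAnnData:
    """Stand-in for AnnData: obs columns plus a dense cells-by-genes matrix."""

    def __init__(self, obs, X, var_names):
        self.obs = obs                    # column name -> per-cell values
        self.X = X                        # list of rows, one row per cell
        self.var_names = list(var_names)  # gene names, one per column of X

    def __dataframe__(self, nan_as_null=False, allow_copy=True):
        # Entry point of the interchange protocol: return a lazy view
        # exposing obs columns and genes under a single namespace.
        return ToyInterchangeFrame(self, list(self.obs) + self.var_names)


adata = ToyAnnData(
    obs={"n_counts": [10, 20, 30], "cell_type": ["B", "T", "B"]},
    X=[[0.0, 1.5], [2.0, 0.0], [0.0, 3.0]],
    var_names=["CD3E", "MS4A1"],
)

# What a plotting library might do internally with x="n_counts", y="CD3E":
frame = adata.__dataframe__().select_columns_by_name(["n_counts", "CD3E"])
x = frame.get_column_by_name("n_counts")
y = frame.get_column_by_name("CD3E")
```

In this sketch the keys are named once, by the consumer, and `cell_type` and `MS4A1` are never read.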
This should also work for plots of gene expression values, especially if the underlying plotting library selects columns through the dataframe interface and the matrix was stored as CSC or dense.
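To make the CSC point concrete: in CSC layout the nonzeros of each column are stored contiguously, so reading one gene only touches the slice `indptr[j]:indptr[j + 1]`. The helper below is a hypothetical, stdlib-only sketch of that lookup, not code from AnnData or SciPy.

```python
def csc_column(data, indices, indptr, n_rows, j):
    """Return column j of a CSC matrix as a dense list of length n_rows."""
    col = [0.0] * n_rows
    # Only the nonzeros belonging to column j are visited.
    for k in range(indptr[j], indptr[j + 1]):
        col[indices[k]] = data[k]
    return col


# CSC encoding of the 3 x 2 matrix [[0, 1.5], [2, 0], [0, 3]]
data = [2.0, 1.5, 3.0]   # nonzero values, column by column
indices = [1, 0, 2]      # row index of each nonzero
indptr = [0, 1, 3]       # column j occupies data[indptr[j]:indptr[j + 1]]

gene_0 = csc_column(data, indices, indptr, n_rows=3, j=0)
gene_1 = csc_column(data, indices, indptr, n_rows=3, j=1)
```

With a row-major (CSR) layout the same request would have to scan every row, which is why column selection through the dataframe interface pairs well with CSC or dense storage.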
This could even be a nice interface to on-disk data, especially when `X`/`layers` is stored in CSC.
Some more detail
`.obs.columns`, `var_names`, keys like `obsm/pca/0`
`.var_names`
`layer` is being accessed
Implementation
I think it would make sense for this to start out as a POC outside of the main implementation. It may require `pyarrow` as a dependency to work. In theory `pyarrow` will be a dependency of `pandas` v3 early next year, so that may not be an issue.
cc: @ilan-gold