Proposal: Immutable anndata

chaichontat commented 2 years ago

Hi! I have been using scanpy and anndata and start noticing potential difficulties as we analyze more complex datasets.

Problem

There can be a ballooning number of implicit dependencies in adata. For example, running sc.pl.pca_variance_ratio requires running sc.tl.pca first, the results of which are written directly into adata. This may not seem like a big problem but can lead to inconsistencies down the line if the we run the first function with different parameters but forget to run the second again. The number of columns in obs and var can also grow to the point that we lose track of how each column came to be. Each step of the function could accidentally cause side effects that the user may not be aware of.

Proposed solution

This issue could be fixed by making anndata and its applications more functional. By making anndata immutable, we could prevent inadvertent modification and cache intermediate results using lru_cache. This way, we won't need to modify adata within a session. Important or expensive functions would return their results in its own container or dataclass with methods for working with those. Functions would rely less on side-effects, which would lead to more predictable outcomes in the long run. Another benefit would be the possibility of safely parallelizing pipelines in the future.

At the end of the session, expensive results could be saved along with their associated parameters as a folder in an HDF5 file, an automatic data provenance mechanism. The user could then pick which of the fields they want to save. Mappings from obs or var to this container could be created to maintain backward compatibility.

Thanks for reading! Let me know if you are interested in this approach.

Example implementation

# pyright: reportUnknownMemberType=false
#%%
from dataclasses import dataclass, make_dataclass
from functools import lru_cache, wraps
from typing import (
    Any,
    Callable,
    Collection,
    Mapping,
    ParamSpec,
    Protocol,
    TypedDict,
    TypeVar,
    cast,
)

import matplotlib.pyplot as plt
import scanpy as sc
import static_frame as sf
from matplotlib.collections import PathCollection
from scipy.sparse._csr import csr_matrix
from sklearn.decomposition import TruncatedSVD

adata = sc.read_10x_mtx("hg19", var_names="gene_symbols", cache=True)

# Make read-only.
# https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html
adata.X.data.flags.writeable = False
adata.X.indices.flags.writeable = False
adata.X.indptr.flags.writeable = False

class IsDataclass(Protocol):
    __dataclass_fields__: dict[str, Any]

class IsDataclassKwargs(Protocol):
    __dataclass_fields__: dict[str, Any]
    _hash_adata: int
    kwargs: Mapping[str, Any]

@dataclass
class StaticAnnData:
    # FrameHE is an immutable, hashable dataframe.
    X: sf.FrameHE | csr_matrix
    obs: sf.FrameHE = None  # type: ignore
    var: sf.FrameHE = None  # type: ignore
    analyses: Mapping[str, IsDataclassKwargs] = {}
    _obsp: Mapping[str, tuple[str, str]] = None  # type: ignore # Shortcut to dataframes in analyses.

    def __post_init__(self):
        self.shape = self.X.shape
        self.obs = self.obs or sf.FrameHE()  # Removes None
        self.var = self.var or sf.FrameHE()
        self.analyses = {} if len(self.analyses) == 0 else self.analyses
        self._obsp = {} if len(self._obsp) == 0 else self._obsp

    def __hash__(self) -> int:
        # Use id as hash for now since we're not comparing objects and X is immutable.
        return hash((id(self.X), self.obs, self.var))  # et al.

    @property
    def obsp(self) -> Mapping[str, sf.FrameHE]:
        # Needs to build new dict every time in case the analyses changed.
        return {k: getattr(self.analyses[v[0]], v[1]) for k, v in self._obsp.items()}

    # Example save function.
    # Save only expensive computations.
    def save(
        self, objs: list[IsDataclassKwargs], obsp: Mapping[str, tuple[str, str]]
    ) -> None:
        for _ in objs:
            "Save in adata.analyses as folders in HDF5?"
        # Write obsp
        return

#%% Example workflow with PCA.
DC = TypeVar("DC", bound=IsDataclassKwargs)
P = ParamSpec("P")

# Saves keyword arguments along with results.
def ret_type(Data: type[DC]) -> Callable[[Callable[P, Any]], Callable[P, DC]]:
    def get_args(f: Callable[P, Any]) -> Callable[P, DC]:
        @wraps(f)
        def inner(*args: P.args, **kwargs: P.kwargs) -> DC:
            kwargs.update(_hash_adata=hash(adata))
            if isinstance(ret := f(*args, **kwargs), Mapping):
                return Data(**ret, kwargs=kwargs)  # type: ignore
            if isinstance(ret, Collection):
                return Data(*ret, kwargs=kwargs)  # type: ignore
            return Data(ret, kwargs=kwargs)  # type: ignore

        return inner

    return get_args

# Dataclass for results and parameters of each analysis.
@dataclass(frozen=True)
class PCA:
    class PCAKwargs(TypedDict):
        n_components: int

    transformed: sf.FrameHE  # obsp
    components: sf.FrameHE  # var
    explained_variance: sf.SeriesHE
    kwargs: PCAKwargs
    _hash_adata: int  # Used when saved.

    # Plot functions stay with the results. Also allows access to the underlying data.
    def plot(self, ax: plt.Axes | None = None, **kwargs: Any) -> PathCollection:
        ax = ax or plt  # type: ignore # Use plt if no ax is provided.
        assert ax is not None
        return ax.scatter(
            self.transformed[0].values, self.transformed[1].values, **kwargs
        )

# Cache only expensive functions. Others could be computed JIT.
# lru_cache eats type information until https://github.com/microsoft/pylance-release/issues/764 is fixed.
@ret_type(PCA)
@lru_cache(2)
def run_pca(
    adata: StaticAnnData, *, n_components: int = 2
) -> tuple[sf.FrameHE, sf.FrameHE, sf.SeriesHE]:
    p = TruncatedSVD(n_components=n_components)
    p.fit(adata.X)
    # Need to be immutable.
    # Returns correspond to fields in the dataclass.
    return (
        sf.FrameHE(p.fit_transform(adata.X)),
        sf.FrameHE(p.components_),
        sf.SeriesHE(p.explained_variance_),
    )

x = StaticAnnData(adata.X)

run_pca(x, n_components=3).plot()

res = run_pca(x, n_components=3)
res2 = run_pca(x, **res.kwargs)  # Reproducible run.

# Another function down the line.
# Explicit or cached implicit dependencies.
def pca_plus_one(adata: StaticAnnData, pca: PCA | None = None) -> sf.FrameHE:
    if pca is None:
        pca = cast(PCA, adata.analyses.get("pca", None))
        pca = pca or run_pca(adata)
        assert isinstance(pca, PCA)
    return pca.components + 1

Zethson commented 2 years ago

Hi,

thank you very much for this high quality issue with even a small POC!

The number of columns in obs and var can also grow to the point that we lose track of how each column came to be.

There have been discussions to record all operations with modified an AnnData object @ivirshup @falexwolf.

Functions would rely less on side-effects, which would lead to more predictable outcomes in the long run. Another benefit would be the possibility of safely parallelizing pipelines in the future.

Agree. This would be great.

I'd like to note that everything modifying one fat AnnData object inplace is in my opinion very beginner friendly. But I can certainly see the issues and benefits that you outline. Without making any recommendations (I need to think about this more) I am interested in what others are thinking.

falexwolf commented 2 years ago

@chaichontat, thank you for the in-depth elaboration and proposals!

I agree on the downsides of the present in-place data science workflows that you're pointing out.

I agree with @Zethson that for simple workflows, the current API seems very intuitive and beginner-friendly.

But I also think that for complex cases the problems are indeed so severe that there is indeed much danger to making mistakes and getting lost. A more safe/robust and more convenient/readable/intepretable API would be absolutely desirable.

There are a variety of ways of arriving there. I think that all of these ways have in common that the book-keeping of the data science workflow (API calls) needs to be dealt with explicitly. Right now, it's merely implicit in the order of keys added to AnnData. Instead, "Bookkeeping" and "datacontaining" should likely be separated with explicit objects.

I could imagine though, that the data scientist may still just work with one combined data container object, just that that container object is subject to "strong supervision" from a bookkeeper object in the background. I'd suggest calling such a data container one that displays "conditional/supervised immutability". Already written slots of data within the container cannot be mutated, unless the Bookkeeper determines it's safe. Or if it's append-only.

After studying and learning a lot, I think the solution you prototyped @chaichontat seems a very good way of resolving the mentioned issues through several very elegant ideas! In my view, to produce an elegant workflow user experience that could approach the simplicity of the current one, it should be complemented by a Bookkeeper. I think this is the direction in which you're going by saying

Mappings from obs or var to this container could be created to maintain backward compatibility.

I'll also need to think more about it. And how all of this can be reconciled by people who want a more object-oriented API.

I'm looking forward to reading what others say!

chaichontat commented 2 years ago

static-frame actually has a FrameGO object that is append-only.

We could also integrate anndata into a data version control system, similar to what the ML people are using. We get appendable data with full history. https://dvc.org/doc/use-cases/versioning-data-and-model-files

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. Please add a comment if you want to keep the issue open. Thank you for your contributions!

scverse / anndata