ivirshup / sc-interchange

Better interchange for single cell tools

sc-interchange

Purpose of the repository

The purpose of this repository is to enhance interoperability between tools for single cell analysis. The primary goal is to have better interchange file formats. That is, SingleCellExperiment users and AnnData users should be able to quickly share data without having to rely on suboptimal intermediates.

Active questions:

Existing solutions

Summary of existing formats

AnnData (v0.7+)

In memory

Based around two key dimensions, observations (obs, cells) and variables (var, genes). These are the dimensions of the main expression matrix X, and metadata about them are stored in a pair of dataframes, obs and var.

Other matrices which are aligned to some set of the main axes are stored in mappings under the layers, obsm, varm, obsp, and varp attributes. These mappings can hold array-like objects (currently: arrays, sparse arrays, dataframes).

Anything which doesn't fit into these categories goes into the uns mapping.

On disk layout (.h5ad)

The on disk schema is broadly similar to the in memory one. The root group has keys for X, layers, obsm, varm, obsp, varp, and uns. Objects are represented as follows:

Each mapping is a group. We're starting to introduce conventions to identify how each object should be decoded based on its attrs, though this was previously done in an ad-hoc manner.

Loom

TODO

Official spec

SingleCellExperiment

The SingleCellExperiment class is derived from the SummarizedExperiment class and thus shares its features:

The SingleCellExperiment provides the following additional features:

These aspects are summarized in the figure below:

The SingleCellExperiment class itself has no mandated on-disk format. Individual matrices in assays or fields in the colData/rowData may be file-backed objects (e.g., HDF5Matrix objects), but the choice of representation and file format is left to the discretion of the user.