What are the goals here?

ivirshup commented 5 years ago

I think it would be good to scope out the requirements of an interchange file format. This could probably start with some ideas of what the use cases are (basic user stories).

Individual bioinformatician wants to use tools from different ecosystems
Group of bioinformaticians want to collaborate on some data, but are familiar with different sets of tools
Data distribution platform wants a format that the majority of packages can read
Tool developers want their analyses to be shareable and storable
Everyone wants this format to be fast to read and write, and reasonably space efficient

Some questions I have about what is reasonably achievable:

What does the current de facto standard, loom, not provide?
@LTLA has mentioned one-to-one mapping between representations. Where are the limits of this? Do we need serialized binary fallbacks in each case?
To what extent do we go for conventions v. generality?

A little expansion on "conventions v. generality":

In an AnnData object we don't have nested data frames, so I would imagine any nested dataframes could just be used as elements of obsm. This is probably also where we'd put reducedDims. How do we keep this information around? We could just know what kinds of names are reduced dimensions, or we'd have to "tag" the arrays.
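To make the two options concrete, here is a minimal sketch of "know by name" versus "tag the arrays". Plain dicts stand in for obsm and for a side table of tags; the `X_` prefix rule, the `tags` table, and both helper functions are assumptions made for illustration, not part of AnnData or any proposed format.

```python
# Hypothetical sketch: plain dicts stand in for obsm and for a tag table.
REDUCED_DIM_PREFIX = "X_"  # assumed naming convention, not a specification


def is_reduced_dim_by_name(key):
    """Convention: anything whose name starts with the prefix is a reduced dimension."""
    return key.startswith(REDUCED_DIM_PREFIX)


def is_reduced_dim_by_tag(key, tags):
    """Generality: an explicit tag stored next to the array says what it is."""
    return tags.get(key, {}).get("kind") == "reduced_dim"


obsm = {
    "X_pca": [[0.1, 0.2], [0.3, 0.4]],
    "my_embedding": [[1.0, 2.0], [3.0, 4.0]],  # a reduced dim with an unconventional name
    "qc": [[10, 0.5], [12, 0.7]],              # a nested data frame moved into obsm
}
tags = {
    "X_pca": {"kind": "reduced_dim"},
    "my_embedding": {"kind": "reduced_dim"},
    "qc": {"kind": "nested_frame"},
}

by_name = sorted(k for k in obsm if is_reduced_dim_by_name(k))
by_tag = sorted(k for k in obsm if is_reduced_dim_by_tag(k, tags))
```

In this toy case the name-based rule misses `my_embedding`, while the tag-based rule finds it at the cost of carrying extra metadata.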

@flying-sheep, from your work with in-memory exchange, do you have any thoughts on this?

flying-sheep commented 5 years ago

SingleCellExperiment is more specific (e.g. reducedDims exists while we have the more generic obsm), so concepts that are conventions in AnnData aren’t in SCE.

Is there any point here you’d like to hear my opinion on specifically? :smiley:

ivirshup commented 5 years ago

I was wondering if you had thoughts on dealing with round-trip conversions when there aren't clear one-to-one mappings. For example, going R->Python->R with a SingleCellExperiment with nested dataframes. It's not obvious to me (from here) how you could deal with that. If you flatten, how do you know what to unflatten? If you move them to obsm, how do you know what to move back to colData? Another example would be the SingleCellExperiment LinearEmbeddingMatrix, where the variable loadings never get subset, so it doesn't quite map to varm.
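One way to picture the flatten/unflatten question is to carry a small manifest alongside the flattened columns. The sketch below uses plain dicts of columns as a stand-in for colData; the `flatten`/`unflatten` helpers, the `.` separator, and the manifest layout are all hypothetical, and only one level of nesting is handled.

```python
def flatten(col_data, sep="."):
    """Flatten one level of nesting and record which columns came from where."""
    flat, manifest = {}, {}
    for name, value in col_data.items():
        if isinstance(value, dict):  # a nested frame, modelled here as a dict of columns
            manifest[name] = list(value)
            for sub, column in value.items():
                flat[f"{name}{sep}{sub}"] = column
        else:
            flat[name] = value
    return flat, manifest


def unflatten(flat, manifest, sep="."):
    """Rebuild the nesting using the manifest rather than guessing from names."""
    origin = {
        f"{parent}{sep}{sub}": (parent, sub)
        for parent, subs in manifest.items()
        for sub in subs
    }
    out = {}
    for key, column in flat.items():
        if key in origin:
            parent, sub = origin[key]
            out.setdefault(parent, {})[sub] = column
        else:
            out[key] = column
    return out


col_data = {
    "batch": ["a", "b"],
    "qc": {"n_genes": [100, 200], "pct_mito": [0.1, 0.2]},
}
flat, manifest = flatten(col_data)
restored = unflatten(flat, manifest)
```

Without the manifest, a top-level column that happens to be named `qc.n_genes` would be indistinguishable from a flattened nested one, which is the ambiguity raised above. The sketch does not handle such name collisions.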

flying-sheep commented 5 years ago

I don’t handle anything tricky yet :sweat_smile: Almost everything I do is round-trippable (except for the name conversion which changes capitalization and would canonicalize the obsm/reducedDims name of diffusion maps – ad.obsm['X_dm'] → reducedDim(sce, 'DM') → ad.obsm['X_diffmap'])
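The diffusion-map example can be reproduced with a toy pair of name-mapping functions. The lookup tables and the fallback rule (strip `X_` and uppercase, or lowercase and prepend `X_`) are assumptions chosen to reproduce the behaviour described, not the actual conversion code.

```python
# Hypothetical alias tables; only the X_dm / DM / X_diffmap triple comes from the thread.
TO_SCE = {"X_dm": "DM", "X_diffmap": "DM"}
TO_ANNDATA = {"DM": "X_diffmap"}


def to_sce_name(key):
    """Map an obsm key to a reducedDims name."""
    if key in TO_SCE:
        return TO_SCE[key]
    stripped = key[2:] if key.startswith("X_") else key
    return stripped.upper()


def to_anndata_name(name):
    """Map a reducedDims name back to an obsm key."""
    if name in TO_ANNDATA:
        return TO_ANNDATA[name]
    return "X_" + name.lower()


def round_trip(key):
    return to_anndata_name(to_sce_name(key))


pca_key = round_trip("X_pca")    # survives unchanged
dm_key = round_trip("X_dm")      # canonicalized to a different key
tsne_key = round_trip("X_tSNE")  # capitalization is lost
```

Two keys mapping onto one name (`X_dm` and `X_diffmap` onto `DM`) is what makes the conversion lossy: the reverse direction can only pick one of them.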

What do you mean by flattening? Are there nested data.frames in SCE? What for?

ivirshup / sc-interchange

What are the goals here? #3