scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License

write `uns` to loom #116

Open nh3 opened 5 years ago

nh3 commented 5 years ago

Hi Alex,

I am interested in using loom as an exchange format. Similar to #111, uns is not carried over, which limits its use. I am aware of https://github.com/linnarsson-lab/loompy/issues/1. As loom 2.0 supports having datasets as global attributes, would it now be possible for write_loom() to write uns to loom? One possible way of doing it might be to put the names and types (scalar, dataset) of the items in a dataset, say /uns_manifest, and write /{uns_item_name} for each item. For obsm and varm, one might do the same, or put only metadata such as names, types and dimensions into global attributes and insert the actual data into obs/var as columns such as X_pca_1, ..., X_pca_n. It should then be easy to re-create uns, obsm and varm in a standard way when reading. I am happy to discuss further and/or make a PR for read_loom() and write_loom() if needed. Please let me know what you think. Thanks!

nh3 commented 5 years ago

Sorry, I realised that global attributes cannot be datasets but at most a multidimensional datatype, which makes them not so useful in this context. Therefore I am closing this issue.

nh3 commented 5 years ago

Hi, I recently implemented a converter between anndata and loom (https://github.com/ebi-gene-expression-group/scanpy-scripts/pull/37) with the intention of further interoperability. It allows a relatively complete transfer back and forth of .uns in addition to .obsm and .varm, by automatically generating and storing a manifest in loom. There are some hard-coded parts (e.g. special handling of .uns['neighbors']), which is admittedly not ideal, but if you think the general approach is acceptable, I am happy to make an improved version for a PR.

LuckyMD commented 5 years ago

@sophietr @mbuttner This might be an intermediate solution until cellxgene grows in functionality...

falexwolf commented 5 years ago

Thanks! We're happy to have an improved loom export and import if it goes along with loom's canonical functionality. If we are at risk of "doing strange things to loom files", then we'd better not do it.

Is "generating a manifest in loom" something that loom foresees? If yes, then happy to go over a PR. :)

nh3 commented 5 years ago

A proposed feature for loom v3 is the /global group where datasets of unrestricted shape can go according to https://github.com/linnarsson-lab/loompy/issues/51. There isn't specific mention of a manifest table under /global, but it is compatible.

As loom v3 is not yet implemented/announced, what I do is set LOOM_SPEC_VERSION to a special value ('3.0.0alpha' in this case) when writing; when reading, if the version doesn't match this value, I fall back to what sc.read_loom() does. All the extra bits go under /global when exporting, and the generated loom passes loompy v2's validation, so other loom readers should read it without problems (they just can't read the extra bits). Do you think this is acceptable?
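
The version check described above could be sketched roughly as follows. This is only an illustration: `pick_reader` and the plain dict standing in for the HDF5 root attributes are hypothetical, not part of scanpy, anndata or loompy.

```python
# Marker value quoted in the comment above.
EXTENDED_SPEC_VERSION = "3.0.0alpha"


def pick_reader(file_attrs):
    """Decide which reading path to take for a loom file.

    `file_attrs` is a plain dict standing in for the file's root
    attributes (an assumption made to keep the sketch self-contained).
    """
    version = file_attrs.get("LOOM_SPEC_VERSION")
    # String attributes may come back as bytes; normalise before comparing.
    if isinstance(version, bytes):
        version = version.decode("utf-8")
    if version == EXTENDED_SPEC_VERSION:
        return "extended"  # also restore the extras stored under /global
    return "plain"  # behave like sc.read_loom() and ignore the extras
```

Any file without the marker, including ordinary loom v2 files, takes the plain path, so the fallback is the default rather than a special case.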

falexwolf commented 5 years ago

@nh3 That sounds reasonable!

@slinnarsson With this we're finally addressing one of the initial questions that I had about loom (unstructured, global annotation). Are you fine with @nh3 supporting this within anndata as laid out above? Thanks for briefly taking the time to approve this!

slinnarsson commented 5 years ago

Hi

Sure, sounds good. To be clear, this is essentially option 2 from https://github.com/linnarsson-lab/loompy/issues/51 ?

nh3 commented 5 years ago

Hi @slinnarsson and @falexwolf,

Yes, this is essentially option 2, with a mandatory /global/manifest to store the path and data type for the stored datasets. The minimum structure would look like this:

    /.attrs['LOOM_SPEC_VERSION'] = '3.0.0alpha'
    /global
    /global/manifest
    /matrix
    /layers
    /col_attrs
    /col_graphs
    /row_attrs
    /row_graphs

where /global/manifest is a table with at least two columns: loom_path and type. More columns can be added to indicate where the dataset should go in the objects supported for conversion. Currently, it aims to support AnnData and SingleCellExperiment, so it would have additional columns called anndata_path and sce_path. Here are some examples:

    loom_path                                   type    anndata_path                      sce_path
    /global/reducedDim__pca                     array   /obsm/X_pca                       @reducedDims$PCA                       # A row for PCA embeddings
    /global/pca__variance                       array   /uns/pca/variance                 @metadata$pca$variance                 # A row for PCA variances
    /col_graphs/neighbors__connectivities       graph   /uns/neighbors/connectivities     @colGraphs$neighbors__connectivities   # A row for KNN graph
    /.attrs[louvain__parameters__random_state]  scalar  /uns/louvain/params/random_state  @metadata$louvain$params$random_state  # A row for louvain random seed

This table is generated largely automatically (with some hard-coded special treatment for certain slots) when writing to loom, and the reader function in Python or R puts the data into the specified place fully automatically.
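
To illustrate the reading side, here is a minimal sketch of how manifest rows could drive the placement. It uses plain dicts and lists in place of real loom and AnnData objects; `place_by_manifest` is a hypothetical helper, and the numeric values are made-up placeholders.

```python
def place_by_manifest(manifest, loom_values):
    """Rebuild a nested dict from flat loom entries using anndata_path."""
    result = {}
    for row in manifest:
        # "/uns/pca/variance" -> ["uns", "pca", "variance"]
        keys = row["anndata_path"].strip("/").split("/")
        node = result
        for key in keys[:-1]:
            node = node.setdefault(key, {})  # create intermediate levels
        node[keys[-1]] = loom_values[row["loom_path"]]
    return result


# Two rows copied from the example table; the values are placeholders.
manifest = [
    {"loom_path": "/global/reducedDim__pca", "type": "array",
     "anndata_path": "/obsm/X_pca"},
    {"loom_path": "/global/pca__variance", "type": "array",
     "anndata_path": "/uns/pca/variance"},
]
loom_values = {
    "/global/reducedDim__pca": [[0.1, 0.2], [0.3, 0.4]],
    "/global/pca__variance": [2.0, 1.0],
}
restored = place_by_manifest(manifest, loom_values)
```

Because the target path is stored explicitly in the manifest, the reader does not need to know how names were flattened when writing.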

I flatten the path when writing to loom since I wasn't sure from my reading whether or not /global supports nested groups under it.
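
The flattening could look something like the sketch below, assuming a double underscore as the separator (as in the example names above) and that no key contains the separator itself. `flatten` is a hypothetical helper; the actual converter also applies hard-coded renames that this sketch does not reproduce.

```python
def flatten(nested, sep="__", prefix=""):
    """Flatten a nested dict into a single level with joined key names."""
    flat = {}
    for key, value in nested.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, sep, name))  # recurse into sub-dicts
        else:
            flat[name] = value
    return flat


# Placeholder content shaped like a small .uns
uns = {
    "pca": {"variance": [2.0, 1.0]},
    "louvain": {"params": {"random_state": 0}},
}
flat = flatten(uns)
```

Here `flat` has the keys `pca__variance` and `louvain__params__random_state`, each of which could then be written as a single entry without relying on nested groups.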

Anyway, many details can be agreed on and adjusted later, but this is the general approach.

For the Python part, it calls read_loom() and write_loom() from scanpy and then uses h5py to do the extra stuff. It lacks data compression and timestamps at the moment, but these can be implemented with h5py or, better yet, loompy if there's an API for that. For the R part, it calls import() and export() from LoomExperiment and then uses rhdf5 for the rest.

Hopefully this isn't duplicating what's already implemented in loompy v3. Please let me know what you think.

Many thanks.

falexwolf commented 5 years ago

Yes, let's please make sure we don't duplicate loompy code within anndata. Within anndata, we should simply use some top-level functions that pass AnnData's fields into the appropriate loompy functions.

flying-sheep commented 1 year ago

I wonder what the path forward here is:

If I interpret the other issues correctly, loompy 3.x nowadays has the necessary functionality to make options 2 and 3 feasible.