linnarsson-lab / loompy

Python implementation of the Loom file format - http://loompy.org
BSD 2-Clause "Simplified" License

Loom file format / conventions / metadata #98

Open KrisDavie opened 5 years ago

KrisDavie commented 5 years ago

Hey All,

I’m one of the developers behind SCope (https://github.com/aertslab/SCope) and we have been leveraging the loom file format for a long time now to store data not only for SCope, but projects in general. More recently I have been working on implementing pipelines for single-cell analysis, with the use of SCope for visualisation. For this reason we are using Loom files as intermediate and final data stores, allowing researchers to easily visualise and explore their data.

One of the things that we realised when first designing SCope was the convenience of being able to store multiple different analyses in a single loom file and visualise them together. With this in mind, we built functionality to display a variety of data types and visualise multiple attributes or dimensionality reductions at the same time. However, we are aware that this was not the original intended use for Loom files, and that the original idea was to store single analyses in single files. The Loom conventions mention that this is to avoid the need to keep track of relationships between attributes. With the explosion of single-cell analysis tools, we believe that keeping every analysis separate is going to become laborious and space-inefficient, because every loom file needs its own copy of the expression matrix.

With this in mind, we continued to use loom files to store many different analyses at the same time, and ‘solved’ the issue of attribute relationships by implementing JSON-based metadata, which we currently store within the file attributes section (which, as we know, is soft-limited to 16kB, with some libraries setting this as a hard limit). I have recently completed a JSON schema which defines the structure of this metadata (https://github.com/aertslab/SCope/commit/74fafd9eda2f3bf103da2aeb2c79da9fee9602f9). We have also had interest from others in this type of metadata and the way that we store it. To overcome the size limit, we began compressing and encoding our metadata, allowing us to squeeze ~100kB of data into this space. For one loom file, this metadata allows us to store cluster IDs and corresponding labels for 8 different Seurat cluster resolutions (totalling >900 clusters), data for 150 regulons generated with SCENIC (including names and several thresholds), as well as a few other bits of information. However, this does reach the limits of what we can store, even after compression.
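To make the compress-and-encode step concrete, here is a minimal sketch of the general idea, assuming zlib compression followed by base64 encoding (the exact scheme and the metadata field names below are assumptions for illustration, not necessarily what SCope does). How the resulting string is written into the loom file attributes is left out.

```python
import base64
import json
import zlib


def encode_metadata(metadata: dict) -> str:
    """Serialise to compact JSON, compress with zlib, then base64-encode to ASCII."""
    raw = json.dumps(metadata, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(zlib.compress(raw, 9)).decode("ascii")


def decode_metadata(encoded: str) -> dict:
    """Reverse of encode_metadata: base64-decode, decompress, parse JSON."""
    return json.loads(zlib.decompress(base64.b64decode(encoded)).decode("utf-8"))


# Hypothetical metadata; the keys are invented for this example.
example = {
    "clusterings": [
        {
            "id": 0,
            "name": "Seurat resolution 0.4",
            "clusters": [
                {"id": i, "description": f"Unannotated cluster {i}"}
                for i in range(100)
            ],
        }
    ]
}

encoded = encode_metadata(example)
plain_size = len(json.dumps(example))
print(f"plain JSON: {plain_size} chars, encoded: {len(encoded)} chars")
```

Repetitive structures such as long lists of cluster labels compress well, which is why this buys a lot of headroom, but base64 adds roughly a third on top of the compressed size, so the attribute limit is still reached eventually.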

Aside from the metadata extensions, we have been applying multi-dimensional arrays to column/row attributes, allowing us to store a wider variety of data. One example is the row attributes which define markers (as well as the p-values and log fold changes) of clusters stored within the loom file. Another example is a column attribute which stores the AUC matrix generated from SCENIC (Cells x Transcription factors). To simplify this, we have been storing these as named arrays, which allows us to keep identifying column/row names (crucial for the AUC matrix). However, we did notice that these attributes are no longer valid in a loom file after loompy v2.0.14ish.

Within the scope of using loom files for (containerized) analysis pipelines, we have been thinking of other types of information that we would like to store in some kind of MetaData section, and are thinking that some kind of transaction log would be great (like the PG header tag in the BAM format). The idea being that alongside the actual data, we would be able to store information on the tool versions and parameters used to generate all of the data and analyses within the loom file. This would result in a self-contained data store which includes comprehensive information on its origin and generation, moving towards well documented and reproducible analyses.
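As a rough illustration of what such a transaction log could look like, here is a sketch in which each processing step appends one record that points back at the previous step, similar in spirit to how a BAM `@PG` line can reference the preceding program. The record fields, tool names, and parameters are all hypothetical; nothing here is an existing loompy feature.

```python
import json
from datetime import datetime, timezone


def append_step(log: list, tool: str, version: str, parameters: dict) -> list:
    """Return a new log with one more record, chained to the previous step."""
    entry = {
        "id": f"{tool}.{len(log) + 1}",
        "tool": tool,
        "version": version,
        "parameters": parameters,
        # Link to the preceding step so the full history can be reconstructed.
        "previous": log[-1]["id"] if log else None,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    return log + [entry]


# Hypothetical pipeline steps; names and parameters are placeholders.
log = []
log = append_step(log, "example-normaliser", "1.2.0", {"scale_factor": 10000})
log = append_step(log, "example-clusterer", "0.9.1", {"resolution": 0.8})

# The log serialises to plain JSON, so it could live alongside other metadata.
serialised = json.dumps(log)
```

Keeping the log append-only means each tool in a pipeline only has to add its own record, without needing to understand what earlier tools wrote.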

With all of this being said, we have a couple of questions to @slinnarsson and the other loompy developers.

  1. Do you guys see the loom file format moving forward to contain multiple analyses and analysis types like we are currently doing, and if so, are you up for brainstorming exactly how this could be approached?

  2. How do you envision the kind of metadata we are talking about being stored in future versions of loompy? I know that there is some discussion in https://github.com/linnarsson-lab/loompy/issues/51, and I apologize that we have not participated there yet.

  3. We noticed that some issues are being tagged for loompy3, but haven’t seen a branch specifically for this, is this already in development or is it something more conceptual at the moment? We would be happy to contribute to the project to help move things forward.

  4. Finally, do you see how we are doing things as too different? If so, do you prefer that we fork from loompy and work on a version specifically aimed towards our goals and SCope? Although we are OK to do this, we would prefer not to as we really think focussing together on a common format would be better for the community as a whole.

We look forward to hearing your views and maybe working together on this in the future!

Cheers,

Kris

slinnarsson commented 5 years ago

Hi

Thanks for reaching out! I will give this some more thought before I respond in full, but just wanted to acknowledge your posting and say that I'm very much in favour of trying to accommodate the needs of all the various pipelines out there. We also need to be mindful of backwards compatibility, and of the need to bring along other loom readers/writers such as loomR.

Right now I'm thinking that some of the metadata issues could be layered on top of the current loom format (and maybe that's what you're proposing). I.e. there could be attributes of attributes. But let me think a little deeper.

Thanks

/Sten

slinnarsson commented 5 years ago

Sorry for being slow. Let's have a conference call to discuss! Can you email me? sten.linnarsson@ki.se