linnarsson-lab / loom-viewer

Tool for sharing, browsing and visualizing single-cell data stored in the Loom file format
BSD 2-Clause "Simplified" License

Figure out how to map Loom data to Vega specs #59

Closed JobLeonard closed 7 years ago

JobLeonard commented 7 years ago

On the one hand we have Vega (or D3, if we go with that instead), which expects data to have a more tabular structure. From the Vega documentation on data (emphasis mine):

_The basic data model used by Vega is tabular data, similar to a spreadsheet or database table._ Individual data sets are assumed to contain a collection of records (or "rows"), which may contain any number of named data attributes (fields, or "columns"). Upon load, Vega copies each data record into a new data object and assigns it a unique _id.

For example, if a Vega spec loads input JSON data like this:

[{"x":0, "y":3}, {"x":1, "y":5}]

the input data is then loaded into data objects like this:

[{"_id":0, "x":0, "y":3}, {"_id":1, "x":1, "y":5}]

If the input JSON is simply an array of primitive values, Vega maps each value to the data property of a new object with a unique _id. For example [5, 3, 8, 1] is loaded as:

[{"_id": 0, "data": 5}, {"_id": 1, "data": 3},
 {"_id": 2, "data": 8}, {"_id": 3, "data": 1}]

Data sets can be specified directly (either through including the data inline or providing a URL from which to load the data), or bound dynamically at runtime (by providing data at chart instantiation time). Note that loading data from a URL will be subject to the policies of your runtime environment (e.g., cross-domain request rules).

On the other hand we have the Loom spec (emphasis mine again):

The .loom file format is designed to efficiently hold large omics datasets. _Typically, such data takes the form of a large matrix of numbers, along with metadata for the rows and columns._ For example, single-cell RNA-seq data consists of expression measurements for all genes (rows) in a large number of cells (columns), along with metadata for genes (e.g. Chromosome, Strand, Location, Name), and for cells (e.g. Species, Sex, Strain, GFP positive).

_We designed .loom files to represent such datasets in a way that treats rows and columns the same. You may want to cluster both genes and cells, you may want to perform PCA on both of them, and filter based on quality controls. SQL databases and other data storage solutions almost always treat data as a table, not a matrix, and makes it very hard to add arbitrary metadata to rows and columns._ In contrast, .loom makes this very easy.

Furthermore, current and future datasets can have tens of thousands of rows (genes) and hundreds of thousands of columns (cells). We designed .loom for efficient access to arbitrary rows and columns.

Finally, the annotated matrix format lends itself to very natural representation of common analysis tasks. For example, the result of a clustering algorithm can be stored simply as another attribute that gives the cluster ID for each cell. Dimensionality reduction such as PCA or t-SNE, similarly, can be stored as two attributes giving the projection coordinates of each cell.

Loom data is not stored in tabular form, for good reasons, but we still need to map the data that we want to visualize to tabular form if we want to use Vega (or, indeed, just about any framework out there).
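To make the mismatch concrete, here is a rough sketch of what such a mapping could look like for a scatter plot: column attributes (one array per attribute) are turned into one record per cell. The helper name `attrsToRecords`, the attribute names, and the values are made up for illustration; none of this is existing loom-viewer code.

```javascript
// Hypothetical helper: convert column-oriented attributes (one array per
// attribute, Loom-style) into row-oriented records (one object per cell),
// which is the shape a tabular framework such as Vega expects.
function attrsToRecords(attrs, keys) {
  if (keys.length === 0) return [];
  const length = attrs[keys[0]].length;
  // All selected attributes must describe the same number of cells.
  for (const key of keys) {
    if (attrs[key].length !== length) {
      throw new Error(`Attribute "${key}" has a different length`);
    }
  }
  const records = new Array(length);
  for (let i = 0; i < length; i++) {
    const record = {};
    for (const key of keys) {
      record[key] = attrs[key][i];
    }
    records[i] = record;
  }
  return records;
}

// Made-up column attributes for three cells.
const colAttrs = {
  tSNE1: [0.1, -2.3, 4.0],
  tSNE2: [1.5, 0.2, -0.7],
  Cluster: [0, 1, 1],
};
const records = attrsToRecords(colAttrs, ['tSNE1', 'tSNE2', 'Cluster']);
```

Note that this allocates one object per cell, which is where the memory concern comes from once we are dealing with hundreds of thousands of cells.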

So we need two things:

  1. To figure out what (meta)data to map for each data visualisation, and into what form
  2. Helper functions to make this mapping easy and less error-prone, preferably working in a way that doesn't thrash the garbage collector with I-don't-know-how-many temporary JavaScript objects.
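
One possible approach to the second point, sketched under assumptions: keep a pool of record objects around and overwrite them on every update instead of building a fresh array of objects each time. `fillRecords` and the sample data are hypothetical, and whether this actually reduces garbage collection pressure would need to be measured. Also, going by the Vega documentation quoted above, Vega copies each record on load, so this only saves allocations on our side of the boundary.

```javascript
// Hypothetical helper: fill `pool` with one record per cell, reusing the
// objects already in the pool and allocating only when the pool is too short.
function fillRecords(pool, attrs, keys) {
  const length = keys.length > 0 ? attrs[keys[0]].length : 0;
  for (let i = 0; i < length; i++) {
    // Reuse the existing object at this index if there is one.
    const record = pool[i] || (pool[i] = {});
    for (const key of keys) {
      record[key] = attrs[key][i];
    }
  }
  // Drop leftover records from a longer previous fill.
  pool.length = length;
  return pool;
}

// Usage with made-up data: the second call overwrites the first record
// in place rather than allocating a new object for it.
const pool = [];
fillRecords(pool, { x: [0, 1], y: [3, 5] }, ['x', 'y']);
const first = pool[0];
fillRecords(pool, { x: [7, 8, 9], y: [1, 2, 3] }, ['x', 'y']);
```

A caveat with this sketch: keys written by an earlier fill stay on the reused objects, so it is only safe as-is when every fill uses the same set of keys.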

JobLeonard commented 7 years ago

Since I've decided we're not using Vega (see #79), this can be closed.