Updates to Loom schema

This issue tracks two different proposed updates to the Loom schema. If we use default values, we can easily implement them in such a way that everything stays backwards compatible with older files that lack these fields.

Adding the date of creation to the loom file's metadata

Right now we keep track of the age of the loom file by what the file-system reports as the last time it was modified. Since the data is supposed to be immutable I thought this would be fine. However, some operations do seem to update the file, at least enough for the OS to think it was recently modified. This screws up the sorting of our datasets, which defaults to newest-first. What we really want is to access the creation date, but keeping track of file creation time is not supported on Linux.

So, the simple solution here is to store the date a file was created in the loom file itself.

Extra benefit: imagine we want to recreate an existing loom file because we improved, say, the backspin algorithm, or added a new field to it. While all other "general" metadata would remain identical, the creation date would change. So if we ever implement caching, we can rely on this to see if previously downloaded data is "stale" or not.
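
To make the caching idea concrete, here is a minimal sketch of such a staleness check. The function name `is_stale` is hypothetical, and it assumes the creation date is available as a comparable value (a `datetime` here); the issue does not specify the storage format.

```python
from datetime import datetime, timezone

def is_stale(cached_creation_date, server_creation_date):
    """Return True if a cached copy should be discarded.

    Assumption: any difference in creation date means the loom file
    was regenerated, so the cached data no longer matches it.
    """
    if cached_creation_date is None:
        # Nothing cached yet (or an old cache entry without a date).
        return True
    return cached_creation_date != server_creation_date

# Made-up dates, purely for illustration.
cached = datetime(2016, 5, 1, tzinfo=timezone.utc)
regenerated = datetime(2016, 6, 1, tzinfo=timezone.utc)

print(is_stale(cached, cached))
print(is_stale(cached, regenerated))
```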

Naming and labelling the rows/columns

Right now, for plotters that show cells, we can fetch individual gene information. To do so, we look at one unofficial "magical" row attribute, Gene, to label individual rows. Users can then select which gene to fetch, which is a matter of finding which row is associated with which gene label, and then fetching that row number. That principle can be made more generic, in a backwards-compatible way.

First, we add a name field for both rows and columns. By default, these would be Genes and Cells respectively, but we could actually put any kind of data in there; the format is pretty data-agnostic. Then, we add a keyAttr field for both as well. This would tell the client which of the metadata attributes is responsible for labelling the individual rows/columns. By default, this would be Gene and (I assume) Cell_ID.
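
As a rough sketch of how a keyAttr would be used, the lookup below finds the index to fetch for a given label. It is written in Python for brevity rather than in the client's own code; `find_index` and the sample attribute values are made up for illustration.

```python
def find_index(attrs, key_attr, label):
    """Return the index of the row/column whose key attribute equals
    `label`, or None if no such label exists."""
    labels = list(attrs[key_attr])
    try:
        return labels.index(label)
    except ValueError:
        return None

# Hypothetical attribute dictionaries; real loom files hold many more entries.
row_attrs = {"Gene": ["Actb", "Gapdh", "Sox2"]}
col_attrs = {"CellID": ["cell_a", "cell_b"]}

# The same function serves both axes; only the key attribute differs.
gene_row = find_index(row_attrs, "Gene", "Gapdh")
cell_col = find_index(col_attrs, "CellID", "cell_b")
```

Because the key attribute is passed in rather than hard-coded, nothing in the lookup is specific to genes or cells.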

The biggest benefit of this is we can now more easily use the symmetry inherent to rows/columns at the code level. At the moment I have to manually duplicate the code for row and column plotters, rewriting "row" to "col" or the other way around every time I do so, taking extra care to not forget anything or make any typos. With the updated schema, I can rewrite everything to generic metadata-, scatter- and sparkline plotters (which consists of removing column/row specific code, so not much work, plus it would clean up the code a lot). These would then be wrapped in simple rowView/colView components that pass on the (row-/col-) attrs, name, and keyAttr, plus relevant view settings.

The end result is:

- Loom files could represent other data than genes and cells, and the website would automatically adjust its labels for it
- If it ever makes sense to have a different key attribute than genes or cell id, that is now possible
- Any fix/improvement to a plotter would apply to both columns and rows immediately
- Any new plotters would immediately support both rows and columns
- Less code, so a smaller website and fewer places to introduce bugs
- Cleaner code (less special-purpose code for rows/columns in plotters), again making it easier to avoid bugs
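
The wrapper idea can be sketched as follows. This is an illustration in Python, not the actual React components; the dataset keys `rowAttrs` and `colAttrs` and all function names are assumptions made for the example.

```python
def metadata_view(attrs, name, key_attr):
    """Generic view: knows nothing about rows vs columns.
    Here it just returns a title and the item labels."""
    return {"title": name, "labels": list(attrs[key_attr])}

def row_view(dataset, view=metadata_view):
    """Thin wrapper that feeds the row-specific fields to a generic view."""
    return view(dataset["rowAttrs"], dataset["rowName"], dataset["rowKeyAttr"])

def col_view(dataset, view=metadata_view):
    """Thin wrapper that feeds the column-specific fields to a generic view."""
    return view(dataset["colAttrs"], dataset["colName"], dataset["colKeyAttr"])

# Hypothetical dataset metadata, for illustration only.
dataset = {
    "rowName": "Genes", "rowKeyAttr": "Gene",
    "rowAttrs": {"Gene": ["Actb", "Sox2"]},
    "colName": "Cells", "colKeyAttr": "CellID",
    "colAttrs": {"CellID": ["cell_a", "cell_b", "cell_c"]},
}

rows = row_view(dataset)
cols = col_view(dataset)
```

Any fix made inside `metadata_view` reaches both axes, since the wrappers contain no plotting logic of their own.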

Checklist of changes

Loom file generation:
- [x] add creationDate field
- [x] add rowName field, defaults to "Genes"
- [x] add rowKeyAttr field, defaults to "Gene"
- [x] add colName field, defaults to "Cells"
- [x] add colKeyAttr field, defaults to "Cell_ID"
Loom server:
- [x] serve creationDate metadata with the dataset-list
- [x] for individual datasets, send all of the above as part of its metadata
- [x] use default values for loom files that miss the above data for backwards compatibility
Loom client:
- [ ] make datasetlist use creationDate instead of lastModified (or we could support both, but I think that's just noisy)
- [ ] replace navbar with dropdowns for rows and columns
- [ ] make menu use rowName and colName values (in practice "Genes" and "Cells")
  - dropdown menu has metadata, scatterplot and sparkline views
- [x] make views use rowKeyAttr/colKeyAttr to determine which attribute labels the rows/cols when selecting data to fetch
- [x] generalise metadata view
- [x] generalise scatterplot view
- [ ] generalise sparkline view, extend it to cells
- [x] create row/col wrapper view for the above views
- [x] update ReactRouter scheme for views:
  - old: /dataset/cellmetadata/:project/:dataset(/:viewsettings), etc
  - new: /dataset/row/:view/:project/:dataset(/:viewsettings), where "view" is metadata, sparkline, scatterplot, etc.
  - (note that heatmap is unchanged by any of this)
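
The server-side fallback for older files could look roughly like this. It is a sketch under assumptions: `with_schema_defaults` is a hypothetical helper, metadata is treated as a plain dictionary, and falling back to the last-modified time for a missing creationDate is one possible choice rather than something decided in this issue.

```python
SCHEMA_DEFAULTS = {
    "rowName": "Genes",
    "rowKeyAttr": "Gene",
    "colName": "Cells",
    # Spelled CellID in the source files reviewed further down
    # (the checklist writes Cell_ID).
    "colKeyAttr": "CellID",
}

def with_schema_defaults(metadata, last_modified=None):
    """Return a copy of `metadata` with missing schema fields filled in.

    Fields already present in the file always win over the defaults.
    """
    merged = dict(SCHEMA_DEFAULTS)
    merged.update(metadata)
    if "creationDate" not in merged:
        # Assumption: for old files, the modification time is the best guess.
        merged["creationDate"] = last_modified
    return merged

# An old file without any of the new fields, and a newer customised one.
old_file = with_schema_defaults({"title": "legacy"}, last_modified="2016-05-01")
new_file = with_schema_defaults({"rowName": "Proteins", "creationDate": "2016-06-01"})
```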

So part one of this update seems simple enough on the pipeline side. I've already modified this bit and pushed to Github.

What's slightly trickier is the second part. First, I wanted to check that I can safely assume all loom files will be generated with a Gene and CellID attribute for now. Looking at the source files:

- create_from_cef (loompy.py) has no guarantees, since it depends on an external CEF file, but ceftools.go features both Gene and CellID.
- create_from_pandas (loompy.py) is deprecated, so irrelevant.
- create_from_cellranger (loompy.py) adds Gene and CellID.
- create_loom (loom_pipeline.py) has GeneName and CellID.

So it looks like we're good to go, with two exceptions: I misremembered CellID as Cell_ID, and loom_pipeline.py uses GeneName instead of Gene. I guess the latter is because that's how the names are stored in the MySQL database? What I propose to do is to add the last two lines of code below, after the part that stores the row attributes in a dictionary:

    for i in range(len(transcriptome_headers)):
        hdr = transcriptome_headers[i]
        if i >= N_STD_FIELDS:
            hdr = "(" + dataset + ")_" + hdr
        row_attrs[hdr] = []
        for j in range(len(rows)):
            row_attrs[hdr].append(rows[j][i])
        row_attrs[hdr] = self._make_std_numpy_type(row_attrs[hdr], cursor.description[i][1])
    row_attrs['Gene'] = row_attrs['GeneName'] # rename `GeneName` attribute to `Gene`
    del row_attrs['GeneName']                 # delete superfluous attribute

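If a more defensive variant is wanted, the rename could be wrapped in a small helper. This is only a sketch: `rename_attr` is a hypothetical function, not something that exists in loom_pipeline.py, and whether a clash should raise an error is an assumption.

```python
def rename_attr(attrs, old, new):
    """Rename key `old` to `new` in the attribute dictionary, in place.

    Does nothing if `old` is absent, and refuses to overwrite an
    existing `new` attribute.
    """
    if old not in attrs:
        return attrs
    if new in attrs:
        raise KeyError("cannot rename %r: %r already exists" % (old, new))
    attrs[new] = attrs.pop(old)
    return attrs

# Made-up attribute values, for illustration only.
row_attrs = {"GeneName": ["Actb", "Sox2"], "Accession": ["a1", "a2"]}
rename_attr(row_attrs, "GeneName", "Gene")
```

Calling it a second time is harmless, because the old key is already gone.
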
I'm not entirely sure this doesn't break something, so can I get confirmation that this is a safe change, @slinnarsson?

Since we're not going to use it for now, we can skip the customisation options that adding names and keyAttrs gives, and just use the default name/keyAttr values I suggested to save time. If we would like to expose that at a later time, I guess we'll have to add them as extra function arguments in the relevant functions in loompy.py and loom_pipeline.py, and as optional flags when creating a loom file from the loom CLI.

linnarsson-lab / loom-viewer

Updates to Loom schema #85