linnarsson-lab / loom-viewer

Tool for sharing, browsing and visualizing single-cell data stored in the Loom file format
BSD 2-Clause "Simplified" License

Updates to Loom schema #85

Closed JobLeonard closed 7 years ago

JobLeonard commented 7 years ago

This issue tracks two different proposed updates to the Loom schema. If we use default values, we can easily implement them in such a way that everything is backwards compatible with older files that lack these fields.

Adding the date of creation to the loom file's metadata

Right now we keep track of the age of the loom file by what the file-system reports as the last time it was modified. Since the data is supposed to be immutable I thought this would be fine. However, some operations do seem to update the file, at least enough for the OS to think it was recently modified. This screws up the sorting of our datasets, which defaults to newest-first. What we really want is to access the creation date, but keeping track of file creation time is not supported on Linux.

So, the simple solution here is to store the date a file was created in the loom file itself.
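A minimal sketch of the backwards-compatible lookup, assuming the file-level metadata has already been read into a plain dict and that the new field is called `creation_date` and holds an ISO 8601 string (both are assumptions; this issue does not fix a name or format). Files without the field fall back to the file-system modification time, which is the current behaviour.

```python
import os
from datetime import datetime


def creation_timestamp(metadata, path):
    """Return the creation time of a loom file as a POSIX timestamp.

    `metadata` is the file-level metadata as a dict. `creation_date` is a
    hypothetical field name holding an ISO 8601 string with a UTC offset.
    """
    stored = metadata.get("creation_date")
    if stored is not None:
        return datetime.fromisoformat(stored).timestamp()
    # Older files lack the field: fall back to the modification time.
    return os.path.getmtime(path)


def sort_newest_first(datasets):
    """Sort (metadata, path) pairs so the most recently created comes first."""
    return sorted(datasets, key=lambda d: creation_timestamp(*d), reverse=True)
```

With this default, old files keep sorting as they do today, and new files sort by the stored date regardless of what the OS reports.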

Extra benefit: imagine we want to recreate an existing loom file because we improved, say, the backspin algorithm, or added a new field to it. While all other "general" metadata would remain identical, the creation date would change. So if we ever implement caching, we can rely on this to see if previously downloaded data is "stale" or not.
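The staleness check could be as small as the sketch below. `is_stale` and the `creation_date` field name are hypothetical, and treating a missing date as stale is a design choice made here, not something decided in this issue.

```python
def is_stale(cached_creation_date, server_creation_date):
    """Return True when cached data should be re-downloaded.

    Both arguments are the value of the (hypothetical) `creation_date`
    field. None means the field is missing, in which case the cache
    cannot be validated and is treated as stale.
    """
    if cached_creation_date is None or server_creation_date is None:
        return True
    return cached_creation_date != server_creation_date
```

Comparing for inequality rather than "older than" also covers the case where a file is replaced by a regenerated copy with an earlier timestamp.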

Naming and labelling the rows/columns

Right now, for plotters that show cells, we can fetch individual gene information. To do so, we look at one unofficial "magical" row attribute, Gene, to label individual rows. Users can then select which gene to fetch, which is a matter of finding which row is associated with which gene label, and then fetching that row number. That principle can be made more generic, in a backwards-compatible way.
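The label-to-row lookup described above can be sketched as follows. The function name and the gene labels are illustrative only; the real viewer code may do this differently.

```python
def build_label_index(labels):
    """Map each label to its row number (first occurrence wins on duplicates)."""
    index = {}
    for row, label in enumerate(labels):
        index.setdefault(label, row)
    return index


# Illustrative values of the `Gene` row attribute.
gene_labels = ["Actb", "Gapdh", "Sox2"]
row_of = build_label_index(gene_labels)

# The user picks a gene; this is the row number to fetch.
row_number = row_of["Gapdh"]
```

Nothing in the lookup depends on the attribute being called `Gene`, which is what makes the generalisation to an arbitrary key attribute possible.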

First, we add a name field for both rows and columns. By default, this would be Genes and Cells, but we could actually put any kind of data in there; the format is pretty data-agnostic. Then, we add a keyAttr field for both as well. This would tell the client which of the metadata attributes is responsible for labelling the individual rows/columns. By default, this would be Gene and (I assume) Cell_ID.
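A sketch of how a client might fill in these defaults for older files. The nested `row`/`col` layout and the function name are assumptions made for illustration; only the `name` and `keyAttr` fields and their default values come from this proposal.

```python
# Defaults proposed in this issue. The column key is spelled `CellID` here
# because the follow-up comment in this thread corrects `Cell_ID` to `CellID`.
AXIS_DEFAULTS = {
    "row": {"name": "Genes", "keyAttr": "Gene"},
    "col": {"name": "Cells", "keyAttr": "CellID"},
}


def resolve_axis(schema, axis):
    """Return name/keyAttr for `axis` ('row' or 'col'), filling in defaults.

    `schema` is whatever the file stores; older files store nothing, so
    every missing field falls back to the default.
    """
    stored = schema.get(axis, {})
    return {**AXIS_DEFAULTS[axis], **stored}
```

An old file with an empty schema resolves to Genes/Gene and Cells/CellID, so existing datasets behave exactly as before.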

The biggest benefit of this is we can now more easily use the symmetry inherent to rows/columns at the code level. At the moment I have to manually duplicate the code for row and column plotters, rewriting "row" to "col" or the other way around every time I do so, taking extra care to not forget anything or make any typos. With the updated schema, I can rewrite everything to generic metadata-, scatter- and sparkline plotters (which consists of removing column/row specific code, so not much work, plus it would clean up the code a lot). These would then be wrapped in simple rowView/colView components that pass on the (row-/col-) attrs, name, and keyAttr, plus relevant view settings.
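The wrapping idea, sketched in Python purely to show the shape (the viewer itself is not written in Python, and every name and the dataset layout below are hypothetical): one axis-agnostic function does the work, and two thin wrappers only select which attrs, name, and keyAttr to pass on.

```python
def describe_axis(attrs, name, key_attr):
    """Axis-agnostic helper: identical logic for rows and columns."""
    labels = attrs[key_attr]
    return f"{len(labels)} {name}, labelled by {key_attr}"


def row_view(dataset):
    """Thin wrapper that passes on the row-specific pieces."""
    row = dataset["row"]
    return describe_axis(row["attrs"], row["name"], row["keyAttr"])


def col_view(dataset):
    """Thin wrapper that passes on the column-specific pieces."""
    col = dataset["col"]
    return describe_axis(col["attrs"], col["name"], col["keyAttr"])


dataset = {
    "row": {"attrs": {"Gene": ["Actb", "Sox2"]}, "name": "Genes", "keyAttr": "Gene"},
    "col": {"attrs": {"CellID": ["c1", "c2", "c3"]}, "name": "Cells", "keyAttr": "CellID"},
}
```

The row/column distinction lives only in the wrappers, so there is no second copy of the logic to keep in sync.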

The end result is:

Checklist of changes

JobLeonard commented 7 years ago

So part one of this update seems simple enough on the pipeline side. I've already modified this bit and pushed to GitHub.

What's slightly trickier is the second part. First, I wanted to check that I can safely assume all loom files will be generated with a Gene and CellID attribute for now. Looking at the source files:

So it looks like we're good to go, with two exceptions: I misremembered CellID as Cell_ID, and loom_pipeline.py uses GeneName rather than Gene. I guess the latter is because that's how they're stored in the MySQL database? What I propose to do is to add the last two lines of code below, after the part that stores the row attributes in a dictionary:

    for i in range(len(transcriptome_headers)):
        hdr = transcriptome_headers[i]
        if i >= N_STD_FIELDS:
            hdr = "(" + dataset + ")_" + hdr
        row_attrs[hdr] = []
        for j in range(len(rows)):
            row_attrs[hdr].append(rows[j][i])
        row_attrs[hdr] = self._make_std_numpy_type(row_attrs[hdr], cursor.description[i][1])
    row_attrs['Gene'] = row_attrs['GeneName'] # rename `GeneName` attribute to `Gene`
    del row_attrs['GeneName']                 # delete superfluous attribute

I'm not entirely sure I'm not breaking something, so can I have confirmation that this is a safe change, @slinnarsson?
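One way to make the rename more defensive is sketched below. `rename_attr` is a hypothetical helper, not something that exists in loom_pipeline.py; it leaves the dictionary alone when `GeneName` is absent and refuses to overwrite an existing `Gene` attribute.

```python
def rename_attr(attrs, old, new):
    """Rename attrs[old] to attrs[new] in place and return attrs.

    Does nothing if `old` is absent; raises instead of silently
    overwriting an attribute that already exists under `new`.
    """
    if old not in attrs:
        return attrs
    if new in attrs:
        raise KeyError(f"attribute {new!r} already exists")
    attrs[new] = attrs.pop(old)
    return attrs
```

In the pipeline this would be called as `rename_attr(row_attrs, 'GeneName', 'Gene')` in place of the two added lines.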

Since we're not going to use them for now, we can skip the customisation options that names and keyAttrs would give, and just use the default name/keyAttr values I suggested, to save time. If we want to expose them later, I guess we'll have to add them as extra arguments to the relevant functions in loompy.py and loom_pipeline.py, and as optional flags when creating a loom file from the loom CLI.

JobLeonard commented 7 years ago

Mental note: put sorting back into viewsettings, so it can be linked