Closed JobLeonard closed 7 years ago
So part one of this update seems simple enough on the pipeline side. I've already modified this bit and pushed to GitHub.
What's slightly trickier is the second part. First, I wanted to check that I can safely assume all loom files will be generated with a `Gene` and `CellID` attribute for now. Looking at the source files:
- `create_from_cef` (`loompy.py`) has no guarantees, since it depends on an external CEF file, but `ceftools.go` features both `Gene` and `CellID`
- `create_from_pandas` (`loompy.py`) is deprecated, so irrelevant
- `create_from_cellranger` (`loompy.py`) adds `Gene` and `CellID`.
So it looks like we're good to go, except that I misremembered `CellID` as `Cell_ID`, and except for `GeneName` in `loom_pipeline.py`. I guess the latter is because that's how they're stored in the MySQL database? What I propose to do is to add the last two lines of code after the part that stores the row attributes in a dictionary:
```python
for i in range(len(transcriptome_headers)):
    hdr = transcriptome_headers[i]
    if i >= N_STD_FIELDS:
        hdr = "(" + dataset + ")_" + hdr
    row_attrs[hdr] = []
    for j in range(len(rows)):
        row_attrs[hdr].append(rows[j][i])
    row_attrs[hdr] = self._make_std_numpy_type(row_attrs[hdr], cursor.description[i][1])
row_attrs['Gene'] = row_attrs['GeneName']  # rename `GeneName` attribute to `Gene`
del row_attrs['GeneName']  # delete superfluous attribute
```
I'm not entirely sure I'm not breaking something so can I have a confirmation this is a safe change, @slinnarsson?
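For what it's worth, a slightly more defensive version of the rename could look like the sketch below. `rename_attr` is a hypothetical helper, not something that exists in `loom_pipeline.py`; it only assumes that `row_attrs` is a plain dict, and the sample values are made up.

```python
def rename_attr(attrs, old, new):
    """Rename key `old` to `new` in the dict `attrs`, in place.

    Hypothetical helper for illustration. Does nothing if `old` is absent,
    and refuses to overwrite an existing `new` attribute.
    """
    if old not in attrs:
        return attrs
    if new in attrs:
        raise KeyError("attribute %r already exists" % new)
    attrs[new] = attrs.pop(old)
    return attrs


# Made-up example data, standing in for the real row attributes.
row_attrs = {"GeneName": ["Actb", "Gapdh"], "Chromosome": ["5", "6"]}
rename_attr(row_attrs, "GeneName", "Gene")
print(sorted(row_attrs))  # ['Chromosome', 'Gene']
```

The guard against an existing `Gene` key is there because silently overwriting it would be harder to notice than an exception.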
Since we're not going to use it for now, we can skip the customisation options that adding names and keyAttrs gives, and just use the default name/keyAttr values I suggested to save time. If we would like to expose that at a later time, I guess we'll have to add them as extra function arguments in the relevant functions in `loompy.py` and `loom_pipeline.py`, and as optional flags when creating a loom file from the `loom` CLI.
Mental note: put sorting back into viewsettings, so it can be linked
This issue tracks two different proposed updates to the Loom schema. If we use default values, we can easily implement them in such a way that everything is backwards compatible with older files that lack these fields.
Adding the date of creation to the loom file's metadata
Right now we keep track of the age of the loom file by what the file-system reports as the last time it was modified. Since the data is supposed to be immutable I thought this would be fine. However, some operations do seem to update the file, at least enough for the OS to think it was recently modified. This screws up the sorting of our datasets, which defaults to newest-first. What we really want is to access the creation date, but keeping track of file creation time is not supported on Linux.
So, the simple solution here is to store the date a file was created in the loom file itself.
Extra benefit: imagine we want to recreate an existing loom file because we improved, say, the backspin algorithm, or added a new field to it. While all other "general" metadata would remain identical, the creation date would change. So if we ever implement caching, we can rely on this to see if previously downloaded data is "stale" or not.
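As a rough sketch of how the staleness check could work: the snippet below assumes the creation date is stored as an ISO 8601 UTC string, and that a missing date (older files without the field) should count as stale. Both are assumptions, and `new_creation_date` and `is_stale` are hypothetical names.

```python
from datetime import datetime, timezone


def new_creation_date():
    """Return the current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


def is_stale(cached_creation_date, server_creation_date):
    """True if the cached copy is older than the file the server has.

    Assumption: a missing date on either side is treated as stale.
    """
    if cached_creation_date is None or server_creation_date is None:
        return True
    cached = datetime.fromisoformat(cached_creation_date)
    current = datetime.fromisoformat(server_creation_date)
    return cached < current


old = "2017-01-10T09:00:00+00:00"
new = "2017-02-01T12:30:00+00:00"
print(is_stale(old, new))  # True
print(is_stale(new, new))  # False
```

Comparing parsed timestamps rather than raw strings keeps the check correct even if the two dates were written with different UTC offsets.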
Naming and labelling the rows/columns
Right now, for plotters that show cells, we can fetch individual gene information. To do so, we look at one unofficial "magical" row attribute, `Gene`, to label individual rows. Users can then select which gene to fetch, which is a matter of finding which row is associated with which gene label, and then fetching that row number. That principle can be made more generic, in a backwards-compatible way.

- First, we add a `name` field for both rows and columns. By default, this would be `Genes` and `Cells`, but we could actually put any kind of data in there; the format is pretty data-agnostic.
- Then, we add a `keyAttr` field for both as well. This would tell the client which of the metadata attributes is responsible for labelling the individual rows/columns. By default, this would be `Gene` and (I assume) `Cell_ID`
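To make the lookup concrete, here is a minimal sketch in Python (the real client code is not shown in this issue). The field names `rowName`, `rowKeyAttr`, `colName` and `colKeyAttr` follow the proposal; `with_defaults` and `find_index` are hypothetical helpers, and the gene labels are made up.

```python
# Assumed defaults, taken from the proposal; older files lack these fields.
DEFAULTS = {
    "rowName": "Genes",
    "rowKeyAttr": "Gene",
    "colName": "Cells",
    "colKeyAttr": "Cell_ID",
}


def with_defaults(schema):
    """Return a copy of `schema` with missing name/keyAttr fields filled in."""
    merged = dict(DEFAULTS)
    merged.update(schema)
    return merged


def find_index(attrs, key_attr, label):
    """Return the index of the row/column whose key attribute equals `label`."""
    return list(attrs[key_attr]).index(label)


# An older file has none of the new fields, so the defaults apply.
schema = with_defaults({})
row_attrs = {"Gene": ["Actb", "Gapdh", "Sox2"]}
print(find_index(row_attrs, schema["rowKeyAttr"], "Gapdh"))  # 1
```

Because the same two functions work for rows and columns alike, nothing in the lookup needs to know whether it is dealing with genes or cells.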
The biggest benefit of this is we can now more easily use the symmetry inherent to rows/columns at the code level. At the moment I have to manually duplicate the code for row and column plotters, rewriting "row" to "col" or the other way around every time I do so, taking extra care to not forget anything or make any typos. With the updated schema, I can rewrite everything to generic metadata-, scatter- and sparkline plotters (which consists of removing column/row specific code, so not much work, plus it would clean up the code a lot). These would then be wrapped in simple rowView/colView components that pass on the (row-/col-) `attrs`, `name`, and `keyAttr`, plus relevant view settings.

The end result is:
Checklist of changes
- `creationDate` field
- `rowName` field, defaults to "Genes"
- `rowKeyAttr` field, defaults to "Gene"
- `colName` field, defaults to "Cells"
- `colKeyAttr` field, defaults to "Cell_ID"
- `creationDate` metadata with the dataset-list
- `creationDate` instead of `lastModified` (or we could support both, but I think that's just noisy)
- `rowName` and `colName` values (in practice "Genes" and "Cells")
- `rowKeyAttr`/`colKeyAttr` to determine which attribute labels the rows/cols when selecting data to fetch
- `/dataset/cellmetadata/:project/:dataset(/:viewsettings)`, etc
- `/dataset/row/:view/:project/:dataset(/:viewsettings)`, where "view" is `metadata`, `sparkline`, `scatterplot`, etc.
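
For illustration only, this is roughly how a route of that shape breaks down into its parts. The sketch is in Python rather than the client's own router, `parse_route` is a hypothetical name, and accepting `col` next to `row` is my assumption; only the `row` form is listed above.

```python
import re

# Pattern modelled on /dataset/row/:view/:project/:dataset(/:viewsettings).
# Assumption: a matching "col" variant exists alongside "row".
ROUTE = re.compile(
    r"^/dataset/(?P<axis>row|col)/(?P<view>[^/]+)/(?P<project>[^/]+)"
    r"/(?P<dataset>[^/]+)(?:/(?P<viewsettings>[^/]+))?$"
)


def parse_route(path):
    """Return the route parameters as a dict, or None if the path does not match."""
    match = ROUTE.match(path)
    return match.groupdict() if match else None


# Made-up project and dataset names.
params = parse_route("/dataset/row/sparkline/myproject/mydataset.loom")
print(params["axis"], params["view"])  # row sparkline
```

The optional `viewsettings` segment comes back as `None` when it is absent, so the same handler can serve both forms of the URL.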