linnarsson-lab / loom-viewer

Tool for sharing, browsing and visualizing single-cell data stored in the Loom file format
BSD 2-Clause "Simplified" License

Flesh out "standard" optional fields in the loom spec #50

Closed JobLeonard closed 7 years ago

JobLeonard commented 8 years ago

So I just found an issue in the fetchGene action that basically boiled down to both the server and .loom API having changed. The former was a change in endpoints due to the shift from docker to a CLI tool, which was well documented. The other was that <dataset>.rowAttr.GeneName had changed to <dataset>.rowAttr.Gene, which was not specified anywhere.

Currently, the .loom file format specification makes no mention of this - I suppose because the Gene field is optional anyway.

However the client-side code does make certain assumptions about which datasets will be present in colAttrs and rowAttrs, and what type of data they will contain (Gene documents the gene names, so you'd expect strings instead of integers or floats).

It might be a good idea to write those down. After all, they are essentially "keyword" fields with special meaning within the client-side code (it might just be that Gene is the only one at the moment, but still).

This would also reduce the amount of reverse-engineering for me (and others, once they start using this tool).
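One way to write those assumptions down is a small table of reserved attribute names and the value types the client expects, which can then be checked against a file. The sketch below is only illustrative: the `STANDARD_ROW_ATTRS` table, the `check_attrs` helper and the dict-of-lists stand-in for row attributes are all hypothetical names, not part of the .loom spec or the viewer code.

```python
# Hypothetical table of "keyword" row attributes and the value type the
# client expects. Only Gene is taken from the discussion above.
STANDARD_ROW_ATTRS = {"Gene": str}


def check_attrs(attrs, standard):
    """Return a list of human-readable problems; empty means all good.

    `attrs` maps attribute name -> list of values (an assumed, simplified
    stand-in for the row or column attributes of a dataset).
    """
    problems = []
    for name, expected in standard.items():
        if name not in attrs:
            # Optional fields may be absent; report it so the UI can adapt.
            problems.append(f"{name}: missing (optional)")
            continue
        bad = [v for v in attrs[name] if not isinstance(v, expected)]
        if bad:
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(bad[0]).__name__}"
            )
    return problems


# A file that still uses the old attribute name, as in the bug above.
row_attrs = {"GeneName": ["Actb", "Gapdh"]}
print(check_attrs(row_attrs, STANDARD_ROW_ATTRS))
```

A check like this would have flagged the GeneName to Gene rename as a missing keyword field instead of letting fetchGene fail silently.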

JobLeonard commented 8 years ago

(Aside: on the plus side, we got "x-by-genes" working again! Even though the resulting plots are nonsensical and there's a weird bug where only a random selection of points is plotted)

image

https://www.youtube.com/watch?v=6O3XjZZL6QI

slinnarsson commented 8 years ago

Hm, yes, that's a hard UI challenge.

Original design

The original idea was, "let's have standard attributes like Gene, so we can assume they are always around, and simplify the UI". The ones we use currently I think are:

_tSNE1, _tSNE2 (t-SNE coordinates)
_PC1, _PC2  (PCA coordinates)
_Gene (gene names)

There are also others that are automatically generated by the pipeline, but we don't exploit them client-side currently (e.g. _LogCV, _LogMean, _Total, _Noise, _Excluded).

The benefit is that (for example) we can let the user select a row in the matrix based on a gene name, with no need for the user to specify where to search for that gene name. But this breaks silently if there's no Gene row attribute.

Possible solutions (for _tSNE1 etc.)

I think the best would be to have the UI look for those attributes and enable the buttons if they exist, otherwise not.
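As a rough sketch of that check (the `REQUIRED` table, the button labels and the `enabled_buttons` helper are hypothetical; only the attribute names come from the list above):

```python
# Attribute pairs that each plot button needs. The button labels are made up
# for illustration; the attribute names are the ones listed above.
REQUIRED = {
    "tSNE": ("_tSNE1", "_tSNE2"),
    "PCA": ("_PC1", "_PC2"),
}


def enabled_buttons(attr_names):
    """Map each button to True only if all attributes it needs are present."""
    present = set(attr_names)
    return {
        button: all(name in present for name in needed)
        for button, needed in REQUIRED.items()
    }


# A file with t-SNE coordinates but only one PCA coordinate.
print(enabled_buttons(["_tSNE1", "_tSNE2", "_PC1", "CellID"]))
```

The UI would then render a button as enabled or disabled from this mapping instead of assuming the attributes are always there.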

Possible solutions (for Gene)

The basic function we want is to select a row (i.e. a gene) by the value of a row attribute (i.e. Gene). This makes it possible to color not just by attributes but also by values in the main matrix.

And vice versa for columns.

Some ideas:

  1. Standardize the attributes and enforce them. Pros: UI will work for any valid .loom file. Cons: Impossible to create .loom files without those attributes (which may not always be available), so users will have to fake them, and then the UI will break anyway.
  2. Use standard attributes when they exist, otherwise fall back in a sensible way. Pros: UI doesn't break. Cons: If Gene is named GeneName in some file, the gene selection will still not work, and we'll have to explain why.
  3. Figure out (client-side or server-side) which attribute is Gene, if it exists. Pros: UI works no matter what the gene name annotation is called. Cons: Seems difficult to get right (esp. for new species).
  4. Replace (gene) with (search) and have it search through all row attributes (or column attributes, for Gene/genescape view). Pros: UI will work no matter what; will allow searching for other things than gene name (e.g. accession codes). Cons: If you search for X (intending it to find a gene name), it may find X in some other attribute and show the wrong thing (need UI to show clearly what was found); may find multiple matches but can show only one.
  5. Like 4, but add a new dropdown for search, where user can select which attribute to search in. Cons: Clunky
  6. Like 4, but allow power-users to type something like "GeneName:Actb" to restrict the search to the "GeneName" attribute

Btw the same UI could work in Sparkline view, where you type in multiple gene names. By default, Loom would search through all attributes and show all matching rows. If the user prefixes the search with GeneName: then Loom would search only that attribute.
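A minimal sketch of ideas 4 and 6 combined, assuming row attributes are available as a plain dict of name -> list of values. The `parse_query` and `find_first` helpers are hypothetical names, and exact string matching is an assumption (the real search might want to be case-insensitive or prefix-based):

```python
def parse_query(query, attr_names):
    """Split 'Attr:value' into (attr, value).

    The prefix only counts if it names an existing attribute, so a value
    that happens to contain a colon is still searched for as a whole.
    """
    prefix, sep, rest = query.partition(":")
    if sep and prefix.strip() in attr_names:
        return prefix.strip(), rest.strip()
    return None, query.strip()


def find_first(row_attrs, query):
    """Return (attribute name, row index) of the first match, or None."""
    attr, value = parse_query(query, row_attrs)
    names = [attr] if attr is not None else list(row_attrs)
    for name in names:
        for row, cell in enumerate(row_attrs[name]):
            # Only string attributes are searched.
            if isinstance(cell, str) and cell == value:
                return name, row
    return None


row_attrs = {
    "Accession": ["X1", "Actb"],   # contrived: Actb also appears here
    "GeneName": ["Gapdh", "Actb"],
}
print(find_first(row_attrs, "Actb"))           # searches every attribute
print(find_first(row_attrs, "GeneName:Actb"))  # restricted to GeneName
```

Returning the attribute name together with the row index is what lets the UI show where the match was found, which addresses the main con of idea 4.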

slinnarsson commented 8 years ago

Ok here's maybe a better variant of (6), and it should be relatively easy to implement. Here's how it would work in Cells (landscape) view:

  1. There's an item (find gene) in each of the drop-downs (X Coordinate, Y Coordinate and Color). It replaces (gene). Would be great if it could be shown in a clearly different style (e.g. as a light blue badge) to show it's not one of the normal attributes.
  2. When the user selects (find gene), then (1) a new input box appears below (with hint: "Search") and (2) a new drop-down appears listing all row attributes.
  3. This new drop-down has items like "in Gene", "in Chromosome", "in Position" (i.e. row attributes prefixed by "in ") as well as one called "Everywhere". The "Everywhere" item is the default.
  4. When the user types, say, Actb in the search box, then (1) if "Everywhere" is selected, Loom searches through all row attributes looking for a match (it can be smart about searching only for strings); (2) if a specific attribute is selected, it searches only that attribute
  5. When a match is found (i.e. the first matching row), it will be in some attribute at some row.
  6. Loom fetches that row and uses it as X, Y or Color (as the case may be)
  7. Loom indicates next to the search field (maybe in very small type just below) something like Actb found in GeneName, to indicate to the user where the match was.
  8. If the match was in the wrong attribute, the user can now simply select the desired attribute (e.g. GeneName) in the drop-down to adjust the search.

In Sparkline view, the same UI would work except the user can type in multiple values ((find genes) plural), and this leads to multiple rows being fetched. If the user searches "Everywhere" this could potentially give hits in multiple different attributes. I suggest we should first find which attribute has the largest number of hits, then redo the search limited to this attribute. The indicator would be 3 hits in GeneName.
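The multi-value "Everywhere" search for Sparkline view could look roughly like this. It is a sketch under assumptions: `find_many` and `indicator` are hypothetical helpers, row attributes are a plain dict of name -> list of values, and ties between attributes are broken by whichever attribute comes first.

```python
def find_many(row_attrs, values, attr=None):
    """Return (attribute, row indices) for a multi-value search.

    With attr=None ("Everywhere"), hits are counted per attribute and only
    the hits from the attribute with the most matches are kept.
    """
    wanted = set(values)
    names = [attr] if attr is not None else list(row_attrs)
    hits = {
        name: [i for i, cell in enumerate(row_attrs.get(name, [])) if cell in wanted]
        for name in names
    }
    # max() returns the first attribute among equals, so ties go to the
    # attribute that appears first.
    best = max(hits, key=lambda name: len(hits[name]), default=None)
    if best is None or not hits[best]:
        return None, []
    return best, hits[best]


def indicator(attr, rows):
    """Text for the small label next to the search field."""
    return "no hits" if attr is None else f"{len(rows)} hits in {attr}"


row_attrs = {
    "Alias": ["Actb", "x", "y"],
    "GeneName": ["Actb", "Gapdh", "Sox2"],
}
attr, rows = find_many(row_attrs, ["Actb", "Gapdh", "Sox2"])
print(indicator(attr, rows))
```

Collecting hits per attribute in one pass gives the same result as "find the best attribute, then redo the search limited to it", without searching twice.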

In Gene (genescape) view, the same UI would work except the roles of rows and columns are reversed. The label would be (find cells).

Implementation

@JobLeonard I think it could be implemented in three stages:

  1. Simply change "(gene)" to "(find genes)"; make the code search through all attributes instead of just Gene; return the first row that matches. This requires only minimal changes to get us to a working UI that will not rely on predefined Gene attributes. Do this for Heatmap, Sparkline and Cells views.
  2. Add the additional drop-down and the logic described above
  3. Add search functionality to Gene/genescape view (where it doesn't exist currently), and to the "gene attributes" part of Heatmap.

Steps 2 and 3 could be treated as future enhancements at this point.

JobLeonard commented 8 years ago

Ah, that last scheme would make the client both hard to break and very flexible, which looks like the best of both worlds. And I had already started preliminary work on a GeneFetcher component with dropdown, but I put that on hold because I figured getting heatmap/sparkline working had priority. So this looks perfectly sensible to me.

slinnarsson commented 8 years ago

As for the plot you made:

image

What you're seeing is probably the jitter I added to ensure the datapoints did not fall on top of each other. Basically, all gene expression is counted in low integers. If you have a thousand cells and most of them show values like 0, 1, 2 and 3, then you'll collapse almost all of them into a 4x4 grid of dots, and you lose any sense of how many cells are in each grid location.

What I did then was to jitter by adding a random number in [0, 0.5], and because the scale is logged in this case (automatically when you choose "(gene)") you get those squares that get smaller and smaller as you go up and right. The reason the data seems to be jumping is probably because the random numbers are added on each repaint.
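Both effects can be reproduced in a few lines. The sketch below is not the viewer's code: `jitter` and `log_extent` are hypothetical helpers, and `log2(1 + v)` is only an assumed form of the log scaling. It shows that a fixed seed makes the offsets identical on every call (so points stop jumping between repaints), and that the same additive jitter width spans less of a log axis at higher counts.

```python
import math
import random


def jitter(values, seed=0, width=0.5):
    """Add a uniform offset in [0, width] to each value, reproducibly."""
    rng = random.Random(seed)  # fixed seed -> same offsets on every call
    return [v + rng.uniform(0.0, width) for v in values]


def log_extent(v, width=0.5):
    """Length of the jitter range [v, v + width] on an assumed log2(1 + v) axis."""
    return math.log2(1 + v + width) - math.log2(1 + v)


counts = [0, 1, 2, 3]
first_paint = jitter(counts)
repaint = jitter(counts)
print(first_paint == repaint)  # the jittered positions do not change

for v in counts:
    # The square of jittered points shrinks as the count grows.
    print(v, round(log_extent(v), 3))
```

An alternative to seeding would be to compute the offsets once when the data is loaded and store them alongside the values.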

Not great, so here's a couple of things we should put on the wish list:

JobLeonard commented 8 years ago

> What you're seeing is probably the jitter I added to ensure the datapoints did not fall on top of each other. Basically, all gene expression is counted in low integers. If you have a thousand cells and most of them show values like 0, 1, 2 and 3, then you'll collapse almost all of them into a 4x4 grid of dots, and you lose any sense of how many cells are in each grid location.

To be honest, to me that sounds like this is not the right way to visualise this particular type of data at this scale. Maybe something like a bubble chart would make more sense? With surface area equal to cell count?
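For what it's worth, the aggregation for such a bubble chart is small. A sketch, with `bubbles` and `area_per_cell` as made-up names: identical (x, y) pairs are grouped, and each radius is chosen so that the circle's area is proportional to the number of cells at that point.

```python
import math
from collections import Counter


def bubbles(xs, ys, area_per_cell=1.0):
    """Group identical (x, y) points and return {(x, y): radius}.

    The radius is chosen so that circle area = count * area_per_cell.
    """
    counts = Counter(zip(xs, ys))
    return {
        point: math.sqrt(n * area_per_cell / math.pi)
        for point, n in counts.items()
    }


# Expression counts of two genes across five cells.
gene_x = [0, 0, 1, 1, 1]
gene_y = [0, 0, 2, 2, 2]
for point, radius in bubbles(gene_x, gene_y).items():
    print(point, round(radius, 3))
```

Scaling the radius by the square root of the count matters here: scaling the radius linearly would make a point with twice the cells look four times as large.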

Regardless, those UI features look like they should be implemented, yes. I'll put it on the list.