Closed JobLeonard closed 7 years ago
(Aside: on the plus side, we got "x-by-genes" working again! Even though the resulting plots are nonsensical and there's a weird bug where only a random selection of points is plotted)
Hm, yes, that's a hard UI challenge.
The original idea was, "let's have standard attributes like Gene, so we can assume they are always around, and simplify the UI". The ones we use currently, I think, are:
_tSNE1, _tSNE2 (t-SNE coordinates)
_PC1, _PC2 (PCA coordinates)
_Gene (gene names)
There are also others that are automatically generated by the pipeline, but we don't exploit them client-side currently (e.g. _LogCV, _LogMean, _Total, _Noise, _Excluded).
The benefit is that (for example) we can let the user select a row in the matrix based on a gene name, with no need for the user to specify where to search for that gene name. But this breaks silently if there's no Gene row attribute.
I think the best would be to have the UI look for those attributes and enable the buttons if they exist, otherwise not.
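A minimal sketch of that check, assuming the client already knows the row and column attribute names of the open file. The feature names and the feature-to-attribute mapping are hypothetical and only illustrate the idea; treating the t-SNE and PCA attributes as column attributes is also an assumption.

```python
# Hypothetical mapping: UI feature -> attributes it needs in order to work.
# Assumption: _tSNE1/_tSNE2 and _PC1/_PC2 are column (per-cell) attributes.
REQUIRED = {
    "tsne_plot": {"row": set(), "col": {"_tSNE1", "_tSNE2"}},
    "pca_plot": {"row": set(), "col": {"_PC1", "_PC2"}},
    "gene_select": {"row": {"Gene"}, "col": set()},
}


def enabled_features(row_attrs, col_attrs):
    """Return {feature: bool}; a feature is enabled only if every
    attribute it needs is present in the file."""
    rows, cols = set(row_attrs), set(col_attrs)
    return {
        name: need["row"] <= rows and need["col"] <= cols
        for name, need in REQUIRED.items()
    }
```

With a file that has t-SNE coordinates but names its gene attribute GeneName, only the t-SNE button would be enabled, instead of the gene selection failing silently.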
The basic function we want is to select a row (i.e. a gene) by the value of a row attribute (i.e. Gene). This makes it possible to color not just by attributes but also by values in the main matrix.
And vice versa for columns.
Some ideas:
[...] .loom file.
Cons: Impossible to create .loom files without those attributes (which may not always be available), so users will have to fake them, and then the UI will break anyway.
[...] Gene is named GeneName in some file, the gene selection will still not work, and we'll have to explain why.
[...] Gene, if it exists.
Pros: UI works no matter what the gene name annotation is called.
Cons: Seems difficult to get right (esp. for new species).
[...] (gene) with (search) and have it search through all row attributes (or column attributes, for Gene/genescape view).
Pros: UI will work no matter what; will allow searching for other things than gene name (e.g. accession codes).
Cons: If you search for X (intending it to find a gene name), it may find X in some other attribute and show the wrong thing (need UI to show clearly what was found); may find multiple matches but can show only one.
Btw the same UI could work in Sparkline view, where you type in multiple gene names. By default, Loom would search through all attributes and show all matching rows. If the user prefixes the search with GeneName: then Loom would search only that attribute.
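The search-with-optional-prefix idea could look roughly like this. It is a sketch only: `find_rows` and the in-memory dict of row attributes are hypothetical stand-ins, and exact string matching is an assumption. Each hit records the attribute it came from, so the UI can show where the match was found.

```python
def find_rows(row_attrs, query):
    """Search row attributes for an exact string match.

    row_attrs: dict mapping attribute name -> list of values (one per row).
    query: "Actb" searches every attribute; "GeneName:Actb" searches only
           GeneName (the prefix is honoured only if that attribute exists).
    Returns a list of (attribute name, row index) pairs.
    """
    attr, sep, term = query.partition(":")
    if sep and attr in row_attrs:
        names, term = [attr], term.strip()
    else:
        names, term = list(row_attrs), query.strip()

    hits = []
    for name in names:
        for i, value in enumerate(row_attrs[name]):
            # Only string attributes can hold names; skip numeric values.
            if isinstance(value, str) and value == term:
                hits.append((name, i))
    return hits
```

An unprefixed search can return hits from several attributes, which is exactly the ambiguity listed under Cons above.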
OK, here's maybe a better variant of (6), and it should be relatively easy to implement. Here's how it would work in Cells (landscape) view:
[...] (find gene) in each of the drop-downs (X Coordinate, Y Coordinate and Color). It replaces (gene). It would be great if it could be shown in a clearly different style (e.g. as a light blue badge) to show that it's not one of the normal attributes.
[...] (find gene), then (1) a new input box appears below (with hint: "Search") and (2) a new drop-down appears listing all column attributes.
[...] Actb in the search box, then (1) if "Everywhere" is selected, Loom searches through all row attributes looking for a match (it can be smart about searching only for strings); (2) if a specific attribute is selected, it searches only that attribute.
[...] Actb found in GeneName, to indicate to the user where the match was.
[...] GeneName) in the drop-down to adjust the search.
In Sparkline view, the same UI would work except the user can type in multiple values ((find genes), plural), and this leads to multiple rows being fetched. If the user searches "Everywhere", this could potentially give hits in multiple different attributes. I suggest we first find which attribute has the largest number of hits, then redo the search limited to this attribute. The indicator would be 3 hits in GeneName.
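The "largest number of hits" rule for the multi-value search might be sketched as follows. `find_genes` and `indicator` are hypothetical names, and counting matching rows as hits and breaking ties in favour of the first attribute are both assumptions. Because the matches are kept per attribute, no second search pass is needed in this version.

```python
def find_genes(row_attrs, terms):
    """Return (attribute, row indices) for the attribute with the most hits.

    row_attrs: dict mapping attribute name -> list of values (one per row).
    terms: the values the user typed in.
    Returns (None, []) when nothing matches anywhere.
    """
    wanted = set(terms)
    best_attr, best_rows = None, []
    for name, values in row_attrs.items():
        rows = [i for i, v in enumerate(values)
                if isinstance(v, str) and v in wanted]
        # Strictly greater: on a tie the earlier attribute is kept.
        if len(rows) > len(best_rows):
            best_attr, best_rows = name, rows
    return best_attr, best_rows


def indicator(attr, rows):
    """Text telling the user where the matches were found."""
    return f"{len(rows)} hits in {attr}" if attr else "no hits"
```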
In Gene (genescape) view, the same UI would work except the roles of rows and columns are reversed. The label would be (find cells).
@JobLeonard I think it could be implemented in three stages:
[...] Gene; return the first row that matches. This requires only minimal changes to get us to a working UI that will not rely on predefined Gene attributes. Do this for Heatmap, Sparkline and Cells views.
Steps 2 and 3 could be treated as future enhancements at this point.
Ah, that last scheme would make the client both hard to break and very flexible, which looks like the best of both worlds. And I had already started preliminary work on a GeneFetcher component with dropdown, but I put that on hold because I figured getting heatmap/sparkline working had priority. So this looks perfectly sensible to me.
As for the plot you made:
What you're seeing is probably the jitter I added to ensure the datapoints did not fall on top of each other. Basically, all gene expression is counted in low integers. If you have a thousand cells and most of them show values like 0, 1, 2 and 3, then you'll collapse almost all of them into a 4x4 grid of dots, and you lose any sense of how many cells are in each grid location.
What I did then was to jitter by adding a random number in [0, 0.5], and because the scale is logged in this case (automatically when you choose "(gene)") you get those squares that get smaller and smaller as you go up and right. The reason the data seems to be jumping is probably because the random numbers are added on each repaint.
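If re-randomizing on every repaint is indeed the cause, one possible fix is to compute the offsets once and reuse them. The sketch below assumes a seeded generator and hypothetical helper names; it is not the client's actual jitter code.

```python
import random


def make_jitter(n, width=0.5, seed=0):
    """Return n offsets in [0, width], reproducible for a given seed."""
    rng = random.Random(seed)
    return [rng.uniform(0.0, width) for _ in range(n)]


def jittered(values, offsets):
    """Apply precomputed offsets; repeated calls give identical output."""
    return [v + o for v, o in zip(values, offsets)]


counts = [0, 1, 1, 2, 3]              # low integer expression values
offsets = make_jitter(len(counts))    # computed once, e.g. when data loads
first_paint = jittered(counts, offsets)
second_paint = jittered(counts, offsets)  # same positions as first_paint
```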
Not great, so here's a couple of things we should put on the wish list:
> What you're seeing is probably the jitter I added to ensure the datapoints did not fall on top of each other. Basically, all gene expression is counted in low integers. If you have a thousand cells and most of them show values like 0, 1, 2 and 3, then you'll collapse almost all of them into a 4x4 grid of dots, and you lose any sense of how many cells are in each grid location.
To be honest, to me that sounds like this is not the right way to visualise this particular type of data at this scale. Maybe something like a bubble chart would make more sense? With surface area equal to cell count?
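As a rough illustration of the bubble-chart idea, the cells could be aggregated per grid location and the radius derived from the count so that the area is proportional to the number of cells. The function name and the `scale` parameter are hypothetical.

```python
from collections import Counter
from math import sqrt


def bubbles(xs, ys, scale=1.0):
    """Collapse identical (x, y) points into one bubble per location.

    Returns (x, y, cell count, radius) tuples sorted by location.
    Area is pi * r^2, so r = scale * sqrt(count) makes area scale with count.
    """
    counts = Counter(zip(xs, ys))
    return [(x, y, n, scale * sqrt(n)) for (x, y), n in sorted(counts.items())]
```

A thousand cells with values 0 to 3 on both axes would then give at most 16 bubbles, with no jitter needed.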
Regardless, those UI features look like they should be implemented, yes. I'll put it on the list.
So I just found an issue in the fetchGene action that basically boiled down to both the server and .loom API having changed. The former was a change in endpoints due to the shift from docker to a CLI tool, which was well documented. The other was that <dataset>.rowAttr.GeneName had changed to <dataset>.rowAttr.Gene, which was not specified anywhere.
Currently, the .loom file format specification makes no mention of this - I suppose because the Gene field is optional anyway.
However, the client-side code does make certain assumptions about which datasets will be present in colAttrs and rowAttrs, and what type of data they will contain (Gene documents the gene names, so you'd expect strings instead of integers or floats).
It might be a good idea to write those down. After all, they are essentially "keyword" fields with special meaning within the client-side code (it might just be that Gene is the only one at the moment, but still).
This would also reduce the amount of reverse-engineering for me (and others, once they start using this tool).
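Once written down, such a list could double as a machine-checkable table. A minimal sketch, assuming Gene is the only keyword row attribute and must hold strings; the table and the `check_attrs` helper are hypothetical and not part of the .loom specification.

```python
# Hypothetical table of "keyword" row attributes and their expected value type.
EXPECTED_ROW_ATTRS = {"Gene": str}


def check_attrs(attrs, expected):
    """Return a list of human-readable problems (empty if all is fine).

    attrs: dict mapping attribute name -> list of values.
    expected: dict mapping attribute name -> required Python type.
    """
    problems = []
    for name, kind in expected.items():
        if name not in attrs:
            problems.append(f"{name}: missing")
        elif not all(isinstance(v, kind) for v in attrs[name]):
            problems.append(f"{name}: expected {kind.__name__} values")
    return problems
```

Since Gene is optional, a "missing" entry would more likely be a warning that disables gene selection than a hard error.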