Scatterplot enhancements

JobLeonard commented 7 years ago

A number of ways the scatterplot can be enhanced

Labeled X/Y axes

Right now there are no displayed values. Sometimes that's appropriate (tSNE), but when using, say, gene expression for an x/y position, labeled axes make sense.

Assume string arrays represent categories

So, really basic insight I had while working on the violin plot: while sometimes strings are used as unique markers (CellID for the columns, Gene for the rows), they are also often a way to categorize (Class or Subclass for columns).

In the latter case, it doesn't seem unreasonable to me to "bin" the cells per category like a histogram. We can auto-choose the top 20 (or fewer) categories like we do for colours, create an equal number of bins, then plot the points column-wise into "sub-scatterplots".
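A minimal sketch of that binning step, not the viewer's actual code. The function name `binByCategory`, the `maxBins` parameter, and the extra shared bin for everything outside the top categories are assumptions made for illustration.

```typescript
// Hypothetical helper: map each string value to a bin (column) index.
// The default of 20 follows the "top 20" idea; the shared overflow bin
// is an assumption, not something the viewer is known to do.
function binByCategory(
  values: string[],
  maxBins: number = 20
): { labels: string[]; bins: number[] } {
  // Count how often each category occurs.
  const counts = new Map<string, number>();
  for (const v of values) {
    counts.set(v, (counts.get(v) ?? 0) + 1);
  }
  // Most frequent categories first; ties are broken alphabetically.
  const labels = Array.from(counts.entries())
    .sort((a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0))
    .slice(0, maxBins)
    .map(([label]) => label);
  const binOf = new Map<string, number>();
  labels.forEach((label, i) => binOf.set(label, i));
  // Values outside the top categories share one extra bin at the end.
  const otherBin = labels.length;
  const bins = values.map((v) => binOf.get(v) ?? otherBin);
  return { labels, bins };
}
```

Each bin index can then serve as the x-coordinate of a "sub-scatterplot" column.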

I can kind-of approximate this already by using the backspin clustering:

We'd need to add a bit of jitter to avoid drawing all cells in one place, like we already do for genes. It would probably look somewhat similar to this:

Note the functional overlap with the violin plot (see #10). However, when zooming in on one gene, the scatterplot approach might be easier to "eyeball" and get an intuitive sense of the data.

We could also add an optional scatterplot background in the violin plot view, which would kind-of look like this I guess:

(from https://www.r-bloggers.com/part-3a-plotting-with-ggplot2/ )

Make jittering selectable by the user

It's clear that jitter is always necessary for gene expression, but there are also other situations when adding jitter makes sense (like when categorising strings, or using backspin values as x/y points). The problem is that not all situations where it makes sense can really be detected ahead of time; we won't know what kind of meta-data will be added to the rows and columns. So I think it has to be user selectable (while on by default for genes and string-based categorisation).
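As a rough sketch of "user selectable, but on by default for genes and string categories": the `AttrKind` type, the function names, and the use of a small seeded generator (so the jitter stays the same between redraws) are all assumptions for illustration, not the viewer's implementation.

```typescript
// Hypothetical attribute kinds; the real viewer may classify data differently.
type AttrKind = "gene" | "category" | "numeric";

// Default suggested above: jitter on for genes and string-based categories.
function jitterByDefault(kind: AttrKind): boolean {
  return kind === "gene" || kind === "category";
}

// Small seeded PRNG (mulberry32-style) returning values in [0, 1).
function seededRandom(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Returns a jittered copy; each value moves by at most width / 2.
function applyJitter(
  values: number[],
  enabled: boolean,
  width: number = 1,
  seed: number = 1
): number[] {
  if (!enabled) return values.slice();
  const rand = seededRandom(seed);
  return values.map((v) => v + (rand() - 0.5) * width);
}
```

A settings checkbox would then only override the `enabled` flag, starting from `jitterByDefault(kind)`.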

Move legend to settings tab

See #46

Interactivity

This is a bit of an umbrella, but I kind of lost track of all the requested forms of interactivity. Sten, could you list what I'm missing here?

hovering over the legend to highlight the cells matching that category/heatmap value
zooming/panning
selecting cells by dragging around them and then... something. This is where my notes fail me
some form of cross-filtering perhaps? The question is what predicates to filter on

Scatter Matrix

Allow for selecting multiple types of metadata for x- and y-coordinates and create a scatter matrix. This could be especially powerful if interactive and linked, like in this vega example:


http://vega.github.io/vega-editor/index.html?mode=vega&spec=linking

JobLeonard commented 7 years ago

Check this out:

JobLeonard commented 7 years ago

Got genes working too:

slinnarsson commented 7 years ago

Nice!



JobLeonard commented 7 years ago

I just tried drawing a dataset with 26000+ points and 5x5 scatterplots. It took about ten seconds to render. I think I can bring the rendering time back to below two seconds by combining reduced plots + sprites. Still choppy, but a world of difference in how it feels.

First, I'm thinking of changing the code to plot each combination of attributes only once; it's less redundant information on the screen, increasing clarity. Also, for n attributes/genes, plotting n choose 2 = n(n-1)/2 plots instead of n² cuts the work by more than half.
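To make the count concrete, here is a small sketch that enumerates each unordered attribute pair exactly once. The function name `uniquePairs` and the attribute names in the usage note are made up for illustration.

```typescript
// Hypothetical helper: list every unordered pair of attributes once,
// skipping both the diagonal (a, a) and the mirrored duplicate (b, a).
function uniquePairs(attrs: string[]): Array<[string, string]> {
  const pairs: Array<[string, string]> = [];
  for (let i = 0; i < attrs.length; i++) {
    for (let j = i + 1; j < attrs.length; j++) {
      pairs.push([attrs[i], attrs[j]]);
    }
  }
  return pairs;
}
```

For five attributes, for example `uniquePairs(["a", "b", "c", "d", "e"])`, this yields 10 pairs, compared to the 25 cells of the full 5x5 grid.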

Also, switching to using sprites for the datapoints should give a speed boost - when I did some basic tests with this a few months ago it was a lot faster, but it would have complicated the plotter code a lot at the time. However, the plotting code has been simplified a lot, due to architectural re-organisations. So it should be easier to implement now.

JobLeonard commented 7 years ago

I may have accidentally found an interesting way to make really dense plots more informative: sort by x/y attributes (although sadly the result could be considered a bit uglier).

Basically, I was messing around and noticed that sometimes the plots looked a bit weird. Then I realised I had sorted the data by the x/y attributes by accident. Below I have comparison images, the first being the original order of the columns, the other sorted by x/y attributes. Click for zoomed in versions.

What you can see in the tSNE plot is that in the second version the circles are tiled like scales on a fish. While slightly distracting, this means that one point has to be perfectly covered by another to be hidden. So we are much less likely to "hide" one colour, and we can immediately see that in the difference in visible categories. That also makes it much more obvious if a region is actually really noisy for a given colour ramp.

Below is the SFDP plot, with Louvain-Jaccard (LJ) colouring. The following sort keys (and thus drawing orders) were used:

1. original order
2. LJ (asc), SFDP_Y, SFDP_X
3. LJ (desc), SFDP_Y, SFDP_X
4. SFDP_Y, SFDP_X

What the second and third picture of these four hopefully illustrate is how misleading these images can be if we're not careful: some clusters are completely hidden by others (and note that before this week, we drew in order of colour, so this was the norm).

In the fourth image, the edges between regions are much "fuzzier" than before, representing the underlying data better. There are also more previously-hidden lighter dots in the upper right regions.
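The sort keys listed above can be sketched as a multi-key comparison over point indices. This is only an illustration: the `Point` shape, its field names, and `drawOrder` are assumptions, with `cluster` standing in for the LJ value and `x`/`y` for SFDP_X/SFDP_Y.

```typescript
// Hypothetical point record; all fields are numeric so they can be compared.
interface Point {
  x: number;
  y: number;
  cluster: number;
}

type SortKey = { key: keyof Point; descending?: boolean };

// Returns the indices of `points` in drawing order (first drawn = bottom).
function drawOrder(points: Point[], keys: SortKey[]): number[] {
  const order = points.map((_, i) => i);
  order.sort((a, b) => {
    for (const { key, descending } of keys) {
      const diff = points[a][key] - points[b][key];
      if (diff !== 0) return descending ? -diff : diff;
    }
    return a - b; // exact ties keep their original relative order
  });
  return order;
}
```

Under these assumptions, the fourth ordering would be `drawOrder(points, [{ key: "y" }, { key: "x" }])`, and the second would put `{ key: "cluster" }` in front of those two keys.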

Finally, for this PCA plot we can see that there are so many values in one incredibly dense spot that it becomes impossible to distinguish the actual categories. This might actually be a good thing: currently, once a region is "saturated" with circles (meaning it's all covered), it's hard to see if the "density" of datapoints is higher than that. This somewhat mitigates that.

This is no substitute for better dot-scaling of course, which will be implemented. But it's an interesting workaround for now (or maybe even in combination with it).

I could make the buttons that set the x/y coordinates to tSNE/PCA/SFDP also trigger the appropriate sorting-by. What do you think?

JobLeonard commented 7 years ago

More examples, with gene-based colouring this time:

Note how in the second picture the combination of outline stroke+density creates shaded areas. And one more for a gene with many zeros:

The differences here are subtle, and probably only noticeable if you open the full-screen versions and quickly switch between tabs.

Lars came to me earlier with the suggestion of first drawing all the "uncoloured" values (so zeroes for heatmaps, or "other" for categories), and then drawing everything else. I think there might be some validity in that - in both of these examples some measured values are almost hidden by the outlines of zero-values.

(also, I will soon have filter-by-gene values working, in which case we can filter out zero-values for genes)

JobLeonard commented 7 years ago

A funny side-effect of this method of drawing is that it creates a false sense of perspective. This is most obvious if we compare different orders:

The last version is the easiest to read and "calmest" option for me; it evokes the feeling of looking from above at a crowd or a forest. In general I think it helps to rely on features of perception the human eye is trained on. Some more examples of the last tiling order:

I'm going to go ahead and implement this now, since it's a trivial thing to implement compared to a slider for relative scaling; I'll do the latter too of course.

JobLeonard commented 7 years ago

We almost have bin-by-stringname working:

But there are still a few weird bugs:

JobLeonard commented 7 years ago

Fixed! And in the process fixed a few other sneaky bugs with the sprites and coloring:

JobLeonard commented 7 years ago

I've changed the scatterplot so that it now always sorts by x- and y-axis. This was needed in order for it to work with jittering. If you think it's better without, I can make this an optional setting in the side-panel instead.

It also now paints the "other" values first, then the values that actually have a category (or are non-zero, in the case of heatmap). This was a request by Lars. Again, this can be made a setting that can be turned on and off.
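A sketch of how both behaviours could sit behind settings, assuming a hypothetical `DrawSettings` object and an `isOther` flag per point (true for "other" or zero values). Sorting by y before x is also an assumption, mirroring the SFDP_Y, SFDP_X key order used earlier in this thread.

```typescript
// Hypothetical settings; neither name comes from the viewer's code.
interface DrawSettings {
  sortByXY: boolean;
  otherFirst: boolean;
}

// Returns point indices in painting order (first painted = bottom).
function paintOrder(
  x: number[],
  y: number[],
  isOther: boolean[],
  settings: DrawSettings
): number[] {
  let order = x.map((_, i) => i);
  if (settings.sortByXY) {
    // y first, then x, then original index for exact ties.
    order.sort((a, b) => y[a] - y[b] || x[a] - x[b] || a - b);
  }
  if (settings.otherFirst) {
    // Stable partition: "other"/zero points go first, and both groups
    // keep the relative order they had before this step.
    order = order
      .filter((i) => isOther[i])
      .concat(order.filter((i) => !isOther[i]));
  }
  return order;
}
```

With both flags off the points are painted in their original column order, so each behaviour can be toggled independently.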

linnarsson-lab / loom-viewer