linnarsson-lab / loom-viewer

Tool for sharing, browsing and visualizing single-cell data stored in the Loom file format
BSD 2-Clause "Simplified" License

Scatterplot enhancements #60

Closed JobLeonard closed 7 years ago

JobLeonard commented 7 years ago

A number of ways the scatterplot can be enhanced

labeled X/Y axis

Right now no values are displayed along the axes. Sometimes that's appropriate (tSNE), but when using, say, gene expression for an x/y position, labels make sense.

Assume string arrays represent categories

So, really basic insight I had while working on the violin plot: while sometimes strings are used as unique markers (CellID for the columns, Gene for the rows), they are also often a way to categorize (Class or Subclass for columns).

In the latter case, it doesn't seem unreasonable to me to "bin" the cells per category like a histogram. We can auto-choose the top 20 (or fewer) categories like we do for colours, create an equal number of bins, then plot the points column-wise into "sub-scatterplots".
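A minimal sketch of that binning idea, in Python rather than the viewer's own code. The function name `bin_by_category`, the `max_bins` parameter and the shared "(other)" column for the remaining labels are assumptions made for illustration, not how the viewer does it.

```python
from collections import Counter


def bin_by_category(labels, max_bins=20, other="(other)"):
    """Map each string label to the column index of a sub-scatterplot.

    The `max_bins` most frequent labels each get their own column;
    all remaining labels share one trailing "other" column (assumption).
    """
    counts = Counter(labels)
    top = [label for label, _ in counts.most_common(max_bins)]
    column_of = {label: i for i, label in enumerate(top)}
    # Only add the "other" column when some labels did not make the cut.
    names = top + ([other] if len(counts) > len(top) else [])
    columns = [column_of.get(label, len(top)) for label in labels]
    return names, columns


# Hypothetical class labels, limited to two bins to keep the example small.
names, columns = bin_by_category(
    ["neuron", "glia", "neuron", "astro", "neuron", "glia"], max_bins=2
)
```

Each cell's x position would then be its column index plus some jitter, with the y position still coming from the chosen attribute.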

I can kind-of approximate this already by using the backspin clustering:

image

We'd need to add a bit of jitter to avoid drawing all cells in one place, like we already do for genes. It would probably look somewhat similar to this:

image

Note the functional overlap with the violin plot (see #10). However, when zooming in on one gene, the scatterplot approach might be easier to "eyeball" and get an intuitive sense of the data.

We could also add an optional scatterplot background in the violin plot view, which would kind-of look like this I guess:

image

(from https://www.r-bloggers.com/part-3a-plotting-with-ggplot2/ )

Make jittering selectable by the user

It's clear that jitter is always necessary for gene expression, but there are also other situations when adding jitter makes sense (like when categorising strings, or using backspin values as x/y points). The problem is that not all situations where it makes sense can really be detected ahead of time; we won't know what kind of meta-data will be added to the rows and columns. So I think it has to be user selectable (while on by default for genes and string-based categorisation).
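As a rough sketch of what a user-selectable jitter could look like (Python, for illustration only): `default_jitter`, `jitter`, the attribute kinds and the uniform noise of one unit width are all assumptions, not the viewer's actual API.

```python
import random


def default_jitter(kind):
    """Jitter is on by default for genes and string categories (assumed kinds)."""
    return kind in ("gene", "string")


def jitter(values, enabled=True, width=1.0, seed=None):
    """Return a copy of `values`, optionally offset by uniform noise.

    Each value moves by at most `width / 2` in either direction, so
    integer-valued data stays visually inside its own bin.
    """
    if not enabled:
        return list(values)
    rng = random.Random(seed)
    half = width / 2.0
    return [v + rng.uniform(-half, half) for v in values]


# The user's choice overrides the default for the attribute kind.
user_choice = None  # None means "no explicit choice made"
enabled = default_jitter("gene") if user_choice is None else user_choice
xs = jitter([0, 0, 1, 3, 3, 3], enabled=enabled, seed=42)
```

Keeping the default separate from the toggle means new kinds of metadata can simply start with jitter off and let the user switch it on.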

Move legend to settings tab

See #46

Interactivity

This is a bit of an umbrella, but I kind of lost track of all the requested forms of interactivity. Sten, could you list what I'm missing here?

Allow for selecting multiple types of metadata for x- and y-coordinates and create a scatter matrix. This could be especially powerful if interactive and linked, like in this vega example:

vega vega 1

http://vega.github.io/vega-editor/index.html?mode=vega&spec=linking

JobLeonard commented 7 years ago

Check this out:

image

image

JobLeonard commented 7 years ago

Got genes working too:

image

image

slinnarsson commented 7 years ago

Nice!



JobLeonard commented 7 years ago

I just tried drawing a dataset with 26000+ points and 5x5 scatterplots. It took about ten seconds to render. I think I can bring the rendering time back to below two seconds by combining reduced plots + sprites. Still choppy, but a world of difference in how it feels.

First, I'm thinking of changing the code to plot each combination of attributes only once; that puts less redundant information on the screen, increasing clarity. Also, for n attributes/genes, plotting n nCr 2 = n(n-1)/2 combinations instead of the full n x n grid cuts the work by more than half.
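To make the saving concrete, here is a small Python sketch that enumerates each unordered pair once. The attribute names are made up for the example, and `plot_pairs` is a hypothetical helper.

```python
from itertools import combinations


def plot_pairs(attributes):
    """Return every unordered pair of attributes exactly once."""
    return list(combinations(attributes, 2))


# Hypothetical selection of five attributes/genes.
attrs = ["attr_a", "attr_b", "attr_c", "gene_x", "gene_y"]
pairs = plot_pairs(attrs)

full_grid = len(attrs) ** 2  # 25 cells in the full 5x5 matrix
unique = len(pairs)          # 5 * 4 / 2 = 10 plots
```

For five attributes that is 10 plots instead of 25, since the mirrored plots and the diagonal are dropped.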

Also, switching to using sprites for the datapoints should give a speed boost. When I did some basic tests with this a few months ago it was a lot faster, but it would have complicated the plotter code a lot at the time. However, the plotting code has been simplified a lot since then, due to architectural re-organisations, so it should be easier to implement now.
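The sprite idea, roughly: rasterise the circle once, then copy those pixels to every point instead of re-drawing a circle each time. The sketch below is a plain-Python stand-in for that technique on a nested-list "canvas"; `make_disk_sprite` and `stamp` are hypothetical names, and the real viewer would do this on a browser canvas instead.

```python
def make_disk_sprite(radius):
    """Precompute the pixel offsets of a filled circle, once."""
    r2 = radius * radius
    return [
        (dx, dy)
        for dy in range(-radius, radius + 1)
        for dx in range(-radius, radius + 1)
        if dx * dx + dy * dy <= r2
    ]


def stamp(canvas, sprite, cx, cy, colour):
    """Copy the precomputed sprite onto the canvas, clipped to its bounds."""
    height, width = len(canvas), len(canvas[0])
    for dx, dy in sprite:
        x, y = cx + dx, cy + dy
        if 0 <= x < width and 0 <= y < height:
            canvas[y][x] = colour


canvas = [[0] * 5 for _ in range(5)]
sprite = make_disk_sprite(1)          # built once
for cx, cy in [(2, 2), (0, 0)]:       # reused for every datapoint
    stamp(canvas, sprite, cx, cy, 1)
filled = sum(map(sum, canvas))
```

The per-point cost drops to a fixed pixel copy, which is where the speed-up over drawing 26000+ individual circles would come from.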

JobLeonard commented 7 years ago

I may have accidentally found an interesting way to improve the information of really dense plots: sort by x/y attributes (although sadly the result could be considered a bit uglier).

Basically, I was messing around and noticed that sometimes the plots looked a bit weird. Then I realised I had sorted the data by the x/y attributes by accident. Below are comparison images, the first in the original order of the columns, the second sorted by x/y attributes. Click for zoomed-in versions.

image

image

What you can see in the tSNE plot is that in the second version the circles are tiled like scales on a fish. While slightly distracting, this means that one point has to be perfectly covered by another to be hidden. So we are much less likely to "hide" one colour, and we can immediately see that in the difference in visible categories. That also makes it much more obvious if a region is actually really noisy for a given colour ramp.
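The tiling effect comes purely from the order in which the points are drawn. A minimal Python sketch, assuming the sort key is x first and then y (the exact key order is a guess; `draw_order` is a hypothetical helper):

```python
def draw_order(xs, ys):
    """Return point indices sorted by x, then by y.

    Drawing in this order makes every circle overlap its neighbours in
    the same direction, which produces the fish-scale tiling.
    """
    return sorted(range(len(xs)), key=lambda i: (xs[i], ys[i]))


xs = [2.0, 1.0, 1.0]
ys = [0.0, 5.0, 3.0]
order = draw_order(xs, ys)  # indices to draw, first to last
```

Sorting indices rather than the data itself keeps the colour and category arrays aligned with their points.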

Below is the SFDP plot, with Louvain-Jaccard (LJ) colouring. The following sort keys (and thus drawing orders) were used:

image

image

image

image

What the second and third pictures of these four hopefully illustrate is how misleading these images can be if we're not careful: some clusters are completely hidden by others (and note that before this week, we drew in order of colour, so this was the norm).

In the fourth image, the edges between regions are much "fuzzier" than before, representing the underlying data better. There are also more previously hidden lighter dots in the upper right regions.

image

image

Finally, for this PCA plot we can see that there are a lot of values in such an incredibly dense spot that it becomes impossible to distinguish the actual categories. This might actually be a good thing: currently, once a region is "saturated" with circles (meaning it's all covered), it's hard to see if the "density" of datapoints is higher than that. This somewhat mitigates that.

This is no substitute for better dot-scaling of course, which will be implemented. But it's an interesting workaround for now (or maybe even in combination with it).

I could make the buttons that set the x/y coordinates to tSNE/PCA/SFDP also trigger the appropriate sorting-by. What do you think?

JobLeonard commented 7 years ago

More examples, with gene-based colouring this time:

image

image

Note how in the second picture the combination of outline stroke+density creates shaded areas. And one more for a gene with many zeros:

image

image

The differences here are subtle, and probably only noticeable if you open the full-screen versions and quickly switch between tabs.

Lars came to me earlier with the suggestion of first drawing all the "uncoloured" values (so zeroes for heatmaps, or "other" for categories), and then drawing everything else. I think there might be some validity in that: in both of these examples some measured values are almost hidden by the outlines of zero-values.

(also, I will soon have filter-by-gene values working, in which case we can filter out zero-values for genes)
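Lars's suggestion amounts to a stable partition of the draw order. A hedged Python sketch, where `paint_order` and the default "zero means uncoloured" test are assumptions for illustration:

```python
def paint_order(values, is_uncoloured=lambda v: v == 0):
    """Return indices with uncoloured points first, the rest after.

    sorted() is stable and False sorts before True, so the relative
    order within each group (e.g. an earlier x/y sort) is preserved.
    """
    return sorted(
        range(len(values)), key=lambda i: not is_uncoloured(values[i])
    )


# Heatmap case: zeroes are painted first, so they cannot hide real values.
order = paint_order([3, 0, 1, 0])

# Category case: "other" counts as uncoloured.
cat_order = paint_order(
    ["a", "other", "b"], is_uncoloured=lambda v: v == "other"
)
```

Because the sort is stable, this can be layered on top of an existing x/y ordering without undoing it.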

JobLeonard commented 7 years ago

A funny side-effect of this method of drawing is that it creates a false sense of perspective. This is most obvious if we compare different orders:

image

image

image

image

The last version is the easiest to read and "calmest" option for me; it evokes the feeling of looking from above at a crowd or a forest. In general I think it helps to rely on features of perception the human eye is trained on. Some more examples of the last tiling order:

image

image

image

I'm going to go ahead and implement this now, since it's a trivial thing to implement compared to a slider for relative scaling; I'll do the latter too of course.

JobLeonard commented 7 years ago

We almost have bin-by-stringname working:

image

But there's still a few weird bugs:

image

JobLeonard commented 7 years ago

Fixed! And in the process fixed a few other sneaky bugs with the sprites and colouring:

image

JobLeonard commented 7 years ago

I've changed the scatterplot so that it now always sorts by x- and y-axis. This was needed in order for it to work with jittering. If you think it's better without, I can make this an optional setting in the side-panel instead.

image

image

It also now paints the "other" values first, then the values that actually have a category (or are non-zero, in the case of heatmap). This was a request by Lars. Again, this can be made a setting that can be turned on and off.