linnarsson-lab / loom-viewer

Tool for sharing, browsing and visualizing single-cell data stored in the Loom file format
BSD 2-Clause "Simplified" License

Sparkline heatmap: normalise genes, allow for clamping range #108

Closed JobLeonard closed 7 years ago

JobLeonard commented 7 years ago

Requested by Amit

JobLeonard commented 7 years ago

Don't laugh, it's been a while since I did this kind of stuff :P http://www.d.umn.edu/~deoka001/Normalization.html

slinnarsson commented 7 years ago

Another request:

Instead of two heatmap color ranges, replace one with a binary option (e.g. white/red) to more clearly show zeros and non-zeros.

/Sten



JobLeonard commented 7 years ago

I think we will get that for "free" once we introduce clamping: just set the upper bound of the clamping range to 0%. Or do I misunderstand the request?
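A minimal sketch of why clamping would give the binary view for free. The function name `clamp_normalise` and the choice to express the bounds as percentages of the per-gene data range are assumptions for illustration, not the viewer's actual code. With the upper bound at 0%, every value above the minimum saturates, so only zeros and non-zeros remain distinguishable.

```python
def clamp_normalise(values, lower_pct=0.0, upper_pct=100.0):
    """Map values to [0, 1] after clamping to a percentage range.

    Assumption: the percentages are relative to the data range
    (min to max) of the gene being plotted.
    """
    lo, hi = min(values), max(values)
    span = hi - lo
    lower = lo + span * lower_pct / 100.0
    upper = lo + span * upper_pct / 100.0
    out = []
    for v in values:
        if upper <= lower:
            # Degenerate range: anything above the bound saturates,
            # which yields a binary (e.g. white/red) picture.
            out.append(1.0 if v > lower else 0.0)
        else:
            # Regular case: linear ramp between the two bounds.
            out.append(min(1.0, max(0.0, (v - lower) / (upper - lower))))
    return out


# Upper bound at 0%: zeros stay 0, every non-zero value becomes 1.
print(clamp_normalise([0, 1, 5, 10], lower_pct=0, upper_pct=0))
```

For data whose minimum is zero this reproduces the zero/non-zero split; if the minimum is not zero, the split is "at the minimum" versus "above the minimum" instead.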

JobLeonard commented 7 years ago

Side-track: flame map and how to do the heat maps.

So while I was implementing the flame map code I noticed an oddity in the heat map code: if the biggest outlier was non-zero, it would display that, and otherwise it would display the mean value. This has the effect of greatly increasing noise in the heat map plots, and suggesting a lot more signal than there really is.

What I think I intended to do was take the average of the two (remember that the algorithm we use for max outlier selects alternating outliers), compromising between noise profile and average.
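To make the difference concrete, here is a rough sketch of the two per-bin rules. All function names are hypothetical, and "biggest outlier" is assumed here to mean the value furthest from the bin mean; the alternating outlier selection mentioned above is left out.

```python
def bin_mean(bin_values):
    """Mean of the cells that fall into one pixel column."""
    return sum(bin_values) / len(bin_values)


def biggest_outlier(bin_values):
    """Value furthest from the bin mean (assumed definition of 'outlier')."""
    m = bin_mean(bin_values)
    return max(bin_values, key=lambda v: abs(v - m))


def heat_value_observed(bin_values):
    """The oddity: show the outlier if it is non-zero, else the mean."""
    o = biggest_outlier(bin_values)
    return o if o != 0 else bin_mean(bin_values)


def heat_value_intended(bin_values):
    """The compromise: average of the outlier and the mean."""
    return (biggest_outlier(bin_values) + bin_mean(bin_values)) / 2


sparse_bin = [0, 0, 0, 8]
print(bin_mean(sparse_bin))             # mean only
print(heat_value_observed(sparse_bin))  # outlier wins outright
print(heat_value_intended(sparse_bin))  # halfway between the two
```

For a sparse bin like the one above, the observed rule reports the single non-zero cell at full strength, while the averaged rule keeps it visible without presenting it as the value of the whole bin.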

I made a comparison overview below. I suggest opening them in new tabs, so you can quickly switch between them for comparison.

I used the Oligos All set, which is 200k cells, so we have around 200 cells per vertical column of pixels (and each column is itself only 40 pixels tall, which shows how easily we can accidentally hide data here).

First is a set sorted by clusters but semi-random otherwise, and then one sorted by total (so one should expect a smoother gradient).

The plots are:

Sorted by clusters

Bars sort by clusters - bars

Flame map sort by clusters - flame

Heat map (average of greatest outlier + mean value) sort by clusters - heatmap outlier

Heat map (mean value) sort by clusters - heatmap means

Sorted by total

Bars sort by total - bars

Flame map sort by total - flame

Heat map (average of greatest outlier + mean value) sort by total - heatmap outlier

Heat map (mean value) sort by total - heatmap means

I think it's safe to say that the mean-value-only heat map smooths things out too much, and flame maps are better in every way compared to that.

However, when data is sparse and there are only a few non-zero values, flame maps seem to have a similar problem of averaging away the non-zero values (see the upper six of the shown genes). The heat map that averages the outlier and the mean value is still useful in that situation, although it gives the false impression of a bigger signal than there really is.

So I'm thinking of mixing the two: underneath the flame map, I will put a thin strip of heatmap. It's pretty clear which is which, but it will help spot the outliers a lot more. Here is a mock-up:

sort by clusters - flame-heatmap hybrid

Again, feedback is requested!

(I'm also wondering if the bar plots are really giving the proper impression of the data (funny how the flame map helps me reach that conclusion here))

JobLeonard commented 7 years ago

Problem: due to rounding, we can have two different bin sizes. For large datasets (like the Oligos one) this isn't really an issue - 100 or 101 cells doesn't make a big difference. For small datasets it leads to situations like this:

image

Some bins have two cells, some bins have three. This is quite a significant difference in vertical height, as well as signal.

I think the best solution is to always plot the vertical graph as if it was the biggest bin size (in this case, as if there are three cells in it - effectively padding bins with two cells with a zero value).
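A small sketch of the idea, assuming cells are split into columns by integer division of evenly spaced positions (the helper names `split_into_bins` and `pad_bins` are made up for illustration): bin sizes then differ by at most one, and padding the shorter bins with zeros gives every column the same height.

```python
def split_into_bins(values, n_bins):
    """Split values into n_bins consecutive bins whose sizes differ by at most 1."""
    n = len(values)
    # Integer boundaries; rounding down is what produces the two bin sizes.
    edges = [(i * n) // n_bins for i in range(n_bins + 1)]
    return [values[edges[i]:edges[i + 1]] for i in range(n_bins)]


def pad_bins(bins, fill=0):
    """Pad every bin with `fill` so all bins match the largest bin size."""
    widest = max(len(b) for b in bins)
    return [b + [fill] * (widest - len(b)) for b in bins]


cells = [1, 2, 3, 4, 5, 6, 7]       # 7 cells over 3 pixel columns
bins = split_into_bins(cells, 3)    # sizes 2, 2 and 3
print(pad_bins(bins))               # every column now holds 3 entries
```

Padding with zeros keeps the vertical scale identical across columns, at the cost of drawing an empty slot in the columns that hold one cell fewer.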

JobLeonard commented 7 years ago

The results of fixing the bin size look a bit funky, but in a way they also make it clearer what the issue is.

What the data looks like for a small dataset, showing an attribute that we also sort by (so a smooth gradient should be the result), at various zoom levels:

1 or 2 cells per column: image

2 or 3: image

4 or 5: image

A slightly larger dataset (6000+ cells, so 4 or 5 cells per column), sorted by k-means as primary key, log-cv as secondary, and log-mean as tertiary:

image

Forebrain set (44k cells, so somewhere around 25 cells per bin) image

For large enough datasets the problem simply vanishes, because we reach the point where there are multiple cells per pixel: image

JobLeonard commented 7 years ago

Gioelle suggested that flame maps might also be useful for smaller datasets when printing, since those plots are often only a few centimetres across. This photo from a page in Erik Smedler's PhD thesis provides a quick example (although the time-vs-activity plots themselves are not directly comparable, obviously):

photo

So as a quick test, I opened the cortex.loom set (3005 cells), resized the browser window, and took screenshots comparing flame maps to bar plots and heatmaps. Since the latter two are using the LTTB outlier-selection algorithm in our browser I also made a quick temporary version of Loom where they display the mean values instead, to see how their smoother output compares to the flame graph:

image

flame map test 2

JobLeonard commented 7 years ago

@gioelelm suggested that we log-transform the data before determining the heatmap scale (specifically, log2(1 + value - minValue)). Below are some quick tests; it should be obvious which are with the projection and which are without it:

screenshot_20170608_143727 screenshot_20170608_143721 screenshot_20170608_143708 screenshot_20170608_143659 screenshot_20170608_144914 screenshot_20170608_144918
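For reference, a minimal sketch of the projection log2(1 + value - minValue) followed by a linear mapping to a 0..1 color scale. The function names and the `log_scale` flag are assumptions for illustration; only the formula itself comes from the suggestion above.

```python
import math


def log_project(values):
    """Apply log2(1 + value - minValue) to every value."""
    min_value = min(values)
    return [math.log2(1 + v - min_value) for v in values]


def heatmap_scale(values, log_scale=True):
    """Map values to [0, 1], optionally after the log projection."""
    projected = log_project(values) if log_scale else list(values)
    lo, hi = min(projected), max(projected)
    if hi == lo:
        # Constant data: nothing to distinguish, use the bottom of the scale.
        return [0.0 for _ in projected]
    return [(p - lo) / (hi - lo) for p in projected]


counts = [0, 1, 3, 7]
print(log_project(counts))                     # [0.0, 1.0, 2.0, 3.0]
print(heatmap_scale(counts, log_scale=False))  # low values crowd near 0
print(heatmap_scale(counts, log_scale=True))   # low values spread out
```

Subtracting the minimum first keeps the argument of the logarithm at 1 or above, so the projected values start at 0, and low counts get more of the color range than they would on a linear scale.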

Anyway, Gio was pretty happy with it. The only thing that needs to be decided is whether we want log-scale to be on or off by default (Gio wants it on by default).