linnarsson-lab / loom-viewer

Tool for sharing, browsing and visualizing single-cell data stored in the Loom file format
BSD 2-Clause "Simplified" License

Use (try?) Largest-Triangle-Three-Buckets downsampling algorithm for sparklines #58

Closed JobLeonard closed 7 years ago

JobLeonard commented 7 years ago

Turns out someone wrote an entire master's thesis on a downsampling algorithm optimised for visual importance:

http://flot.base.is/
https://github.com/joshcarr/largest-triangle-three-buckets.js
http://skemman.is/handle/1946/15343

While the algorithm is designed for line-charts, I'd be quite surprised if the same principle wouldn't hold for bar charts.

... I wonder if one could extend this to two dimensions, to better downsample the heat map? Or, you know, any visual image.

JobLeonard commented 7 years ago

Turns out the algorithm is surprisingly simple, I dare say elegant even:

[image: the algorithm, captured from page 22 of Sveinn Steinarsson's master's thesis]

This is kind of a moving-average approach, which I guess makes sense in the original context of time-variant data. However, this doesn't apply to us, since we're averaging across cells that could be sorted in any order. Steinarsson also describes a dynamic bin-size approach, but that again is not relevant since we are not downsampling to points on lines, but to columns with a width of one pixel.
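For reference, the core bucket loop looks roughly like this for a plain array of values (x = the index); this is a sketch adapted from joshcarr's reference implementation, with made-up names, not code from the viewer:

```js
// Sketch of the LTTB bucket loop for a plain array of values (x = index).
function lttb(values, threshold) {
	const n = values.length;
	if (threshold >= n || threshold < 3) {
		return values.slice(); // nothing to downsample
	}
	const sampled = new Array(threshold);
	const bucketSize = (n - 2) / (threshold - 2);

	// always keep the first point
	sampled[0] = values[0];
	let aIndex = 0; // index of the previously selected point (A)

	for (let i = 0; i < threshold - 2; i++) {
		const bucketStart = Math.floor(i * bucketSize) + 1;
		const bucketEnd = Math.floor((i + 1) * bucketSize) + 1;

		// C: average of the next bucket (the remaining points for the last bucket)
		const nextStart = bucketEnd;
		const nextEnd = Math.min(Math.floor((i + 2) * bucketSize) + 1, n);
		let avgX = 0, avgY = 0;
		for (let j = nextStart; j < nextEnd; j++) {
			avgX += j;
			avgY += values[j];
		}
		avgX /= (nextEnd - nextStart);
		avgY /= (nextEnd - nextStart);

		// B: the point in the current bucket forming the largest triangle
		// with A (previously selected point) and C (average of next bucket)
		const aX = aIndex, aY = values[aIndex];
		let maxArea = -1, maxIndex = bucketStart;
		for (let j = bucketStart; j < bucketEnd; j++) {
			const area = Math.abs(
				(aX - avgX) * (values[j] - aY) - (aX - j) * (avgY - aY)
			) * 0.5;
			if (area > maxArea) {
				maxArea = area;
				maxIndex = j;
			}
		}
		sampled[i + 1] = values[maxIndex];
		aIndex = maxIndex;
	}

	// always keep the last point
	sampled[threshold - 1] = values[n - 1];
	return sampled;
}
```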

So what I suggest doing instead is the following, more symmetrical approach:

The only difference from our current approach is that we add the last step, since we already do the first. Also, to answer my own question:

... I wonder if one could extend this to two dimensions, to better downsample the heat map? Or, you know, any visual image.

I think the answer is "yes". Just replace "triangle area" with "pyramid volume" and we are done! Whether or not the results are useful remains to be seen, of course.

JobLeonard commented 7 years ago

BTW, this is how I would implement it for the heatmap, if we want to try it. I'll do my best to describe it properly in words.

TL;DR: rewrite as a bottom-up divide and conquer method.

We start at the most zoomed-in level; let's call this z0. Each jump zooms out by a factor of two in both dimensions, so a z1 tile represents a 2x2 area of underlying data points, a z2 tile a 4x4 area, a z3 tile an 8x8 area, etc.

We calculate z1 from z0 as expected: for each tile, select the visually most important data point of the 2x2 area it captures, using the method described below. Now here's the important bit: we also save the average of the underlying values (which we had to calculate anyway to rank the data points).

(side-note: calculate this average by multiplying by 0.25, not by dividing by 4. I'm not joking, it's faster)

Now, for z2, we don't calculate the values from z0 but from z1: take a 2x2 area of z1 tiles and select the visually most important data point in almost the same way as before. The only difference is that we calculate the mean values by averaging the averages stored in each tile. And then we repeat this: use z2 as the basis for z3, z3 as the basis for z4, etc.

Since all averages represent an equal number of underlying data points, averaging the averages is the same as averaging all underlying data points at the z0 level¹.

As for the visually most important data point, the assumption we make here is that it would be one of the previously selected data points anyway. Even if that isn't the case, a benefit is that zooming in and out of tiles will be more consistent, since this method guarantees that any data point seen in a zoomed-out tile is also present in all the more zoomed-in tiles below it.
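To make the recursion concrete, here's a rough sketch of one zoom-out step. It assumes each tile is stored as `{ value, avg }` in a plain 2D array; the names, and the `Math.abs(value - avg)` stand-in for "visually most important" (see the section on keeping the algorithm simple in 2D below), are made up for illustration, not actual viewer code:

```js
// Rough sketch of one zoom-out step: build z(k+1) tiles from a 2D array of
// zk tiles, where each tile stores the selected data point (`value`) and the
// average of everything it covers (`avg`).
function zoomOut(tiles) {
	const rows = tiles.length, cols = tiles[0].length;
	const outRows = Math.ceil(rows / 2), outCols = Math.ceil(cols / 2);
	const out = [];
	for (let y = 0; y < outRows; y++) {
		const row = [];
		for (let x = 0; x < outCols; x++) {
			// gather the (up to) 2x2 block of tiles underneath
			const block = [];
			for (let dy = 0; dy < 2; dy++) {
				for (let dx = 0; dx < 2; dx++) {
					const t = tiles[2 * y + dy] && tiles[2 * y + dy][2 * x + dx];
					if (t) { block.push(t); }
				}
			}
			// average of averages; multiply by 0.25 instead of dividing by 4
			let avg = 0;
			for (let i = 0; i < block.length; i++) { avg += block[i].avg; }
			avg = block.length === 4 ? avg * 0.25 : avg / block.length;
			// keep the previously selected point that differs most from avg
			let value = block[0].value, maxDiff = -1;
			for (let i = 0; i < block.length; i++) {
				const diff = Math.abs(block[i].value - avg);
				if (diff > maxDiff) { maxDiff = diff; value = block[i].value; }
			}
			row.push({ value, avg });
		}
		out.push(row);
	}
	return out;
}
```

The same function also covers the z0 → z1 step if the raw matrix is first wrapped as tiles with `value` and `avg` both set to the raw data point.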

So, the net result of all this is that we halve the number of averages and data points to compare at each zoom level in each dimension. So instead of our algorithm costing z * r * c (where z = number of zoom levels, r = rows and c = columns), each level only has to process the tiles of the level below it, giving a total of roughly (1 + 1/4 + 1/16 + ...) * r * c, which converges to 4/3 * r * c no matter how many zoom levels we add. Should be survivable, no?

[1]: Since the dimensions of our heatmap are probably not a perfect power of 2, our areas end up not being exact fits at the edges. If we want to compensate for this, we could use a weighted average, where the weights are the total area each average represents at the most zoomed-in level; keeping track of this is a simple matter of adding up the weights every time we zoom out. I'd ignore this: it only affects the pixels at the edges, and it would both complicate and slow down the algorithm a lot.
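A quick sanity check of that cost claim, with a made-up 1000x1000 heatmap and ten zoom levels (a throwaway snippet, not part of the viewer):

```js
// Back-of-the-envelope check: sum the per-level work, where each zoom level
// processes the tiles of the level below it, i.e. a quarter as many as before.
const r = 1000, c = 1000;
let work = 0, tiles = r * c;
for (let z = 0; z < 10; z++) {
	work += tiles;              // tiles touched to build this level
	tiles = Math.ceil(tiles / 4); // next level has a quarter as many tiles
}
console.log(work / (r * c));  // ~1.333, i.e. approaching the 4/3 * r * c total
```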

Keeping the algorithm simple in 2D

So, going from calculating a triangle area to calculating the captured volume turns out to be trickier than I thought: we're no longer dealing with a simple shape but with an irregular convex hull. However, we can avoid all the complicated calculations that would bring.

First, I'd like to draw attention to the following:

Given points A, B and C, where A is the previously selected point, B a candidate point in the current bucket, and C the average of the next bucket:

The above algorithm tells us to select the B for which the triangle ABC has the largest area. However, the illustration in the thesis is a bit misleading for our purposes: we should set the x-positions of A, B and C to the middle of their respective bins, and the y-positions equal to their respective values. With equally spaced x-positions, the triangle's area works out to be the bin width times |B - (A+C)/2|. But that means this is equivalent to finding the B for which |B - (A+C)/2| is largest!

In other words: we want to find the B with the biggest difference from the mean of the surrounding means. And this metric remains simple to calculate when extended to 2D (and higher dimensions). So let's try that first.
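For the sparklines this boils down to something like the following sketch (made-up names, assuming at least one value per bucket); in 2D, the "surrounding means" simply become the averages of the neighbouring tiles:

```js
// Sketch of the symmetrical selection: for every bucket (one per pixel
// column), pick the value that differs most from the mean of the
// surrounding bucket means. Assumes values.length >= buckets.
function symmetricLTTB(values, buckets) {
	const n = values.length;
	const bucketSize = n / buckets;

	// first pass: mean of every bucket
	const means = new Array(buckets);
	for (let i = 0; i < buckets; i++) {
		const start = Math.floor(i * bucketSize);
		const end = Math.min(Math.floor((i + 1) * bucketSize), n);
		let sum = 0;
		for (let j = start; j < end; j++) { sum += values[j]; }
		means[i] = sum / (end - start);
	}

	// second pass: per bucket, the value furthest from the mean of the
	// surrounding means (clamped at the edges)
	const sampled = new Array(buckets);
	for (let i = 0; i < buckets; i++) {
		const left = means[Math.max(i - 1, 0)];
		const right = means[Math.min(i + 1, buckets - 1)];
		const target = (left + right) * 0.5;
		const start = Math.floor(i * bucketSize);
		const end = Math.min(Math.floor((i + 1) * bucketSize), n);
		let best = values[start], maxDiff = -1;
		for (let j = start; j < end; j++) {
			const diff = Math.abs(values[j] - target);
			if (diff > maxDiff) { maxDiff = diff; best = values[j]; }
		}
		sampled[i] = best;
	}
	return sampled;
}
```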

JobLeonard commented 7 years ago

Visual comparison.

Remember to click for large versions

Mean values: [image]

Largest Triangle Three Buckets, asymmetrical: [image]

Largest Triangle Three Buckets, symmetrical: [image]

@gioelelm opinions?

JobLeonard commented 7 years ago

One more example. I figured the scheme might have downsides when looking for genes that are, uhm... what's the term, co-expressed? The idea being that if I sort by one particular gene, you'd expect the co-expressed genes to show a similar slope. Since LTTB favours outliers, one would expect the slope to become harder to spot. Eyeball for yourself below.

Means on top, symmetrical LTTB on the bottom: [image]

JobLeonard commented 7 years ago

I've just run this by Josje, Nathan and Lars; they seem to prefer the LTTB plot, especially once proper zooming is implemented. So I'll just keep the LTTB version and close this issue for now.

Still, given the behaviour of the algorithm, I think it's important that we make sure people are aware of how to interpret this type of downsampling. Also, we may want to revisit this once we hit really large data-sets; perhaps we should offer both mean sampling and LTTB sampling as options for the user (though I prefer to avoid adding too many options, to avoid overwhelming people with unnecessary knobs and buttons).

JobLeonard commented 7 years ago

I discovered a (sadly, quite common) edge-case with the algorithm: zooming out very far on dense data with high variability.

First, an example of LTTB working as we want it to, highlighting the outliers as we zoom out: [images]

Now for dense datasets (that is, few zero values). I'll start at 1:1 and zoom out one step at a time.

[images: successive zoom-out steps]

Well, that didn't quite go as hoped for... After thinking about it a bit, the reason this happens becomes obvious:

  1. LTTB always selects the value with the biggest difference from the mean of the next bucket and the previously selected value. This will always be either the maximum or the minimum value in a bucket.
  2. Because it looks at the previously selected value, it tends to alternate between minima and maxima.
  3. Together, the previous two points mean that for certain datasets we simply end up with zebra stripes.

So sometimes mean values give more useful information... but I don't want to switch back to that either because of the sparser datasets. So we need some kind of hybrid option.

What I tried above is displaying both mean values and LTTB, by including the mean values of a bucket as grey points. When we're zoomed in 1:1 or further, they just lie on top of the charts. When zooming out, you can see how the mean values differ from the selected bar value. Especially in the last few pictures the effect is quite notable.
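Per pixel column, the overlay amounts to something like this (a sketch with invented names and canvas details, not the actual drawing code):

```js
// Sketch of the hybrid display: per bucket, draw the LTTB-selected value as
// the bar and the bucket mean as a small grey marker on top of it.
function drawHybridColumn(context, x, barWidth, height, selected, mean, maxValue) {
	// bar from the LTTB-selected value
	const barHeight = (selected / maxValue) * height;
	context.fillStyle = '#000';
	context.fillRect(x, height - barHeight, barWidth, barHeight);
	// grey marker at the bucket mean; at 1:1 zoom it lies on top of the bar
	const meanY = height - (mean / maxValue) * height;
	context.fillStyle = '#888';
	context.fillRect(x, meanY - 1, barWidth, 2);
}
```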

In the following image we see another example of how this helps: [image]

We can see in the vertical bar chart that the values are one almost throughout, but zero in a few areas. Places where a few values in the bucket are not one can be identified by the lower mean value.

Here is how the added mean values look in the sparklines view:

[image]

I think it is still very readable, once you know how the chart works. The problem is that this is non-standard, so we will have to explain it. Then again, this was true for LTTB already anyway.

@slinnarsson, @simone-codeluppi, @gioelelm, is this a good addition, and is it readable as is, or would you prefer further visual tweaks to it?

(PS: thinking about this even further, the algorithm might actually be interesting when applied to many dimensions, comparing Euclidean distance. In that case it will select the point on the convex hull of the dataset that is furthest away from the calculated mean value.)
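For what it's worth, a sketch of that many-dimensional selection step (assuming points are plain arrays of equal length; names are made up):

```js
// Sketch of the many-dimensional variant: pick the point with the largest
// Euclidean distance from the mean of a group of points. Such a point
// necessarily lies on the group's convex hull.
function furthestFromMean(points) {
	const dims = points[0].length;
	// component-wise mean
	const mean = new Array(dims).fill(0);
	for (const p of points) {
		for (let d = 0; d < dims; d++) { mean[d] += p[d]; }
	}
	for (let d = 0; d < dims; d++) { mean[d] /= points.length; }
	// point with the largest (squared) Euclidean distance from the mean
	let best = points[0], maxDist = -1;
	for (const p of points) {
		let dist = 0;
		for (let d = 0; d < dims; d++) {
			const diff = p[d] - mean[d];
			dist += diff * diff;
		}
		if (dist > maxDist) { maxDist = dist; best = p; }
	}
	return best;
}
```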