linnarsson-lab / loom-viewer

Tool for sharing, browsing and visualizing single-cell data stored in the Loom file format
BSD 2-Clause "Simplified" License

Error diffusion and/or stacked categories for sparkline plots #101

Closed JobLeonard closed 7 years ago

JobLeonard commented 7 years ago

Ok, so I've been thinking about this in my spare time, and I'm writing down these ideas before I forget them again. It's not a high-priority issue at all.

This is a little pet peeve I have.

Right now, depending on the sorting of the data, categories may be completely hidden away:

(screenshots: sparklines where one category is completely hidden)

The reason is that we calculate the "category winner" per bucket (a vertical line), and in every bucket, however many zeroes are present, there are fewer of them than ones.

This is actually a common problem in plotting statistics: the rounding effects of binning.

So I was thinking: maybe dithering can help here.

Of course, standard dithering is about representing continuous values with a smaller range of digital values (for example, 256 shades of gray converted to black and white), relying on the surrounding context to average those values out. Here, however, we are dealing with a group of categories per vertical line, each with a count (given m+n datapoints for a line: m times 1, n times 0), and we select one of those categories as the "winner". There is no contextual blurring to apply here.

However, what I'm thinking of is more based on how dithering is done: error diffusion. If we can "spread out" the rounding error across categories, we might get a more accurate picture.

One-dimensional error diffusion

The simplest form of the algorithm scans the image one row at a time and one pixel at a time. The current pixel is compared to a half-gray value. If it is above the value a white pixel is generated in the resulting image. If the pixel is below the half way brightness, a black pixel is generated. The generated pixel is either full bright, or full black, so there is an error in the image. The error is then added to the next pixel in the image and the process repeats.

https://en.wikipedia.org/wiki/Error_diffusion
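For reference, a minimal sketch of that one-dimensional error diffusion in code (purely illustrative, not part of loom-viewer): threshold each grayscale value, and carry the rounding error over to the next pixel.

```typescript
// One-dimensional error diffusion: threshold each grayscale value (0..255)
// to black or white, and push the rounding error onto the next pixel.
function errorDiffuse1D(row: number[]): number[] {
  const out: number[] = [];
  let error = 0;
  for (const value of row) {
    const corrected = value + error;             // value plus carried-over error
    const quantized = corrected < 128 ? 0 : 255; // pick black or white
    error = corrected - quantized;               // remember what we rounded away
    out.push(quantized);
  }
  return out;
}
```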

We could try keeping track of the error per category, and then select the winner based on the sum of bias + actual number of values in the bin (for multiple categories, just keep track of individual bias per category).
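A rough sketch of what that could look like (hypothetical names, not the actual plotter code): keep a running bias per category, pick each bin's winner on count plus bias, and let the losing categories carry their counts forward while the winner's bias is considered spent.

```typescript
// Hypothetical sketch: pick one category per bin, but carry a per-category
// bias so that categories that keep "losing" bins eventually get painted.
type Counts = Map<string, number>; // category -> count within one bin

function pickWinners(bins: Counts[]): string[] {
  const bias = new Map<string, number>();
  const winners: string[] = [];
  for (const counts of bins) {
    let winner = '';
    let best = -Infinity;
    let bestBias = -Infinity;
    for (const [category, count] of counts) {
      const catBias = bias.get(category) ?? 0;
      const score = count + catBias;
      // Ties go to the category with the strongest accumulated bias.
      if (score > best || (score === best && catBias > bestBias)) {
        best = score;
        bestBias = catBias;
        winner = category;
      }
    }
    // Losing categories carry their counts forward as bias;
    // the winner's accumulated bias is considered spent.
    for (const [category, count] of counts) {
      if (category === winner) {
        bias.set(category, 0);
      } else {
        bias.set(category, (bias.get(category) ?? 0) + count);
      }
    }
    winners.push(winner);
  }
  return winners;
}
```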

Let's take the image above as an example. Imagine bins of ten values, with each value either zero or one, and the overwhelming majority of values is one. After going through many bins with only ones, we come across a bin with 6 ones and 4 zeroes. We plot the colour for one, and keep track of the rounding error as a bias for the next bins (so +4 for zeroes). Somewhat later, we come across another mixed bin: 8 ones, 2 zeroes. Even with the added bias, ones win, so we select it, increase the zero-bias to +6 and move on. From now on, we'll only see bins with 9 ones and 1 zero. Category one will keep winning until the zero-bias reaches 8 and it's a tie (in which case we'll favour the strongest bias).

Now this would be very misleading: the bin that ends up painted as zero has the fewest zeroes of all the bins. There is also no indication of where this bin is placed - imagine all the mixed bins are early on in the array, but the one that gets painted is all the way at the end.

So an approach this simple would also create false images. There must be a smarter solution for this, but I'm not quite there yet. Again, will think about this in my spare time.

Another option would be to use a stacked graph, which is much easier to implement:

(example of a stacked bar graph)

http://www.betterevaluation.org/en/evaluation-options/stacked_graph

(imagine no spacing between the bars)

So what we'd do is: after binning, we sort the values, then draw them vertically on top of each other. Let's say we have 10 datapoints per bin and 20 pixels of height. If a bin has 3 zeroes and 7 ones, we'd draw the 3 zeroes as a vertical line six pixels long at the top, and the 7 ones as a 14-pixel line below it. And so on.
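A minimal sketch of that drawing step, assuming a canvas 2D context and a categoryColor lookup that are not the actual loom-viewer internals:

```typescript
// Hypothetical sketch: draw one bin as a stack of coloured segments,
// each category's height proportional to its share of the bin.
function drawStackedBin(
  ctx: CanvasRenderingContext2D,
  counts: Map<string, number>,          // category -> count within this bin
  categoryColor: (c: string) => string, // assumed colour lookup
  x: number, width: number, height: number
): void {
  const total = [...counts.values()].reduce((a, b) => a + b, 0);
  if (total === 0) { return; }
  // Sort by category so the stacking order is stable from bin to bin.
  const sorted = [...counts.entries()].sort((a, b) => a[0].localeCompare(b[0]));
  let y = 0;
  for (const [category, count] of sorted) {
    const segmentHeight = (count / total) * height;
    ctx.fillStyle = categoryColor(category);
    ctx.fillRect(x, y, width, segmentHeight);
    y += segmentHeight;
  }
}
```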

The problem with this could be that it's computationally quite expensive to change colours all the time and draw so many small, individual lines - a lot of my optimisations for speed revolve around avoiding both. For some datasets that have a lot of categorical attributes this can get very slow (the 45k Forebrain set is a good example, with 36 metadata attributes that default to categorical). But we'll have to see how this works out in practice.

JobLeonard commented 7 years ago

Stacked categories have been implemented, and to my surprise it's not all that heavy on the CPU at all!

(before and after screenshots: winner-takes-all vs. stacked)

Notice how cluster 27 previously gave the false impression of being the only category on the right side.

A more subtle but in a way more significant example:

(before and after screenshots)

Before, excluded just wasn't visible at all. Now it's subtly visible along the top.

For now you can cycle between "winner takes all" and "stacked" categorical plotting. If the latter turns out to be preferred I'll get rid of the former: it may be easier on the eyes, but it does lead to misleading images.

JobLeonard commented 7 years ago

OMG, compare these two shots:

(two comparison screenshots)