JobLeonard commented 7 years ago

img_20170524_122438_dro-01

img_20170524_145004-01

We have a lot of different cases to consider:

When using string values, we treat them as categories, so we need to label each category.
When using numerical values, we have a scale from min to max, and it makes no sense to label each unique value:
- what do we set as the min and maximum values on the axis?
- how many 'ticks' do we use and how do we space them?
- log2 support and spacing of ticks
- turning jitter on or off should not rescale the scatterplot (which happens at the moment)
For heatmap colouring, we should have a tiny legend that shows the minimum and maximum value on the heatmap, since we auto-adjust the colours to scale with the minimum/maximum values of the input.
- we could have two identical-looking heatmaps, one showing values from 1 to 10 and the other from 100 to 1000. The difference should be clear to the viewer.
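
To illustrate why the heatmap legend matters, here is a minimal sketch. It assumes colours come from linearly rescaling the values to the 0..1 range; the function name `normalise_for_heatmap` is made up for this example and is not taken from the viewer code.

```python
def normalise_for_heatmap(values):
    """Rescale values to 0..1 and return them with the (min, max) legend."""
    lo, hi = min(values), max(values)
    span = hi - lo
    # A constant dataset has no spread, so map everything to 0.0
    scaled = [(v - lo) / span if span else 0.0 for v in values]
    return scaled, (lo, hi)


small, small_legend = normalise_for_heatmap([1, 4, 10])
large, large_legend = normalise_for_heatmap([100, 400, 1000])

# Both datasets produce the same colours; only the legend tells them apart.
print(small == large)             # True
print(small_legend, large_legend)  # (1, 10) (100, 1000)
```

Without the legend, the two plots above would be indistinguishable.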

JobLeonard commented 7 years ago

Marks for axes need to automatically scale for numerical data, and we need sensible defaults.

Base "formula"

I think this works as a starting point:

Space ticks in multiples of one of { 1, 2, 5 } * 10^n, where n is chosen such that the total number of ticks is
- never fewer than 3 (exception: fewer than three unique values),
- never more than 10

If multiple answers fit the above criteria, choose the one with the most ticks.

Example:

min value: 0, max value: 8, delta: 8. Correct answers include:

1 * 10^0 = 1 (8 ticks)
2 * 10^0 = 2 (4 ticks)

In this case, ticks should be spaced one apart.
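
A minimal sketch of this selection rule, under a few assumptions: ticks are counted as delta / step (rounded down), which matches the "8 ticks" and "4 ticks" in the example; only three powers of ten around the size of delta are searched; and the name `choose_tick_step` is hypothetical. The exception for fewer than three unique values is not handled.

```python
import math


def choose_tick_step(vmin, vmax, min_ticks=3, max_ticks=10):
    """Pick a step from {1, 2, 5} * 10^n giving the most ticks within bounds.

    Returns (step, tick_count), or None if the range is empty.
    """
    delta = vmax - vmin
    if delta <= 0:
        return None
    # Exponent of the largest power of ten that does not exceed delta
    top = math.floor(math.log10(delta))
    best = None
    for n in range(top - 2, top + 1):
        for multiplier in (1, 2, 5):
            step = multiplier * 10.0 ** n
            # Small epsilon guards against float error in the division
            count = math.floor(delta / step + 1e-9)
            if min_ticks <= count <= max_ticks:
                if best is None or count > best[1]:
                    best = (step, count)
    return best


print(choose_tick_step(0, 8))    # (1.0, 8)
print(choose_tick_step(0, 100))  # (10.0, 10)
```

The `min_ticks` and `max_ticks` parameters are there so the bounds could later depend on the plot size.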

Refinements

We might want to make "never less than 3, never more than 10" scale with available pixels (so for a big plot, allow for more ticks).

We should think of significant digits: if we only have the values 1, 2, 3 and 4 in our dataset, there's no point in showing ticks every 0.5 points.

We allow for log2 scaling. This might require some special logic, but perhaps a simple projection suffices.
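
One possible reading of "a simple projection", as a rough sketch: project the range into log2 space, place ticks at whole exponents there, and label each with the original value 2^k. This assumes strictly positive values (how zeros are handled is left open), and `log2_ticks` is a name invented for this example.

```python
import math


def log2_ticks(vmin, vmax):
    """Return (label_value, projected_position) pairs at powers of two."""
    if vmin <= 0 or vmax < vmin:
        raise ValueError("expected 0 < vmin <= vmax")
    lowest = math.ceil(math.log2(vmin))
    highest = math.floor(math.log2(vmax))
    # In log2 space the ticks are evenly spaced at integer positions
    return [(2.0 ** k, k) for k in range(lowest, highest + 1)]


print(log2_ticks(1, 8))  # [(1.0, 0), (2.0, 1), (4.0, 2), (8.0, 3)]
```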

slinnarsson commented 7 years ago

http://vis.stanford.edu/files/2010-TickLabels-InfoVis.pdf

JobLeonard commented 7 years ago

Aaah, thanks! I hadn't thought of looking for papers on the subject. Looks interesting, although like any good maths paper they introduce formulas without defining what the terms are. Oh well, nothing a little trial and error won't fix.

Moving orders of magnitude to the labels is a good idea, will be stealing that.

The article also mentions that they do not know of any research that has been done into what type of numbers are "nice". However, I know of at least one example.

See, the reason I went with 1, 2 and 5 was because I recall reading an article in the early 2000s about the research behind the creation of the Euro coin. Annoyingly, I can't find it now. Anyway, IIRC the choice to scale bank notes in this order had mathematical grounding:

It scales nicely: 0.5 is half of 1, which is half of 2, which is 2/5ths of 5, which is half of 10. So each step is always a jump by a factor of 2 or 2.5.
For humans it is relatively easy to do mental arithmetic with these numbers (partly because of the previous property).
Because of its fairly even growth rate, the scaling is near optimal for requiring the fewest coins and notes to pay any given amount of money (I bet the actual optimal scaling would involve Euler's number somehow, but that of course is anything but easy to do mental arithmetic with).

While the last property is not directly relevant for us (but interesting nonetheless), the first two suggest that this is a good scale to use for ticks.

(Also, I have to say I'm more than a little frustrated with how easily they decide what is "more" or "less" nice (4.1 on page three) after they state that there is no objective research into what makes a number nice, and then refer to numbers in a table without including plots with various settings for subjective comparison as an appendix - that kind of assumption that an "objective" formula is more "true" than doing proper user testing drives me up the wall)

JobLeonard commented 7 years ago

My crappy sketches that helped me think of all the things that need to be included

img_20170524_122438_dro-01

img_20170524_145004-01

Things that need to be calculated

font sizes, font angle
distance of axes to the canvas border
position of x and y labels
position of min/max values
padding?
tick spacing
number of ticks
- whether to skip labelling for ticks surrounding min/max values
gridlines (very subtle grey lines, can be two or four times as many as ticks)
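
For the gridlines, a small sketch of subdividing the tick spacing, assuming the tick step is already known and the range is a whole multiple of it. The name `gridline_positions` and its `per_tick` parameter are hypothetical.

```python
def gridline_positions(vmin, vmax, tick_step, per_tick=2):
    """Return gridline positions with `per_tick` gridlines per tick interval."""
    grid_step = tick_step / per_tick
    count = int(round((vmax - vmin) / grid_step))
    # Multiply by the index instead of accumulating, to limit float drift
    return [vmin + i * grid_step for i in range(count + 1)]


# Ticks every 2 from 0 to 8, with twice as many gridlines as ticks
print(gridline_positions(0, 8, 2))
```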

JobLeonard commented 7 years ago

I realised today that I'm doing this all backwards: the axes are the hardest part, so do the labels and heatmap scale first, which are much easier and will still be useful without the axes:

Uploaded to the server too.

JobLeonard commented 6 years ago

New addition: labels on the clusters (only sensible in categorical mode).

Would require

averaging x/y position for each unique value in the colour attribute
displaying the label of each value at its averaged position

Not too complicated actually.
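
The two steps could look roughly like this. It is only a sketch, assuming the points arrive as parallel lists of x, y and category values; `label_positions` is a made-up name, and drawing the labels is left out.

```python
from collections import defaultdict


def label_positions(xs, ys, categories):
    """Average the x/y position of the points in each category."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # sum of x, sum of y, count
    for x, y, category in zip(xs, ys, categories):
        entry = sums[category]
        entry[0] += x
        entry[1] += y
        entry[2] += 1
    return {c: (sx / n, sy / n) for c, (sx, sy, n) in sums.items()}


positions = label_positions([0, 2, 10], [0, 4, 10], ["a", "a", "b"])
print(positions)  # {'a': (1.0, 2.0), 'b': (10.0, 10.0)}
```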

JobLeonard commented 6 years ago

Well, I got labels to work, but I still have to make sure they don't trigger for heatmap data, or when there are more than (say) 1000 unique values:

linnarsson-lab / loom-viewer

Axis and other scatterplot labels #102
