observablehq / plot

A concise API for exploratory data visualization implementing a layered grammar of graphics
https://observablehq.com/plot/
ISC License
4.3k stars, 175 forks

Default bins too narrow using Plot.binX #417

Closed mkfreeman closed 3 years ago

mkfreeman commented 3 years ago

The default binning in Plot can create rects with a width that is too small to see (making a user -- namely, me -- believe the code is broken). As an example (also see this notebook):

Screen Shot 2021-05-25 at 6 12 33 PM

Compared to the same plot made using ggplot2:

ggplot_example

mbostock commented 3 years ago

Good find. I think we should consider not applying an inset ({inset: 0}) if there are a lot of bins:

untitled-82

We could also cap the number of bins returned by the default d3.thresholdScott based on the width of the chart:

untitled-81
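A minimal sketch of the capping idea, for illustration only. The helpers `scottBinCount` and `cappedBinCount`, and the `minBinWidth` parameter, are hypothetical names and not part of Plot or D3; the Scott count is re-implemented by hand here (bin width h = 3.49 · sd · n^(-1/3)) rather than calling d3.thresholdScott, so the exact constant may differ from D3's.

```javascript
// Hypothetical helper: bin count from Scott's normal reference rule.
function scottBinCount(values) {
  const n = values.length;
  let min = Infinity, max = -Infinity, sum = 0;
  for (const v of values) {
    if (v < min) min = v;
    if (v > max) max = v;
    sum += v;
  }
  const mean = sum / n;
  let ss = 0;
  for (const v of values) ss += (v - mean) ** 2;
  const sd = Math.sqrt(ss / (n - 1)); // sample standard deviation
  if (!(sd > 0)) return 1; // degenerate input: a single value, or all values equal
  return Math.ceil(((max - min) * Math.cbrt(n)) / (3.49 * sd));
}

// Hypothetical helper: never return more bins than the chart can show
// at minBinWidth pixels per bin (an assumed threshold, not a Plot default).
function cappedBinCount(values, chartWidth, minBinWidth = 3) {
  const maxBins = Math.max(1, Math.floor(chartWidth / minBinWidth));
  return Math.min(scottBinCount(values), maxBins);
}
```

With a skewed sample and a narrow chart, the cap takes over; on a very wide chart, the statistical rule is returned unchanged.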

Fil commented 3 years ago

Doesn't this mean we have two issues?

a) in some cases the default binning strategy creates too many bins

this might be addressed at the level of d3.thresholdScott (independently of the chart's width), or at the level of Plot.bin, based on the chart's width?

b) insets can "reverse" a rect and make it invisible

the formula is (in rect.js): .attr("width", i => Math.max(0, Math.abs(X2[i] - X1[i]) - this.insetLeft - this.insetRight))

here maybe the minimum width should be more than 0, perhaps 0.5? It would not eliminate all cases, in particular if you have a white stroke and the default fill-then-stroke paint order, but it would mean that a mark however small is never "zero-width".
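As a sketch of what such a floor could look like, isolated from rect.js: `insetWidth` and its `minWidth` parameter are hypothetical names used only for illustration, and the 0.5 default is just the value floated in this comment, not something Plot is known to use.

```javascript
// Hypothetical standalone version of the width computation quoted from rect.js,
// with the lower bound made configurable instead of hard-coded to 0.
function insetWidth(x1, x2, insetLeft = 0, insetRight = 0, minWidth = 0.5) {
  return Math.max(minWidth, Math.abs(x2 - x1) - insetLeft - insetRight);
}
```

A bin that is 0.8px wide with 0.5px insets on each side would then render at 0.5px rather than collapsing to zero; passing `minWidth = 0` reproduces the formula as quoted.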

This is a more general problem: for example, when stacking values, the values that generate a rect that is smaller than 1px can disappear from view if they have a white stroke. Should we decide we want all marks' geometries to be visible as described above, the rendering might still sometimes make them invisible, sometimes deliberately (fillOpacity: 0), sometimes unwittingly (stroke: white). I'm not sure it's possible to fix that, tbh, since we can't make a pixel be at the same time white and black.

It's not only rects: in practice we often have to add a half-pixel to a point's radius so that the colored surface area (after the inner part of the 1px white stroke has been deducted) is proportional to the value. The default r = sqrt(value) is not 100% correct and should be r = sqrt(value) + strokeWidth / 2 if the stroke color is "subtracting matter"; e.g., white on a white background will make points smaller than 0.5px in radius invisible. This is fixable by the user by setting r: {range: [.5, max]}, so that a value of 0 shows 0 color, but a value immediately > 0 shows a bit of color.
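The radius correction can be checked with a little arithmetic. In this sketch, `outerRadius` and `coloredArea` are hypothetical names, and it assumes a stroke centered on the circle's outline whose inner half hides the fill.

```javascript
// Hypothetical helper: radius to request so that the visible fill has radius sqrt(value).
function outerRadius(value, strokeWidth = 1) {
  return Math.sqrt(value) + strokeWidth / 2;
}

// Area of the fill left visible once the inner half of the stroke is deducted.
function coloredArea(value, strokeWidth = 1) {
  const inner = outerRadius(value, strokeWidth) - strokeWidth / 2;
  return Math.PI * inner * inner;
}
```

Under that assumption the colored area is proportional to the value (a value of 4 shows four times the color of a value of 1), and a value of 0 maps to a 0.5px radius that shows no fill at all, which matches the r: {range: [.5, max]} workaround.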

Fil commented 3 years ago

@severo has found a different manifestation of this issue, when you bin on a single value: https://talk.observablehq.com/t/histogram-with-plot-does-not-show-a-bar-for-a-single-value-array/5111/2

Plot.rectY([{ weight: 3 }], Plot.binX({ y: "count" }, { x: "weight" })).plot()

Fil commented 3 years ago

Another difference is that Plot's (and D3's) bins conflate nulls and zeros, resulting in about 35k members in the first bin, whereas ggplot2 counts about 20k members (there are 15061 nulls in the data).
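To illustrate how such a conflation can arise in JavaScript: numeric coercion turns null into 0, so a null lands in the same bin as a true zero unless it is filtered out first. Whether this is the exact mechanism inside D3's bin is an assumption here; `countZeros` is a hypothetical helper written only for this example.

```javascript
// Hypothetical helper: count the values that end up as 0 after numeric coercion.
// With skipNulls, null and undefined are dropped before coercion (closer to ggplot2's count).
function countZeros(values, {skipNulls = false} = {}) {
  const kept = skipNulls ? values.filter((v) => v != null) : values;
  return kept.map(Number).filter((v) => v === 0).length;
}

const sample = [0, null, 2, null, 0, 5];
const naive = countZeros(sample); // nulls are coerced to 0 and counted with the zeros
const filtered = countZeros(sample, {skipNulls: true}); // only the true zeros are counted
```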

Fil commented 3 years ago

Capture d’écran 2021-05-27 à 10 19 23

(For this extremely skewed distribution, d3.thresholdFreedmanDiaconis returns almost 3 times as many bins as there are values!)

mbostock commented 3 years ago

Filed https://github.com/d3/d3-array/issues/203 for the null conflation issue.

Fil commented 3 years ago

That was a productive issue with 3 [Edit: 4!] pull-requests :)

I don't know how to address @severo's example though: Plot.rectY([{ weight: 3 }], Plot.binX({ y: "count" }, { x: "weight" })).plot() creates one bin on the [3, 3] range, and it's hard to give it a width.

severo commented 3 years ago

We might look at how other libraries handle this case

Fil commented 3 years ago

In RStudio it creates a single bar taking the whole width, and with a total extent of 1 (from 2.5 to 3.5).
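A sketch of that behavior, for comparison only: when the extent is degenerate, widen it by half a unit on each side. `padDegenerateExtent` and its `pad` parameter are hypothetical names, and this is not a description of what Plot does.

```javascript
// Hypothetical helper: give a zero-width extent a total width of 2 * pad,
// centered on the single value; leave ordinary extents untouched.
function padDegenerateExtent([min, max], pad = 0.5) {
  return min === max ? [min - pad, max + pad] : [min, max];
}
```

For the [3, 3] range above this yields [2.5, 3.5], which gives the single bin a drawable width.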

Capture d’écran 2021-05-29 à 19 26 10

Fil commented 3 years ago

A fix to @severo's issue is implemented in https://github.com/observablehq/plot/pull/438

mbostock commented 3 years ago

Calling this fixed by #421 (and #470), though see https://github.com/observablehq/plot/pull/422#issuecomment-896278326.