`stat_bin()` should have the area (instead of height) represent the count.

mattansb commented 1 month ago

Histograms convert counts within bins into areas.

However, in ggplot2, the default behavior is to convert counts to bar heights.

This discrepancy is typically not noticeable, because stat_bin() defaults to equi-width bins. However, it becomes apparent when using non-equi-width bins.
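To spell out the relation being described (the symbols $n_i$, $w_i$ and $h_i$ are my own notation for the count, width and bar height of bin $i$, not anything from ggplot2):

```latex
\[
  \text{area}_i = h_i \, w_i = n_i
  \quad\Longleftrightarrow\quad
  h_i = \frac{n_i}{w_i}
\]
\[
  w_i = w \ \text{for all } i
  \quad\Longrightarrow\quad
  h_i = \frac{n_i}{w} \propto n_i
\]
```

With a common width the area-based heights are just the counts rescaled by a constant, so the shape of the plot is the same either way. With unequal widths the two diverge.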

Here is an example with equi-probable bins, in which each column should have the same area.

library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 4.3.3


breaks <- quantile(mtcars$mpg, probs = seq(0, 1, length.out = 5))

cut(mtcars$mpg, 
    breaks = breaks, 
    include.lowest = TRUE) |> 
  table()
#> 
#> [10.4,15.4] (15.4,19.2] (19.2,22.8] (22.8,33.9] 
#>           8           9           8           7


ggplot(mtcars, aes(mpg)) +
  stat_bin(color = "black", breaks = breaks)

# default: mapping = aes(y = after_stat(count))

But it should look like this:

ggplot(mtcars, aes(mpg)) +
  stat_bin(color = "black", breaks = breaks,
           mapping = aes(y = after_stat(sum(count) * density)))

Created on 2024-05-20 with reprex v2.1.0
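As a rough numeric check of the second plot, the built layer data can be inspected with `layer_data()`. This is only a sketch: the `area` column is something I compute here for illustration, and the bin width is taken as `xmax - xmin` from the built data rather than from any dedicated column.

```r
library(ggplot2)

breaks <- quantile(mtcars$mpg, probs = seq(0, 1, length.out = 5))

p <- ggplot(mtcars, aes(mpg)) +
  stat_bin(breaks = breaks,
           mapping = aes(y = after_stat(sum(count) * density)))

# Pull the computed bars out of the built plot
bins <- layer_data(p)

# Bar area = height * width (illustrative column, not a ggplot2 variable)
bins$area <- bins$y * (bins$xmax - bins$xmin)

# If area represents the count, the last two columns should agree
bins[, c("xmin", "xmax", "y", "count", "area")]
```

With a single group, `area` and `count` should match up to floating-point error, which is the property the issue is asking for.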

I suggest:

Adding a new computed variable (perhaps called height) that is equal to sum(count) * density.
Defaulting to this variable for setting the bars' heights, at least when stat_bin() is used with non-equi-width bins.

teunbrand commented 1 month ago

Hi thanks for the suggestions!

Yes, using areas for histograms satisfies the proportional ink principle, but below are a few reasons I don't think we should do it.

Users have come to expect counts by default. We have parted with defaults before, but I don't think we should depart from a very clear and simple metric (counts) in favour of more complicated metrics.
Counts and the proposed metric are only the same when the width of the bars is 1. If you replace the breaks by binwidth = 0.01, you see several values reach 200 with the proposed metric, whereas the data only has 32 observations in total.
after_stat(sum(count) * density) sums the counts over all groups, which it shouldn't, as density is calculated within groups. The appropriate metric would be after_stat(count / width). As this is available as a simple combination of already available computed variables, I don't think this merits a novel computed variable.
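The last two points can be checked numerically. The sketch below is illustrative only: grouping by `am`, the helper `total_area()` and the object names are my own choices, and `position = "identity"` is set so that stacking does not alter the bar heights being measured.

```r
library(ggplot2)

breaks <- quantile(mtcars$mpg, probs = seq(0, 1, length.out = 5))

# Hypothetical helper: total area of all bars in the first layer
total_area <- function(p) {
  d <- layer_data(p)
  sum(d$y * (d$xmax - d$xmin))
}

# Two groups: sum(count) runs over both, density is computed per group
p_sum <- ggplot(mtcars, aes(mpg, group = factor(am))) +
  stat_bin(breaks = breaks, position = "identity",
           mapping = aes(y = after_stat(sum(count) * density)))

# Same data, using count / width instead
p_width <- ggplot(mtcars, aes(mpg, group = factor(am))) +
  stat_bin(breaks = breaks, position = "identity",
           mapping = aes(y = after_stat(count / width)))

total_area(p_sum)    # each group's area is the overall total, so about 64
total_area(p_width)  # about 32, the number of observations

# Very narrow equal-width bins: heights are far from the raw counts
p_narrow <- ggplot(mtcars, aes(mpg)) +
  stat_bin(binwidth = 0.01,
           mapping = aes(y = after_stat(count / width)))

max(layer_data(p_narrow)$y)  # about 200: two tied values / 0.01
```

With a single group the two mappings coincide, which is why the difference does not show up in the original reprex.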

mattansb commented 1 month ago

Yes, changing a default is a pain... IMO it's worth it, but I don't have a huge community to serve ;)

At the very least, I think this should be written somewhere in the docs (as this is how histograms are commonly defined*). Additionally, an example with after_stat(count / width) can be added, with or without (or both) non-equi-width bins.

I'm willing to make (the world's smallest) PR if you'd like.

> If you replace the breaks by binwidth = 0.01, you see several values reach 200 with the proposed metric, whereas the data only has 32 observations in total.

I don't see this as an issue - in PDFs, densities can also exceed 1 - it's just stats being stats 🤷‍♂️

* I only came to notice this when I was teaching histograms and a student pointed out that my plot didn't match what I had just said.

teunbrand commented 1 month ago

> it's just stats being stats

Agreed, but it was meant to illustrate how it departs from counts even for equi-width bins 🤓

Adding an example is a good idea, we'd welcome a PR for this.

tidyverse / ggplot2

`stat_bin()` should have the area (instead of height) represent the count. #5895