tidyverse / ggplot2

An implementation of the Grammar of Graphics in R
https://ggplot2.tidyverse.org

`stat_bin()` should have the area (instead of height) represent the count. #5895

Open mattansb opened 1 month ago

mattansb commented 1 month ago

Histograms convert counts within bins into areas.

However, in ggplot2, the default behavior is to convert counts to bar heights.

This discrepancy is typically not noticeable, because stat_bin() defaults to equi-width bins. However, it becomes apparent when using non-equi-width bins.

Here is an example with equi-probable bins, in which each column should have (roughly) the same area.

library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 4.3.3

breaks <- quantile(mtcars$mpg, probs = seq(0, 1, length.out = 5))

cut(mtcars$mpg, 
    breaks = breaks, 
    include.lowest = TRUE) |> 
  table()
#> 
#> [10.4,15.4] (15.4,19.2] (19.2,22.8] (22.8,33.9] 
#>           8           9           8           7

ggplot(mtcars, aes(mpg)) +
  stat_bin(color = "black", breaks = breaks)

# default: mapping = aes(y = after_stat(count))

But it should look like this:

ggplot(mtcars, aes(mpg)) +
  stat_bin(color = "black", breaks = breaks,
           mapping = aes(y = after_stat(sum(count) * density)))

Created on 2024-05-20 with reprex v2.1.0
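
The second mapping works because density is count / (sum(count) * width), so sum(count) * density reduces to count / width, and each bar's area (height times width) comes back to its count. Below is a minimal base R sketch of that check. It assumes hist(plot = FALSE) bins the data the same way as the cut() call above and uses it as a stand-in for stat_bin(); the variable names are illustrative only.

```r
# Sketch: check that sum(count) * density equals count / width, bin by bin.
# hist() is used here as a stand-in for stat_bin(); names are illustrative.
breaks <- unname(quantile(mtcars$mpg, probs = seq(0, 1, length.out = 5)))

h <- hist(mtcars$mpg, breaks = breaks, plot = FALSE)
width <- diff(h$breaks)

height <- sum(h$counts) * h$density   # the mapping proposed above
area <- height * width                # what the eye reads off each bar

# One row per bin: area should reproduce count.
data.frame(count = h$counts, width = width, height = height, area = area)
```

If the check holds, the area column matches the count column, while the height column differs from count wherever the bin width is not 1.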

I suggest:

  1. Adding a new computed variable (perhaps called height) that is equal to sum(count) * density.
  2. Defaulting to this variable for setting the bars' heights, at least when stat_bin() is used with non-equi-width bins.

teunbrand commented 1 month ago

Hi thanks for the suggestions!

Yes, using areas for histograms satisfies the proportional ink principle, but below are a few reasons I don't think we should do it.

mattansb commented 1 month ago

Yes, changing a default is a pain... IMO it's worth it, but I don't have a huge community to serve ;)

At the very least, I think this should be written somewhere in the docs (as this is how histograms are commonly defined*). Additionally, an example using after_stat(count / width) could be added, with equi-width bins, non-equi-width bins, or both.

I'm willing to make (the world's smallest) PR if you'd like.


"If you replace the breaks by binwidth = 0.01, you see several values reach 200 with the proposed metric, whereas the data only has 32 observations in total."

I don't see this as an issue - in PDFs, densities can also exceed 1 - it's just stats being stats 🤷‍♂️
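
Both points can be checked numerically. Here is a rough base R sketch, assuming mpg is recorded to one decimal place, so a 0.01-wide bin can only hold identical values and the fullest bin is just the largest group of ties. The variable names are illustrative, and table() stands in for the actual binning.

```r
x <- mtcars$mpg
binwidth <- 0.01

# Assumption: bins this narrow separate every distinct mpg value,
# so the largest bin count is the largest number of tied observations.
max_count <- max(table(x))

# Height of the fullest bin under the proposed metric, count / width.
max_height <- max_count / binwidth

# A proper density can exceed 1 as well: Uniform(0, 0.5) has density 2
# everywhere on its support.
unif_density <- dunif(0.25, min = 0, max = 0.5)

c(n = length(x), max_count = max_count,
  max_height = max_height, unif_density = unif_density)
```

The bar height is on a per-unit-of-mpg scale, so it is not bounded by the number of observations; only the areas are.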


* I only came to notice this when I was teaching histograms and a student pointed out that my plot didn't match what I had just said.

teunbrand commented 1 month ago

"it's just stats being stats"

Agreed, but it was meant to illustrate how the proposed height departs from counts even for equi-width bins 🤓

Adding an example is a good idea; we'd welcome a PR for this.