Open mattansb opened 1 month ago
Hi thanks for the suggestions!
Yes, using areas for histograms satisfies the proportional ink principle, but below are a few reasons I don't think we should do it.
- If you replace the breaks by `binwidth = 0.01`, you see several values reach 200 with the proposed metric, whereas the data only has 32 observations in total.
- `after_stat(sum(count) * density)` sums the counts over groups, which it shouldn't, as `density` is calculated within groups. The appropriate metric would be `after_stat(count / width)`. As this is available as a simple combination of already available computed variables, I don't think this merits a novel computed variable.

Yes, changing a default is a pain... IMO it's worth it, but I don't have a huge community to serve ;)
At the very least, I think this should be written somewhere in the docs (as this is how histograms are commonly defined*). Additionally, an example with `after_stat(count / width)` can be added, with or without (or both) non-equi-width bins.
I'm willing to make (the world's smallest) PR if you'd like.
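For reference, a minimal sketch of what such an example might look like. The data (`mtcars$mpg`), the grouping variable (`am`), the breaks, and the helper `bar_areas()` are assumptions chosen for illustration and are not taken from this thread. It compares the two metrics discussed above on grouped data with unequal bin widths.

```r
library(ggplot2)

# Assumptions (not from the thread): mtcars$mpg as data, `am` as the grouping
# variable, and arbitrary breaks of deliberately unequal width.
breaks <- c(10, 15, 18, 20, 25, 35)
base <- ggplot(mtcars, aes(mpg, fill = factor(am)))

# Height = count / width, so each bar's area is its own count.
p_width <- base +
  geom_histogram(aes(y = after_stat(count / width)),
                 breaks = breaks, position = "identity", alpha = 0.5)

# The originally proposed metric, shown for comparison.
p_sum <- base +
  geom_histogram(aes(y = after_stat(sum(count) * density)),
                 breaks = breaks, position = "identity", alpha = 0.5)

# Hypothetical helper: recover each bar's area from the built layer data.
bar_areas <- function(p) {
  d <- layer_data(p)
  data.frame(group = d$group, count = d$count,
             area = d$y * (d$xmax - d$xmin))
}

bar_areas(p_width)  # area matches the per-group count in every bin
bar_areas(p_sum)    # per the thread, scaled by the count summed over groups
```

Comparing the `count` and `area` columns of the two tables shows where the metrics diverge once there is more than one group.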
> If you replace the breaks by `binwidth = 0.01`, you see several values reach 200 with the proposed metric, whereas the data only has 32 observations in total.
I don't see this as an issue - in PDFs, densities can also exceed 1 - it's just stats being stats 🤷‍♂️
* I only came to notice this when I was teaching histograms and a student pointed out that my plot didn't match what I had just said.
> it's just stats being stats
Agreed, but it was meant to illustrate how it departed from counts even for equi-bins 🤓
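The arithmetic behind both remarks can be sketched in a couple of lines. The numbers are illustrative: a bin holding 2 observations is an assumption, not a value read from the original data.

```r
# A bin of width 0.01 that holds only 2 observations:
count <- 2
width <- 0.01
count / width
#> [1] 200

# Densities behave the same way: a uniform distribution on [0, 0.01]
# has density 100 everywhere on its support.
dunif(0.005, min = 0, max = 0.01)
#> [1] 100
```

So with very narrow bins, `count / width` is far larger than any count in the data, even though each bar's area still equals its count.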
Adding an example is a good idea, we'd welcome a PR for this.
Histograms convert counts within bins into areas.
However, in `ggplot2`, the default behavior is to convert counts to bar heights.

This discrepancy is typically not noticeable, because `stat_bin()` defaults to equi-width bins. However, it becomes apparent when using non-equi-width bins.

Here is an example with equi-probable bins, in which each column should have the same area.
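The original code and plot are not preserved in this copy of the thread. The following is a hedged sketch of the kind of example described, not the author's reprex: it assumes `mtcars$mpg` as the data and quintile breaks as the equi-probable bins.

```r
library(ggplot2)

# Assumption: mtcars$mpg stands in for the data used in the original example.
# Quintile breaks put roughly the same number of observations in each bin,
# so the bins are equi-probable but differ in width.
breaks <- unname(quantile(mtcars$mpg, probs = seq(0, 1, by = 0.2)))

# Default behavior: bar height is the bin count.
p_default <- ggplot(mtcars, aes(mpg)) +
  geom_histogram(breaks = breaks, colour = "white")

p_default
```

With the default mapping, the bars come out at roughly equal heights, so the wider bins are drawn with more area than the narrow ones.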
But it should look like this:
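The original plot is not preserved here either. As a sketch under the same assumptions (`mtcars$mpg`, quintile breaks, neither taken from the thread), the area-based version can be obtained with the `after_stat(count / width)` mapping discussed in this thread:

```r
library(ggplot2)

# Assumption: mtcars$mpg and quintile breaks, as a stand-in for the
# original example's data and bins.
breaks <- unname(quantile(mtcars$mpg, probs = seq(0, 1, by = 0.2)))

# Bar height is count / width, so bar area is the bin count.
p_area <- ggplot(mtcars, aes(mpg)) +
  geom_histogram(aes(y = after_stat(count / width)),
                 breaks = breaks, colour = "white")

p_area
```

Here narrow bins are drawn taller and wide bins shorter, so bins with similar counts cover similar areas.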
Created on 2024-05-20 with reprex v2.1.0
I suggest:
- A new computed variable (`height`) that is equal to `sum(count) * density`
- An example for `stat_bin()` with non-equi-width bins.