tidyverse / ggplot2

An implementation of the Grammar of Graphics in R
https://ggplot2.tidyverse.org

`stat_bin()` calculates wrong numbers when a `geom_col()` with custom `data` is present #5829

Closed fkohrt closed 4 months ago

fkohrt commented 4 months ago

In an attempt to create a bar plot while binning data, I have stumbled across a surprising interaction between geom_col() and geom_text().

The following code is supposed to do two things:

  1. Create bars with geom_col() by calculating data on my own
  2. Annotate the bars with geom_text() by using stat_bin() to calculate the data

faithful |>
  ggplot2::ggplot(ggplot2::aes(x = eruptions)) +
  ggplot2::geom_col(
    data = function(x) {
      bins <- ggplot2:::bin_breaks_bins(
        x_range = range(faithful$eruptions),
        bins = 30,
        center = NULL,
        boundary = 4,
        closed = "right"
      )
      h <- hist(
        faithful$eruptions,
        plot = FALSE,
        breaks = bins$breaks
      )
      data.frame(mids = h$mids, count = h$counts)
      # Alternatively, return ggplot2:::bin_vector(x = faithful$eruptions, bins = bins)
      # and map x = xmin + width/2
    },
    mapping = ggplot2::aes(
      x = mids,
      y = count
    ),
    inherit.aes = FALSE
  ) +
  ggplot2::geom_text(
    mapping = ggplot2::aes(
      y = ggplot2::after_stat(count),
      label = ggplot2::after_stat(count)
    ),
    stat = "bin",
    bins = 30,
    binwidth = NULL,
    center = NULL,
    boundary = 4,
    closed = "right"
  )

However, this leads to surprising behavior in two regards:

  1. The numbers calculated by stat_bin() are off.
  2. They become correct if the whole geom_col() is commented out.

Is this somehow correct behavior or a bug?

Using ggplot2 3.5.0 with R 4.3.3

(Why create the graph above like that in the first place? I deliberately want to avoid geom_histogram() because the count bars are not supposed to touch each other, hence geom_col().)

teunbrand commented 4 months ago

Thanks for the report! It seems that the bin computations do not match between the two layers. I'd just use geom_histogram() and override the width parameter computed by the stat. It throws a warning, but it does the correct thing. We plan to make width a proper aesthetic at some point, so then the warning will be gone.

library(ggplot2)

faithful |>
  ggplot(aes(x = eruptions)) +
  geom_histogram(
    aes(width = after_stat(0.9 * width)),
    bins = 30,
    binwidth = NULL,
    center = NULL,
    boundary = 4,
    closed = "right"
  ) +
  geom_text(
    mapping = aes(
      y = after_stat(count),
      label = after_stat(count)
    ),
    stat = "bin",
    bins = 30,
    binwidth = NULL,
    center = NULL,
    boundary = 4,
    closed = "right"
  )
#> Warning in geom_histogram(aes(width = after_stat(0.9 * width)), bins = 30, :
#> Ignoring unknown aesthetics: width

Created on 2024-04-06 with reprex v2.1.0

teunbrand commented 4 months ago

Yeah, I can confirm that the warning would be fixed by #5807, so you won't need to circumvent geom_histogram() with custom data, which should make the discrepancy disappear.

fkohrt commented 4 months ago

I agree that setting width is an elegant way to achieve my goal, but isn't the interaction between geom_col() and stat_bin() still worth tracking?

teunbrand commented 4 months ago

I think the interaction arises because the bin breaks in the text layer are calculated from the x-axis range of the plot as a whole. The range of h$mids is slightly larger than range(faithful$eruptions), resulting in different bins in the text layer. Computing bins over the full plot range is intended, so I don't think this is a bug.
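
For anyone who wants to check this explanation, here is a minimal sketch rather than a definitive diagnosis. It assumes hand-picked breaks (`brks`, 30 bins of width 0.125 with a boundary at 4) instead of the internally computed ones from the report, and `count_labels()` is a hypothetical helper introduced only for this example. It uses `layer_data()` to pull out what the text layer computes with and without the column layer, and then passes the same breaks to `stat_bin()` through its `breaks` argument, which should make the text layer independent of the x-scale range.

```r
library(ggplot2)

# Assumption: hand-picked breaks (30 bins of width 0.125, boundary at 4),
# not the breaks ggplot2 derives internally in the original report.
brks <- seq(1.5, 5.25, by = 0.125)

# Pre-binned data for the column layer, as in the report.
h <- hist(faithful$eruptions, breaks = brks, plot = FALSE)
binned <- data.frame(mids = h$mids, count = h$counts)

# The bin midpoints span a wider range than the raw data, so adding them
# to the plot widens the x scale that stat_bin() sees.
range(faithful$eruptions)
range(binned$mids)

# Hypothetical helper: a text layer whose labels are stat_bin() counts.
count_labels <- function(...) {
  geom_text(
    aes(y = after_stat(count), label = after_stat(count)),
    stat = "bin", closed = "right", ...
  )
}

p0 <- ggplot(faithful, aes(x = eruptions))
cols <- geom_col(aes(x = mids, y = count), data = binned, inherit.aes = FALSE)

# Same stat_bin() settings, without and with the column layer.
alone <- layer_data(p0 + count_labels(bins = 30, boundary = 4), 1)
both  <- layer_data(p0 + cols + count_labels(bins = 30, boundary = 4), 2)

# Explicit breaks: the text layer no longer derives its bins from the scale range.
fixed <- layer_data(p0 + cols + count_labels(breaks = brks), 2)

# Compare label positions and counts across the three variants.
head(alone[, c("x", "count")])
head(both[, c("x", "count")])
head(fixed[, c("x", "count")])
```

If the explanation holds, `alone` and `both` will show different `x` positions for the labels, while `fixed` should line up with `binned$mids`. Sharing one set of breaks between `hist()` and `stat_bin()` is only one way to keep the two layers in sync. Overriding the width in geom_histogram(), as suggested earlier in the thread, avoids the second layer of binning altogether.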