tidyverse / ggplot2

An implementation of the Grammar of Graphics in R
https://ggplot2.tidyverse.org

Simplify alignment for column geoms #4899

Closed wurli closed 2 years ago

wurli commented 2 years ago

Currently the alignment of columns is always centre, which may not always be desired. E.g. in the following case, values of month always give the first of the month, but are used to indicate the whole month (as is fairly common practice):

library(ggplot2)
library(dplyr, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)

df <- tibble(
  month = as_date(c("2020-01-01", "2020-02-01", "2020-03-01")),
  value = 1:3
)

ggplot(df, aes(month, value)) +
  geom_col() +
  scale_x_date(date_labels = "%b %d")

In this case an align argument to geom_col() would be really useful to align the columns with the first of each month. align could accept values "centre" (the default), "right" and "left", which would be the option used here. The current alternatives are to use position = position_nudge(), which is fairly esoteric for such a simple task (and wouldn't always work that well, e.g. since February only has 28 days), or to instead use geom_rect(), which again seems much too complex for such a simple task.

If you agree that this sounds like a useful feature I'd be happy to submit a PR.

As always, thanks for the hard work on this beautiful package!

(N.B. this example is a bit contrived due to the use of scale_x_date(), but it's the simplest example I could think of.)
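
For reference, the geom_rect() alternative mentioned above could look roughly like the sketch below. It assumes each bar should run from the first of its month to the first of the next month; the next_month column is a hypothetical helper added only for this illustration.

```r
library(ggplot2)
library(lubridate, warn.conflicts = FALSE)

df <- data.frame(
  month = as_date(c("2020-01-01", "2020-02-01", "2020-03-01")),
  value = 1:3
)

# Hypothetical helper column: the first day of the following month
df$next_month <- df$month + months(1)

# Each rectangle spans its whole month, so its left edge sits on the 1st
ggplot(df) +
  geom_rect(aes(xmin = month, xmax = next_month, ymin = 0, ymax = value)) +
  scale_x_date(date_labels = "%b %d")
```

This gets the alignment right but needs an extra column and four aesthetics, which is the complexity the align proposal is meant to avoid.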

yutannihilation commented 2 years ago

Thanks for the suggestion, but this doesn't sound convincing to me. The problem in this example looks like the width of each bar rather than the alignment. (I'm not sure whether your example is meant to show daily values or monthly values.)

library(ggplot2)
library(dplyr, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)

df <- tibble(
  month = as_date(c("2020-01-01", "2020-02-01", "2020-03-01")),
  value = 1:3
)

ggplot(df, aes(month, value)) +
  geom_col(width = 1) +
  scale_x_date(date_labels = "%b %d")

Created on 2022-07-23 by the reprex package (v2.0.1)

wurli commented 2 years ago

Apologies, my initial example was a bit rushed and possibly didn't show my issue clearly enough. Perhaps this edit will help clarify. Here, columns indicate monthly totals of value, but points show the more granular figures. In this example, each bar should overlap with three points, but clearly this isn't what happens by default, although an align argument would make it really easy.

library(ggplot2)
library(dplyr, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)

df <- tibble(
  month = seq.Date(as_date("2020-01-01"), as_date("2020-03-31"), length.out = 9),
  value = 1:9
)

ggplot(df, aes(month, value)) +

  # Columns show totals for each month. Here the values of `month` are always
  # the first day of the month. The obvious solution is 'don't do this', but I'd
  # argue that it's such common practice that ggplot2 should facilitate this
  # sort of approach
  geom_col(
    data = ~ .x |>
      group_by(month = floor_date(month, "month")) |>
      summarise(value = sum(value))
  ) +

  # Points show the more granular values
  geom_point()

Created on 2022-07-25 by the reprex package (v2.0.1)

I guess the broader point is that the current behaviour is fine if using a discrete axis, which is probably the case for 90% of bar charts. For the remaining 10% which use a continuous axis it's not (as) obvious how the bar should be aligned, so I'd argue a bit more control is warranted. For this sort of thing, the current position_nudge() approach doesn't quite hit the spot in my opinion.

yutannihilation commented 2 years ago

Ah, sorry, I didn't get your point. So, is this the plot you want to draw?

library(ggplot2)
library(dplyr, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)

df <- tibble(
  month = seq.Date(as_date("2020-01-01"), as_date("2020-03-31"), length.out = 9),
  value = 1:9
)

width <- 0.9 * 30
ggplot(df, aes(month, value)) +
  geom_col(
    data = ~ .x |>
      group_by(month = floor_date(month, "month")) |>
      summarise(value = sum(value)),
    position = position_nudge(x = width / 2),
    width = width
  ) +
  geom_point()

Created on 2022-07-25 by the reprex package (v2.0.1)

wurli commented 2 years ago

I think it's very close. Correct me if I'm wrong, but I think that usually the width of the columns would be 0.9 * 29, not 0.9 * 30. I only know this from looking at the geom_col() source code - I think it'd be calculated roughly as follows:

res <- df$month |> 
  floor_date("month") |> 
  unique() |> 
  as.numeric() |> 
  resolution(zero = FALSE)

res
#> [1] 29

res * 0.9
#> [1] 26.1

I'm also not sure the left border of the column should exactly line up with the first of each month. With the default behaviour, some padding is added to the left and right of the column. It feels like this should possibly be the case with align = "left" too. Meaning you'd have width = 0.9 * 29 and position = position_nudge(x = (29 * 0.5) + (29 * 0.05)). Possibly having it 'flush' makes more sense though.
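
To make the two options concrete, here is a small arithmetic sketch of where the bar edges land, measured in days after the first of the month. It takes res = 29 from the calculation above; extent() is a hypothetical helper for this illustration, not a ggplot2 function.

```r
res <- 29            # resolution in days, taken from the calculation above
width <- 0.9 * res   # bar width: 90% of the resolution

# Hypothetical helper: edges of a bar of `width` whose centre is nudged by
# `nudge` days away from the first of the month
extent <- function(nudge, width) {
  c(left = nudge - width / 2, right = nudge + width / 2)
}

flush     <- extent(nudge = width / 2, width = width)
padded    <- extent(nudge = res * 0.5 + res * 0.05, width = width)
symmetric <- extent(nudge = res * 0.5, width = width)

rbind(flush, padded, symmetric)
```

The flush option runs from 0 to 26.1 days. The padded nudge suggested above runs from 2.9 to 29 days, so all of the padding ends up on the left. A nudge of res * 0.5 alone would leave 1.45 days of padding on each side of the 29-day slot.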

Anyway, I think this somewhat demonstrates what I'm trying to say - to achieve this a user has to know some fairly obscure details:

Seems to me much simpler to just add an align argument. Any thoughts? Thanks for bearing with.

yutannihilation commented 2 years ago

Thanks, I think your calculation is correct. I agree it might make sense.

wurli commented 2 years ago

Would you be happy to review a PR if I submitted one? To be honest I think it'd be quite simple to implement.

yutannihilation commented 2 years ago

Yes, I'm happy to review. I too feel the implementation won't be very complicated.

One thing I'd like to discuss here is the interface. In my opinion, hjust is better than align. hjust is more general and the horizontal positions are not necessarily limited to only the 3 values (center, right, left). You can just put hjust into xmin = x - width * (1 - hjust), xmax = x + width * hjust. But I agree align might be more intuitive to users.
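
As a minimal sketch of that formula (bar_extent() is a hypothetical helper, not part of ggplot2, and the convention for 0 and 1 is exactly the one written in the formula above):

```r
# Hypothetical helper implementing
#   xmin = x - width * (1 - hjust), xmax = x + width * hjust
bar_extent <- function(x, width, hjust = 0.5) {
  data.frame(
    hjust = hjust,
    xmin  = x - width * (1 - hjust),
    xmax  = x + width * hjust
  )
}

# Three justifications for a bar of width 4 anchored at x = 10
extents <- do.call(rbind, lapply(c(0, 0.5, 1), function(h) bar_extent(10, 4, h)))
extents
```

With hjust = 0.5 the bar is centred on x (8 to 12). Under this formula hjust = 1 puts the bar's left edge on x (10 to 14) and hjust = 0 puts its right edge on x (6 to 10). Any value in between is also valid, which is the extra generality over a three-valued align.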

wurli commented 2 years ago

Great, I'll get working on something.

Good point about the interface. I think, for the sake of consistency, you're right that hjust is better, not least because vjust would also be the obvious counterpart when using horizontal bars.

Possible second feature

This may be out of scope for this discussion, but another gripe I occasionally have with bar geoms is that it's only possible to 'base' the bars at x = 0 or y = 0. As an analogy, geom_area() has a more fine-grained counterpart geom_ribbon() which allows you to adjust the position of the base, but geom_col() has no such counterpart. I'd tentatively suggest adding arguments xmin/ymin to geom_col() to give some control here. One possible use-case would be in the creation of plots like the following:

library(ggplot2)
library(dplyr, warn.conflicts = FALSE)

df1 <- tibble(
  x = c(1, 1, 2, 2),
  y = c(-2, 1, -1, 2),
  fill = c("a", "b", "c", "d")
)

df2 <- tibble(
  xmin = c(1, 2) - 0.45,
  xmax = c(1, 2) + 0.45,
  ymin = c(-2, -1),
  ymax = c(1, 2)
)

ggplot(df1) +
  geom_col(aes(x, y, fill = fill)) +

  # The only way to achieve a border around the columns is to simulate a column
  # geom using `geom_rect()`, which requires a lot of knowledge about how 
  # width/resolution are calculated.
  geom_rect(
    aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
    colour = "black", fill = "transparent",
    data = df2
  )

Created on 2022-07-26 by the reprex package (v2.0.1)

This is again a fairly obscure use, but my opinion is that exposing xmin/ymin arguments (possibly only to geom_col()) would simply offer a bit of additional flexibility in a rather intuitive way. Another (more obvious) use-case is for when bars should simply begin at a different y-value because that's what the data dictates. Currently the way to approach such problems would probably be to adjust the y-scale (e.g. using scale_y_continuous(labels = ...)). If I'm putting in a PR this might be a good place to sneakily add such a feature 😄
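
To show what "a lot of knowledge" means here, the outline rectangles could be derived from df1 instead of being typed by hand. This is only a sketch: it assumes the default bar width is 90% of the smallest gap between x values, and that positive and negative values are stacked separately. The outline data frame and half_width are names made up for this illustration.

```r
library(ggplot2)

df1 <- data.frame(
  x = c(1, 1, 2, 2),
  y = c(-2, 1, -1, 2),
  fill = c("a", "b", "c", "d")
)

# Assumption: default bar width is 0.9 times the smallest gap between x values
half_width <- 0.9 * min(diff(sort(unique(df1$x)))) / 2

# One outline per x: negatives stack downwards, positives stack upwards
outline <- data.frame(x = sort(unique(df1$x)))
outline$ymin <- vapply(outline$x, function(v) sum(pmin(df1$y[df1$x == v], 0)), numeric(1))
outline$ymax <- vapply(outline$x, function(v) sum(pmax(df1$y[df1$x == v], 0)), numeric(1))
outline$xmin <- outline$x - half_width
outline$xmax <- outline$x + half_width

ggplot(df1) +
  geom_col(aes(x, y, fill = fill)) +
  geom_rect(
    aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
    data = outline, colour = "black", fill = NA
  )
```

For this data the derived values match the hand-written df2 above.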

thomasp85 commented 2 years ago

Sorry to reopen this in the eleventh hour of release. While writing the blog post I realise that I feel the meaning of 0 and 1 is backwards. In my head the just argument defines the justification of the bar relative to the axis break (so just = 0 would place the left side of the bar at the axis break), but in actuality it is the reverse.

Any objections to me switching it around before release?

wurli commented 2 years ago

> Any objections to me switching it around before release?

Personally I'm a bit torn. My intuition is 0 = further left, 1 = further right, which is currently the case if your point of reference is the x-axis, but not if your point of reference is the bar itself. I find the former more intuitive, but happy to go with what you think as you'll be more familiar with the conventions in ggplot2.

clauswilke commented 2 years ago

Throughout ggplot, we're using justification values in two different contexts. Let's explain them for the case of hjust. (The same applies to vjust, just vertically.) The first is how an object is placed relative to a reference point. In this case, hjust = 0 means that the object is placed such that the reference point is at the left-most location of the object, its own internal x = 0 so to speak. And similarly, hjust = 1 means that the reference point is at the right-most location. Visually, this looks like hjust = 0 moves the object to the right, and hjust = 1 moves the object to the left.

The second is how an object is placed relative to a reference range. This is the case for example in the placement of the axis title relative to the horizontal extent of the plot. In this case, hjust = 0 means move the object all the way to the left so its left side is aligned with the left end of the reference range, and hjust = 1 does the opposite. Thus, hjust = 0 moves the object to the left, and hjust = 1 moves the object to the right.

To be consistent with the rest of ggplot, here, I think we need to figure out whether we're operating in the first or the second context, and apply the justification accordingly. I haven't looked into this too closely, but it sounds to me like we're operating under context 1, and therefore just = 0 should mean the bar sits to the right of the axis break, such that its left side is aligned with the break.
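
A rough sketch of what context 1 would mean for a bar, treating the axis break as the reference point. The helper name bar_extent_ctx1() is made up for this illustration, and this is not meant to represent the actual ggplot2 source.

```r
# Context 1: `just` is the position of the reference point (the axis break)
# inside the bar, with 0 = the bar's left edge and 1 = its right edge
bar_extent_ctx1 <- function(x, width, just = 0.5) {
  data.frame(
    just = just,
    xmin = x - width * just,
    xmax = x + width * (1 - just)
  )
}

# A bar of width 1 at the break x = 1
ctx1 <- do.call(rbind, lapply(c(0, 0.5, 1), function(j) bar_extent_ctx1(1, 1, j)))
ctx1
```

Here just = 0 gives a bar from 1 to 2, sitting to the right of the break with its left side on the break, and just = 1 gives a bar from 0 to 1.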

wurli commented 2 years ago

I was curious so took a look at other geoms - to me, behaviour doesn't seem to be that consistent, but maybe there's a rule I haven't spotted.

clauswilke commented 2 years ago

It's possible geom_raster() was implemented thinking about it the opposite way, using the extent of the raster as the reference range and the point on the plot as the thing that is positioned relative to the reference range.

Context 1 is used all over grid, in the way I've described. Also, legend justification follows context 1, if I remember correctly.

yutannihilation commented 2 years ago

Thanks. When I reviewed the pull request, I didn't consider the semantics of *just carefully.

I'm not sure I'm for or against the suggestion at the moment, but, at least, the behavior of geom_col() is consistent with geom_raster().

library(ggplot2)

d <- expand.grid(x = 1, y = 1:2)

ggplot(d, aes(x, y)) +
  geom_raster(hjust = 1, fill = "red", alpha = 0.5) +
  geom_raster(hjust = 0, fill = "blue", alpha = 0.5) +
  coord_equal()


ggplot(d[1,], aes(x, y)) +
  geom_text(size = 20, hjust = 1, label = "hjust = 1", colour = "red", alpha = 0.5) +
  geom_text(size = 20, hjust = 0, label = "hjust = 0", colour = "blue", alpha = 0.5) +
  coord_equal()


ggplot(d[1,], aes(x, y)) +
  geom_col(width = 1, just = 1, fill = "red", alpha = 0.5) +
  geom_col(width = 1, just = 0, fill = "blue", alpha = 0.5) +
  coord_equal()

Created on 2022-10-28 with reprex v2.0.2