tidyverse / ggplot2

An implementation of the Grammar of Graphics in R
https://ggplot2.tidyverse.org

date-color handling default is inconsistent #5955

Closed mkoohafkan closed 3 months ago

mkoohafkan commented 3 months ago

Consider the following dummy data:

library(dplyr)
library(ggplot2)

dummy = structure(list(date = structure(c(1600, 2487, -19, 18874, -1399, 
11317, 14305, 18243, 9749, 2438, 1316, 16388, -6202, -8418, -7626,
5893, 12640, 15081, -307, 9338, -354, 8533, 10286, 4053, 13599,
17817, -7314, -6072, 8072, -1439, 5657, 5773, 4671, 4181, 4163,
12228, 12862, 11345, 7762, 18785, 15429, 13123, 12809, 5138,
9737, 18997, -2008, -1856, 17046, 370, 16415, 4680, 18523, 17137,
3821, 15812, 545, 13807, 14378, -6642, 5415, 17762, -4702, -1519,
3777, 15052, 13027, 773, 6967, -3866, 2970, 4611, 9750, 1211,
3506, -8261, 10233, -557, 6619, 1815, 15321, 6687, 12141, 16965,
5565, -6286, 2691, 14073, 4201, 9048, -6601, 15888, 593, -3762,
14961, 3190, 17696, -2482, -7543, 10766), class = "Date"), x = c(152,
301, 148, 168, 14000, 111, 2900, 323, 127, 53, 360, 41.7, 308,
145, 921, 1340, 46, 232, 331, 11, 145, 182, 1410, 157, 235, 338,
300, 446, 547, 722, 75.9, 21, 98, 52.8, 58, 92, 534, 206, 846,
111, 155, 202, 170, 978, 51, 400, 954, 199, 12.7, 3300, 71, 41,
4860, 44.8, 193, 433, 209, 6.74, 162, 37, 26.3, 483, 244, 43,
380, 340, 494, 1180, 136, 1200, 587, 200, 329, 2300, 36, 460,
1830, 66, 437, 887, 27, 502, 380, 32.2, 238, 26, 88, 26.6, 43.1,
224, 411, 174, 117, 670, 72.3, 111, 216, 530, 751, 84), y = c(1.46,
1.63, NA, 1.36, NA, 0.7, 6.33, 1.82, 0.83, 1.06, NA, 0.65, NA,
NA, NA, 2.93, 0.49, 1.37, NA, 0.42, NA, 1.07, 3.55, 1.41, 1.01,
1.9, NA, NA, 1.74, NA, 0.94, 0.73, 1.08, 0.96, 1.03, 0.86, 1.81,
1.12, 2.55, 1.09, 1.05, 1.04, 1.12, 2.45, 0.61, 2.38, NA, NA,
0.33, NA, 0.9, 0.85, 9.67, 0.62, 1.24, 2.27, NA, 0.24, 1.23,
NA, 0.8, 2.41, NA, NA, 2.01, 1.72, 1.52, NA, 1.44, NA, 2.19,
3.62, 1.3, NA, 0.85, NA, 4.16, NA, 1.79, 2.79, 0.5, 1.88, 1.63,
0.48, 1.43, NA, 1.12, 0.38, 0.93, 1, NA, 1.19, NA, NA, 0.8, 1.22,
1.58, NA, NA, 0.59)), row.names = c(NA, -100L), class = c("tbl_df",
"tbl", "data.frame"))

When I specify a color = date aesthetic without a corresponding scale_color_*() function, I get a nice color bar guide with breaks by year:

  ggplot(dummy) +
    aes(x = x, y = y, color = date) +
    geom_point()

image

However, when I specify scale_color_continuous() or scale_color_gradient(), I lose the label formatting on the color bar guide:

  ggplot(dummy) +
    aes(x = x, y = y, color = date) +
    geom_point() +
    scale_color_continuous()

image

The first result is accomplished explicitly with

scale_color_continuous(trans = scales::transform_date())

The issue isn't so much that scale_color_*() doesn't automatically format by date, but rather that the (arguably better) unspecified color scale behavior is different from a plain scale_color_continuous().
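For context on why the labels differ, here is a minimal base R sketch, not taken from ggplot2 itself. A Date vector is stored as a number of days since 1970-01-01, so a scale without a date transformation presumably has only those raw numbers to label. The two non-zero values below are taken from the dummy data; the variable names are made up for illustration.

```r
# A Date is a double counting days since 1970-01-01, plus a class attribute
d <- as.Date(c(0, 1600, 18874), origin = "1970-01-01")

# Dropping the class exposes the numbers a plain continuous scale would see
raw_days <- unclass(d)
print(raw_days)

# Formatting the same values as dates recovers year-style labels
year_labels <- format(d, "%Y")
print(year_labels)
```

This is only meant to show where unformatted numbers can come from, not how ggplot2 builds its guides internally.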

teunbrand commented 3 months ago

I'm not sure if I'm understanding the issue correctly. Is the issue that scale_colour_date() (which is what an unspecified colour scale defaults to, based on the data type) has different label formatting than scale_colour_continuous()?

mkoohafkan commented 3 months ago

@teunbrand, yeah, basically I would expect that adding scale_color_continuous() with no arguments to the above plot would give the same result as not specifying a scale_color_*() call at all, so I'd expect these two plots to be identical:

p1 = ggplot(dummy) +
    aes(x = x, y = y, color = date) +
    geom_point()

p2 = p1 + scale_color_continuous()

# expected: p1 and p2 produce visually identical plots

I guess the other way to phrase the issue is that I'm surprised by the automatic scales::transform_date() when no color scale is specified at all.

teunbrand commented 3 months ago

I can understand the confusion as the scale system is sort-of a mixup between scale types (e.g. scale_colour_date()) and scale palettes (e.g. scale_colour_brewer()). Maybe this date behaviour surprises you, but I'm sure people are familiar with character/factor variables invoking different default scales than numeric variables. What would you propose to change?

mkoohafkan commented 3 months ago

Oh wow, I had no idea scale_color_date() existed! So that explains that. Note that this is not an issue of factor vs numeric (in this example both cases interpret date as continuous); it is a question of which default transformation is being used.
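A minimal sketch of what that implies in practice, assuming ggplot2 is installed: adding scale_color_date() by hand should select the same scale that a Date column picks up on its own. The `toy` data frame is a made-up stand-in for the dummy data above, and the object names are hypothetical.

```r
library(ggplot2)

# Tiny stand-in for the `dummy` data (values are illustrative only)
toy <- data.frame(
  date = as.Date(c("1974-05-20", "1996-09-11", "2021-09-04")),
  x = c(152, 301, 168),
  y = c(1.46, 1.63, 1.36)
)

# No colour scale given: the default for a Date column is used
p_default <- ggplot(toy) +
  aes(x = x, y = y, color = date) +
  geom_point()

# Spelling out the date scale explicitly instead of relying on the default
p_explicit <- p_default + scale_color_date()
```

Printing `p_default` and `p_explicit` side by side is a quick way to check whether the two guides match on a given ggplot2 version.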

Maybe this would be too verbose, but it would be neat if we could be informed which defaults were being used (similar to how {dplyr} informs you when you don't specify by in join functions or .groups in summarize()), e.g.

p1 = ggplot(dummy) +
    aes(x = x, y = y, color = date) +
    geom_point()
## Using `scale_color_date()`

teunbrand commented 3 months ago

The factor vs numeric is the exact same mechanism that lets the date class pick scale_colour_date() instead of scale_colour_continuous(), namely through the scale_type() methods.
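As a rough illustration of that mechanism, the sketch below calls scale_type() on a few vector types. It assumes ggplot2 is installed and uses the `:::` form that appears elsewhere in this thread; the list name `scale_type_of` is made up for the example.

```r
# scale_type() dispatches on the class of the mapped vector; the result
# names the family of default scales that will be looked up.
scale_type_of <- list(
  numeric   = ggplot2:::scale_type(1.5),
  character = ggplot2:::scale_type("a"),
  factor    = ggplot2:::scale_type(factor("a")),
  date      = ggplot2:::scale_type(Sys.Date())
)

str(scale_type_of)
```

For a Date the result has two entries, which suggests a fallback order: a date scale is tried first, then a continuous one.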

I'm not sure that increasing the verbosity is a move welcomed by most. Default scales are implied in >50% of all plots, I'd bet, so this would create some distraction while not requiring user intervention.

mkoohafkan commented 3 months ago

> ggplot2:::scale_type(Sys.Date())
[1] "date"       "continuous"

Fair point that increasing verbosity is maybe not the way to go. I'll go ahead and close this issue.

clauswilke commented 3 months ago

The issue is closed already, but I also wanted to second not making ggplot2 more verbose. In my experience teaching ggplot2 to many hundreds of students, I have observed over and over again that they interpret these types of informative messages as errors. While these messages can be helpful to advanced users, they are scary to beginners.

teunbrand commented 3 months ago

I agree with what Claus is saying in principle. There are a few informative messages that bully users into doing the right thing and I think those can be useful. E.g. geom_histogram() bullies people into setting (better) bins or binwidth and likewise geom_smooth() begs to declare formula/method.
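A small sketch of the histogram case, assuming ggplot2 is installed and using the built-in mtcars data purely as an example. Leaving the bin count to the default is what prompts the message when the plot is drawn; stating bins (or binwidth) makes the choice explicit. The object names are hypothetical.

```r
library(ggplot2)

# Relying on the default bin count is what triggers the nudge when drawn
p_nudged <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram()

# Declaring bins (or binwidth) up front makes the choice explicit
p_quiet <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10)
```

The value 10 is arbitrary here; the point is only that an explicit choice replaces the implied default.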