tidyverse / ggplot2

An implementation of the Grammar of Graphics in R
https://ggplot2.tidyverse.org
Other
6.48k stars 2.02k forks source link

Cryptic error when mis-specifying linetype #5864

Closed njspix closed 5 months ago

njspix commented 5 months ago

When 'linetype' is mis-specified, the resulting error is extremely confusing, and could be misleading if a color is also specified for the line. I would expect the error to read something like 'dash' is not a valid linetype.

Reprex:

library(tidyverse)
tibble(
  x = rnorm(50),
  y = rnorm(50)
) |> 
  ggplot()+
  geom_point(aes(x, y))+
  geom_abline(linetype = "dash", color = "gray")
#> Error in grid.Call.graphics(C_segments, x$x0, x$y0, x$x1, x$y1, x$arrow): invalid hex digit in 'color' or 'lty'

tibble(
  x = rnorm(50),
  y = rnorm(50)
) |> 
  ggplot()+
  geom_point(aes(x, y))+
  geom_abline(linetype = "dash")
#> Error in grid.Call.graphics(C_segments, x$x0, x$y0, x$x1, x$y1, x$arrow): invalid hex digit in 'color' or 'lty'

tibble(
  x = rnorm(50),
  y = rnorm(50)
) |> 
  ggplot()+
  geom_point(aes(x, y))+
  geom_abline(linetype = "dashed")

Created on 2024-04-25 with reprex v2.0.2

teunbrand commented 5 months ago

Thanks for the report! While I agree that this is a poor error message, this error isn't really under ggplot2's control as it is thrown from {grid}. Unrelated to linetype, I'm not sure that doing (computationally 'expensive') checks on input data is the best idea as it can hamper some performance.

njspix commented 5 months ago

Thanks for clarifying. I understand better now why the error is thrown, seems like dash is getting passed as a hex string to grid, and of course s and h are not valid hex.

Related to performance, I'm not a maintainer of a huge codebase used by 100k's of people, so I don't understand everyone's use case. However I fail to see how input validation could impose a substantial performance cost. I'll venture that the time it took me to understand this error was several orders of magnitude greater than the runtime of e.g. linetype %in% c("dashed", "dot" ...) | str_detect(linetype, "^[0-9a-f]+$")...

If performance is a concern (e.g. for folks making 1000's of plots), why not include a validate_input argument that defaults to true?

teunbrand commented 5 months ago

Please don't take the following personally, I do really empathise with the issue of encountering bad error messages. However, I have some remarks about why I think ggplot2 shouldn't validate all data input.

I'll venture that the time it took me to understand this error was several orders of magnitude greater

Yes, but the more relevant comparison is whether the runtime of the checks in all the plots that you ever made that didn't throw warnings exceeds the time it took for you to understand this error.

However I fail to see how input validation could impose a substantial performance cost.

For the sake of argument, let's say ggplot implements a validity check for linetype. Next, people are going to ask ggplot to also do a validity check for colour. Then, shape, or maybe the interaction between shape and fill. So the ask to do input validation for linetype can easily escalate to do many input validation steps.

Aside from the escalation argument, it should be noted that {grid} already does validity checks (as evident by the errors you've encountered). I appreciate a good error message as much as the next person, but I don't think ggplot2 should reimplement hairy parts of grid to throw friendlier messages. It might be more appropriate to petition {grid} directly to improve the messages.

Now any one validation step might not be expensive in 99% of plots, but let me illustrate the following. The function below checks whether every value in a vector is a valid colour that {grid} understands.

# perhaps there is a faster way, but this is what I got at the moment
is_color <- function(x) {
  grepl("^#(([[:xdigit:]]{2}){3,4}|([[:xdigit:]]){3,4})$", x) |
    x %in% grDevices::colours()
}

In some fields, it is not uncommon to plot giant scatterplots with millions of points. To check if a million values are valid colours takes ~100ms.

input <- sample(c(colours(), NA, "foobar"), 1e6, replace = TRUE)

bench::mark(is_color(input))
#> # A tibble: 1 × 6
#>   expression           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 is_color(input)    116ms    116ms      8.62    22.9MB     43.1

However, the rest of the plumbing in making a plot (here represented by a small scatterplot) only takes ~35ms. So relative to the time it takes to render a plot, the check is 'expensive'.

library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point()

tmp <- tempfile(fileext = ".pdf")
pdf(tmp)

bench::mark(print(p))
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 print(p)     33.2ms   35.3ms      28.2    6.76MB     43.2

dev.off()
#> png 
#>   2

Created on 2024-04-25 with reprex v2.1.0

njspix commented 5 months ago

Ah, that makes perfect sense. Thank you for clarifying!

I don't know that the error message is particularly bad from grid's perspective. For all it knows, it's just being passed an invalid hex color. And trying to ask them to change the error message just for ggplot creates a sort of 'accidental api', which I doubt would be mutually beneficial.

It's a bigger ask, but perhaps we could consider whether the 'linetype' argument is a little too flexible? I understand the attraction of flexibility, but I really think the error message is pretty bad... :-)

teunbrand commented 5 months ago

I don't think limiting what can be considered a linetype is the right way forward, as it'll hurt backward compatibility. All valid input is documented in the vignette. Grid's error message could just be Invalid linetype: "dash" like they do for colours. It doesn't need to be tailored to ggplot2. Saying that the hex code is wrong (incorrectly) assumes that the user was trying to pass a linetype as a hex code, which might start off the user on the wrong path to solving the issue.

In any case, while I agree with you that the error message can be improved, I don't think it is ggplot2's place to do input validation for the reasons outlined in my previous message. I'll close the issue here and we can reopen if other maintainers disagree with me.

njspix commented 5 months ago

Great, thank you very much!