tidyverse / ggplot2

An implementation of the Grammar of Graphics in R
https://ggplot2.tidyverse.org
Other
6.39k stars 2k forks source link

ggplot2 ignores the user's input for transparency via NA's in scale_fill_manual() #5929

Open binkleym opened 1 month ago

binkleym commented 1 month ago

I encountered a problem with scale_fill_manual where ggplot was explicitly ignoring the NA (for transparency) I was attempting to use.

The problem appears to be this commit, where the default value for NA's is set to "gray50". Instead of trusting that the user knows what they're doing when they set NA for transparency, it overrides them behind their back without letting them know.

https://github.com/tidyverse/ggplot2/commit/c9adeeddca2ecba9162f231a5c5d326f103e691b

It was quite maddening to try to figure out what was going on, since scale_fill_manual() changes the behavior silently.

It would be useful (and merciful on anyone trying to debug similar things) to have scale_fill_manual() spit out a message on NA's similar to

Setting NA's to "gray50".    Please add 'na.value = NA' if you wish transparency

Code to demonstrate issue:

library(tidyverse)
library(sf)

map_wind <-
  read_sf("https://www.spc.noaa.gov/products/outlook/archive/2024/day1otlk_20240530_0100_hail.nolyr.geojson")

ggplot() +
    theme_bw() + 
    geom_sf(data = map_wind, aes(fill = LABEL), color = NA, show.legend = TRUE) +

    scale_fill_manual(
      limits = c("0.05",     "0.15",    "0.30",    "0.45",    "0.60",   "SIGN"),
      values = c("#c5a393", "#ffeb80", "#ff8080", "#ff00ff", "#912cee", NA)
    )

The image on the left shows the incorrect behavior. scale_fill_manual() is silently replacing my NA (transparency) with 'gray50'.

The image on the right shows how it should look. scale_fill_manual() does what I tell it to do, and renders transparency as actual transparency.

na_value

teunbrand commented 1 month ago

Simplified reprex:

library(ggplot2)

ggplot(mpg, aes(displ, hwy, colour = drv)) +
  geom_point() +
  scale_colour_manual(values = c("red", "blue", NA))

The simple solution would be to use "transparent" instead of NA. Or any transparent colour like "#FFFFFF00".

ggplot(mpg, aes(displ, hwy, colour = drv)) +
  geom_point() +
  scale_colour_manual(values = c("red", "blue", "transparent"))

Created on 2024-06-03 with reprex v2.1.0

I think the scale is doing exactly what is supposed to do by translating NA to na.value.

binkleym commented 1 month ago

The above is contrary to ggplot2's own documentation:

https://ggplot2.tidyverse.org/reference/aes_colour_fill_alpha.html

"An NA, for a completely transparent colour."

By the docs, NA should select transparency, not "gray50". Which I'm not asking to revert, I'm just asking that info be printed when NA's are overridden so that people don't spend hours debugging when default behavior changes.

clauswilke commented 1 month ago

@teunbrand Not sure I follow:

I think the scale is doing exactly what is supposed to do by translating NA to na.value.

In your reprex, NA is not a data value but a color value, i.e. it is being used after translation from data to color. Why should it be converted into na.value? At least conceptually this seems wrong.

It's a question of ordering of operations. Do you first convert all NAs into na.value and then convert all other data values into the respective colors or do you do it the other way round?

I haven't looked at the code, it may be a minor fix changing ordering of operations or it may be a major headache and not worth the cost to fix, but I agree it feels wrong to me the way it currently works.

clauswilke commented 1 month ago

Ok, seems like the issue is that the code uses NA in pal_match as an indication that the respective data value was not found in the palette and hence it cannot distinguish this scenario from cases where there was a match but the match returned NA.

https://github.com/tidyverse/ggplot2/blob/28aec3a90d2798382c7da384fece6227e12092f8/R/scale-.R#L985

Definitely not correct behavior in my opinion but maybe not worth the effort to fix.

teunbrand commented 1 month ago

Why should it be converted into na.value?

If it shouldn't, then na.translate should be set to FALSE. The discrete palette NA handling code lives here:

https://github.com/tidyverse/ggplot2/blob/28aec3a90d2798382c7da384fece6227e12092f8/R/scale-.R#L984-L988

Briefly, pal_match is a translation of user values to palette values. It can be NA because the palette contains NA (as in this issue) or it can be NA because there is no match between the data values and limits. The code doesn't discriminate between these two cases.

clauswilke commented 1 month ago

If it shouldn't, then na.translate should be set to FALSE.

According to the documentation, na.translate is about missing data values, not about colors that have been specified as NAs. Again, I don't think this is worth fixing, but I continue to agree the behavior is not quite correct. The simplest fix might be to just tell people in the documentation to not specify colors as NA but instead use "transparent".

clauswilke commented 1 month ago

(Simple thought experiment: I might want to translate NAs in the data into na.value and also use transparency for some other, non-missing data value. Clearly in this case I need na.translate = TRUE.)

teunbrand commented 1 month ago

The docs of na.translate just read 'missing values' without specifying 'data values' or 'mapped values'. I'm not entirely convinced that the current behaviour is wrong: people might also be setting NAs in palettes for the purpose of defaulting to na.value and guessing user intent is hard from the available information. I think adjusting the docs to not recommend colour = NA but use colour = "transparent" instead is perhaps the least invasive method of resolving this.

clauswilke commented 1 month ago

Here is an example to show my thought experiment is not completely contrived. The only working solution is to specify the transparent color as "transparent". Either way I think we agree the docs should be changed.

library(tidyverse)
library(sf)
#> Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE

texas_income <- readRDS(url("https://wilkelab.org/SDS375/datasets/Texas_income.rds"))

p <- texas_income %>% 
  mutate(
    top_ten = if_else(is.na(median_income), NA, rank(desc(median_income)) <= 10)
  ) %>% 
  ggplot(aes(fill = top_ten)) + 
  geom_sf() +
  theme_bw()

# works
p +
  scale_fill_manual(
    breaks = c(TRUE, FALSE, NA),
    labels = c("top 10 income", "other", "no data"),
    values = c(`TRUE` = "#D55E00", `FALSE` = "transparent")
  )


# doesn't work
p +
  scale_fill_manual(
    breaks = c(TRUE, FALSE, NA),
    labels = c("top 10 income", "other", "no data"),
    values = c(`TRUE` = "#D55E00", `FALSE` = NA)
  )


# doesn't work either
p +
  scale_fill_manual(
    breaks = c(TRUE, FALSE, NA),
    labels = c("top 10 income", "other", "no data"),
    values = c(`TRUE` = "#D55E00", `FALSE` = NA),
    na.translate = FALSE
  )

Created on 2024-06-03 with reprex v2.0.2

binkleym commented 1 month ago

I think adjusting the docs to not recommend colour = NA but use colour = "transparent" instead is perhaps the least invasive method of resolving this.

My problem is already solved, but for other R users who might hit this issue because the default behavior has changed, I still think printing something that lets them know NA's are being overridden is the right thing. Changing documentation doesn't change existing code bases, nor does it inform users used to the older behavior that things have changed.

By letting users know what's changed and how to fix it, you get the new behavior, and they don't spend hours debugging. It's win-win for everyone.

clauswilke commented 1 month ago

@binkleym I suspect that's a completely different issue. Either way, can you use my example to make a simple reprex?

binkleym commented 1 month ago

Life suddenly got busy, but I will as soon as I can.