tidyverse / ggplot2

An implementation of the Grammar of Graphics in R
https://ggplot2.tidyverse.org
Other
6.48k stars 2.02k forks source link

Change density default `bw = "nrd0"` to `bw = "sj"` #5854

Open teunbrand opened 5 months ago

teunbrand commented 5 months ago

This PR aims to fix #3825.

Briefly, ?density recommends the "sj" method over the "nrd0" default. This change propagates that recommendation to ggplot2's density calculations.

Some details:

thomasp85 commented 4 months ago

I'm not sure I think this is worth the breaking change to be honest... I probably miss a compelling example of when it would improve a visualisation

teunbrand commented 4 months ago

I don't know enough about these bandwidth estimators to make a qualified judgement.

However, this post shares some thoughts. Consider having multimodal data. The current default bw = "nrd0" smooths over everything.

library(ggplot2)
set.seed(0)

df <- data.frame(x = c(
  runif(20, 9, 11),
  runif(10, 19, 21),
  runif(20, 29, 31),
  runif(30, 39, 41)
))

ggplot(df, aes(x)) +
  geom_density(bw = "nrd0") +
  geom_rug() +
  ggtitle("bw = 'nrd0'")

Whereas bw = "sj" preserves the multimodality.

ggplot(df, aes(x)) +
  geom_density(bw = "sj") +
  geom_rug() +
  ggtitle("bw = 'sj'")

Counter-argument might be that if you have some exponential data, bw = "nrd0" gives a nice smooth density.

df <- data.frame(x = rexp(1000))

ggplot(df, aes(x)) +
  geom_density(bw = "nrd0", bounds = c(0, Inf)) +
  geom_rug() +
  ggtitle("bw = 'nrd0'")

Whereas bw = "sj" gives a more 'wobbly' curve.

ggplot(df, aes(x)) +
  geom_density(bw = "sj", bounds = c(0, Inf)) +
  geom_rug() +
  ggtitle("bw = 'sj'")

Created on 2024-05-20 with reprex v2.1.0

It is probably a trade-off that I cannot motivate very well. Maybe we can bat-signal @mjskay to see if we can get him to share his opinion.

mjskay commented 4 months ago

I could perhaps dig up some examples if there is interest. I changed the ggdist default to the equivalent of bw.SJ(..., method = "dpi") around the time I added automatic bounds detection, because I found it tended to do better with some bounded distributions (e.g. log normal, where nrd0 with the reflection method doesn't always have a small enough bandwidth to get the mode right). I started a more extensive evaluation of bandwidth estimators at some point but haven't had a chance to finish it.

mjskay commented 4 months ago

For example

set.seed(20240522)

data.frame(x = rlnorm(10000)) |>
    ggplot() +
    geom_density(aes(x, color = "sj"), bw = "sj", bounds = c(0, Inf), n = 1000) +
    geom_density(aes(x, color = "nrd0"), bw = "nrd0", bounds = c(0, Inf), n = 1000) +
    geom_function(fun = dlnorm, n = 1000, linetype = "22") +
    coord_cartesian(xlim = c(0,5))

image

It is worth pointing out that nrd0 is an easier-to-use default because the other bandwidth estimators may fail in some cases, particularly data with many duplicates. e.g.:

density(rep(1, 10), bw = "SJ")
## Error in bw.SJ(x, method = "ste") : sample is too sparse to find TD

ggdist handles this by falling back to nrd0 with a warning that the user probably shouldn't be using a kde on discrete data anyway (because, well, they shouldn't be...).

teunbrand commented 4 months ago

Thanks @mjskay, I appreciate your thoughts on the matter. I tend to agree with you and the authors of ?density that it probably is a better default than nrd0. Maybe this PR is worth reviving, but instead of changing the default bandwidth estimator, which might cause too much of a kerfuffle with users, ggplot2 could set a global option for it so it is easier to changes across density and violin layers simultaneously.