Open teunbrand opened 5 months ago
I'm not sure I think this is worth the breaking change to be honest... I probably miss a compelling example of when it would improve a visualisation
I don't know enough about these bandwidth estimators to make a qualified judgement.
However, this post shares some thoughts.
Consider having multimodal data.
The current default bw = "nrd0"
smooths over everything.
library(ggplot2)
set.seed(0)
df <- data.frame(x = c(
runif(20, 9, 11),
runif(10, 19, 21),
runif(20, 29, 31),
runif(30, 39, 41)
))
ggplot(df, aes(x)) +
geom_density(bw = "nrd0") +
geom_rug() +
ggtitle("bw = 'nrd0'")
Whereas bw = "sj"
preserves the multimodality.
ggplot(df, aes(x)) +
geom_density(bw = "sj") +
geom_rug() +
ggtitle("bw = 'sj'")
Counter-argument might be that if you have some exponential data, bw = "nrd0"
gives a nice smooth density.
df <- data.frame(x = rexp(1000))
ggplot(df, aes(x)) +
geom_density(bw = "nrd0", bounds = c(0, Inf)) +
geom_rug() +
ggtitle("bw = 'nrd0'")
Whereas bw = "sj"
gives a more 'wobbly' curve.
ggplot(df, aes(x)) +
geom_density(bw = "sj", bounds = c(0, Inf)) +
geom_rug() +
ggtitle("bw = 'sj'")
Created on 2024-05-20 with reprex v2.1.0
It is probably a trade-off that I cannot motivate very well. Maybe we can bat-signal @mjskay to see if we can get him to share his opinion.
I could perhaps dig up some examples if there is interest. I changed the ggdist default to the equivalent of bw.SJ(..., method = "dpi")
around the time I added automatic bounds detection, because I found it tended to do better with some bounded distributions (e.g. log normal, where nrd0 with the reflection method doesn't always have a small enough bandwidth to get the mode right). I started a more extensive evaluation of bandwidth estimators at some point but haven't had a chance to finish it.
For example
set.seed(20240522)
data.frame(x = rlnorm(10000)) |>
ggplot() +
geom_density(aes(x, color = "sj"), bw = "sj", bounds = c(0, Inf), n = 1000) +
geom_density(aes(x, color = "nrd0"), bw = "nrd0", bounds = c(0, Inf), n = 1000) +
geom_function(fun = dlnorm, n = 1000, linetype = "22") +
coord_cartesian(xlim = c(0,5))
It is worth pointing out that nrd0
is an easier-to-use default because the other bandwidth estimators may fail in some cases, particularly data with many duplicates. e.g.:
density(rep(1, 10), bw = "SJ")
## Error in bw.SJ(x, method = "ste") : sample is too sparse to find TD
ggdist handles this by falling back to nrd0
with a warning that the user probably shouldn't be using a kde on discrete data anyway (because, well, they shouldn't be...).
Thanks @mjskay, I appreciate your thoughts on the matter. I tend to agree with you and the authors of ?density
that it probably is a better default than nrd0
. Maybe this PR is worth reviving, but instead of changing the default bandwidth estimator, which might cause too much of a kerfuffle with users, ggplot2 could set a global option for it so it is easier to changes across density and violin layers simultaneously.
This PR aims to fix #3825.
Briefly,
?density
recommends the"sj"
method over the"nrd0"
default. This change propagates that recommendation to ggplot2's density calculations.Some details:
calc_bw()
andprecompute_bw()
were doing the exact same task, so I merged these.var(x) == 0
, most of thestats::bw.*()
functions throw an error, so we keepbw.nrd0()
for these degenerate cases because it will not throw an error.