grantmcdermott / tinyplot

Lightweight extension of the base R graphics system
https://grantmcdermott.com/tinyplot
Apache License 2.0
211 stars 7 forks source link

density method #18

Closed grantmcdermott closed 1 year ago

grantmcdermott commented 1 year ago

Fixes #17

Examples:

library(plot2)

# no groups
plot2(density(mtcars$mpg))


# with groups
plot2(density(iris$Sepal.Width), by = iris$Species)

Created on 2023-04-11 with reprex v2.0.2

grantmcdermott commented 1 year ago

@zeileis All tests are passing so feel free to skip over this quickly if you don't have time to review. The main thing you may (or not) want to comment on are lines 583-605. In short, to get the individual group density plots I extract the model call and then run update on each data subset. This seemed the most sensible way forward from my perspective (and appears to be working well enough). But let me know if you see potential problems, or have other suggestions.

zeileis commented 1 year ago

Disclaimer: I'm not particularly fond of plotting the density f(x | by) against x. In most situations I'm more interested in the probability f(by | x). So this is what the "conditional density plot" cdplot() (which has mostly been written by me) plots against x. And the trick is to compute f(by | x) = f(x | by) * f(by)/f(x). So I often feel that plotting f(x | by) against x is the wrong way around.

Concerns: Having said that, I see two potential problems. The first is easier to address than the second, I think.

Example:


## convenience function returning a density
log_density <- function(x) density(log(x))

## example data
y <- 1:10
by <- rep(1:2, each = 5)

## set up density and try to plot
d <- log_density(y)
plot2(d, by = by) ## Error in log(x) : non-numeric argument to mathematical function

## or even worse: this would be completely misleading if you had eval(..., envir = parent.frame())
x <- by
plot2(d, by = x)
grantmcdermott commented 1 year ago

These are all great comments @zeileis. Some quick responses:

  1. cdplot: It would be super if we could support cdplot. This strikes me as an obvious candidate for faceting once the internals of #12 are resolved.
  2. bandwidth: Fully agree with you about this. I'll fix it shortly. (I think we still need update instead of density to capture the other arguments like "kernel" and ..., though right? That, or some manual implementation of what update does internally.)
  3. environment: Yeah, this is tricky because of the limitations of density itself (as you mention). Given your third example, I'm reluctant to go all in on the envir = parent.frame() passing. I might just add a tryCatch error message in there.

P.S. I'm not sure about the plo2t(x ~ by, type = "density") syntax because it breaks the | by API the we have elsewhere. But I might be persuaded by a one-side formula version, i.e. plot2(~ x | by, type = "density"). Will think on this some more.

zeileis commented 1 year ago

Thanks for the follow-up!

  1. cdplot: OK, will work on that hopefully today or tomorrow. The week was rather busy.
  2. Bandwidth/kernel: You are right. I thought that the kernel was provided explicitly in the "density" object but it isn't. So update() is the best we can do. Of course, the kernel argument might also come from somewhere where update() cannot find it without the right environment.
  3. Environment: I think that envir = parent.frame() is the only consistent choice because from my reading this is what update() does. Otherwise you might search for the x argument in a different environment than for the other variables (bandwidth, kernel, etc.). I can try to cook up an example if you need more convincing ;-)
  4. Formula: My gut feeling was that a one-sided formula is more confusing but you are probably right that it is more consistent. We still might support a type argument there to choose between histogram and density.
  5. Area shading: I think it might be a nice touch to optionally add some semi-transparent area shading below the density or histogram line. The area might also be opaque if we order the by groups (e.g., by median) from highest to lowest, which would yield a kind of ridge lines plot.
grantmcdermott commented 1 year ago

I'm convinced ;-)

I believe that my latest changes address the main shortcomings, so feel free to "Squash and merge" if you agree.

Quickly on 4 and 5, which I'll leave for a future set of PRs.

  1. Supporting extra type arguments is a cool a idea. We should look at supporting type = "hist" too.
  2. Yeah, this would be great. I need to figure out better support for polygons in general, though.

EDIT: forgot to include an example of the updated grouped density plot (i.e., with the joint bandwidth calc).

library(plot2)
plot2(density(iris$Sepal.Width), by = iris$Species)

Created on 2023-04-13 with reprex v2.0.2

zeileis commented 1 year ago

Looks good, thanks. I'll post a wishlist item for 4/5.