tidyverse / ggplot2

An implementation of the Grammar of Graphics in R
https://ggplot2.tidyverse.org
Other
6.48k stars 2.02k forks source link

Documentation of stat_ellipse() is misleading #2776

Closed MJochim closed 5 years ago

MJochim commented 6 years ago

According to its documentation, stat_ellipse() creates “normal confidence ellipses.” These ellipses can be configured by providing a confidence level as the parameter level.

This leads me to believe that stat_ellipse() creates a confidence ellipse, in the sense that it is the bivariate equivalent of a confidence interval. If this were the case, I would expect the ellipse to shrink with growing sample sizes (as do confidence intervals).

However, I do not observe this shrinking behavior (see code example below).

The documentation does contain a reference to ?car::ellipse, which “draw[s] ellipses, including data ellipses, and confidence ellipses.” To the best of my understanding, what stat_ellipse() produces is a data ellipse, not a confidence ellipse.

I must admit, though, that I am not a statistician and I am not sure if there is a standard way to calculate such confidence ellipses; so my expectation to have one drawn very easily might be a bit off. Plus, I hope to understand the concepts of data and confidence ellipse right – I did my best to read up on them for a couple of days. Still, I think that the documentation needs clarification about what kind of ellipse stat_ellipse() actually produces.

Thanks,

Markus

## This is to demonstrate that the ellipse drawn by ggplot2::stat_ellipse() does
## not correlate with sample size.

library(tidyverse)

for (my_sample_size in c(5, 20, 400, 1000)) {
  # Construct the 2x2 covariance matrix needed to draw a bivariate sample
  my_covariance_matrix = matrix(nrow = 2,
                                ncol = 2,
                                data = c(10^2,    0, # SD of parameter 1 is 10
                                            0, 10^2) # SD of parameter 2 is 10
                                # the 0's indicate that there is no correlation between parameters 1 and 2.
  )

  # Draw the bivariate sample
  my_sample =
    MASS::mvrnorm(n = my_sample_size,
                  mu = c(100, 100), # mean of parameters 1 and 2, respectively
                  Sigma = my_covariance_matrix) %>%
    as_tibble()

  # Plot the sample in a fixed coordinate system with stat_ellipse()
  print(
    ggplot(my_sample) + 
    ggtitle(paste("Sample size", my_sample_size)) +
    aes (y = V2, x = V1) +
    geom_point() +
    stat_ellipse() +
    coord_cartesian(xlim = c(50, 150), ylim = c(50, 150))
  )
}
batpigandme commented 6 years ago

Above code rendered using reprex for ease of discussion:

library(tidyverse)

for (my_sample_size in c(5, 20, 400, 1000)) {
  # Construct the 2x2 covariance matrix needed to draw a bivariate sample
  my_covariance_matrix = matrix(nrow = 2,
                                ncol = 2,
                                data = c(10^2,    0, # SD of parameter 1 is 10
                                         0, 10^2) # SD of parameter 2 is 10
                                # the 0's indicate that there is no correlation between parameters 1 and 2.
  )

  # Draw the bivariate sample
  my_sample =
    MASS::mvrnorm(n = my_sample_size,
                  mu = c(100, 100), # mean of parameters 1 and 2, respectively
                  Sigma = my_covariance_matrix) %>%
    as_tibble()

  # Plot the sample in a fixed coordinate system with stat_ellipse()
  print(
    ggplot(my_sample) + 
      ggtitle(paste("Sample size", my_sample_size)) +
      aes (y = V2, x = V1) +
      geom_point() +
      stat_ellipse() +
      coord_cartesian(xlim = c(50, 150), ylim = c(50, 150))
  )
}

Created on 2018-07-26 by the reprex package (v0.2.0.9000).

leonardblaschek commented 5 years ago

I would like to 'bump' this issue (sorry for my illiteracy in github etiquette), as I agree that this is quite misleading. If Markus is right (which I think he is) the description should clarify that these are in fact data ellipses, not confidence ellipses. Additionally, it would be great to add a functionality to calculate actual confidence ellipses. Here is a link that filled some gaps for me: https://arxiv.org/pdf/1302.4881.pdf

ptoche commented 5 years ago

The default stat_ellipse() is intended to compute a bivariate confidence interval assuming a Student-t distribution. This is stated at the top of the code. Quote:

#' Compute normal confidence ellipses
#'
#' The method for calculating the ellipses has been modified from
#' `car::ellipse` (Fox and Weisberg, 2011)
#'
#' @references John Fox and Sanford Weisberg (2011). An {R} Companion to
#'   Applied Regression, Second Edition. Thousand Oaks CA: Sage. URL:
#'   \url{http://socserv.socsci.mcmaster.ca/jfox/Books/Companion}
#' @param level The confidence level at which to draw an ellipse (default is 0.95),
#'   or, if `type="euclid"`, the radius of the circle to be drawn.
#' @param type The type of ellipse.
#'   The default `"t"` assumes a multivariate t-distribution, and
#'   `"norm"` assumes a multivariate normal distribution.
#'   `"euclid"` draws a circle with the radius equal to `level`,
#'   representing the euclidean distance from the center.
#'   This ellipse probably won't appear circular unless `coord_fixed()` is applied.

John Fox is also the author of the car library. The car library manual can be consulted for further information. That library provides an ellipse function, a dataEllipse() and a confidenceEllipse().

Judging from the intent stat_ellipse() should calculate confidence ellipses. From a quick read of the manual, it looks like the key difference is that the confidence ellipse takes a regression model as an argument. I'm guessing that, in this case, the width of the confidence interval --- the radius of the ellipse --- is based on the standard errors computed by the model lm(). Can stat_ellipse() not do that too if you pass a model?

Edit: As discussed below, the documentation is indeed misleading and you cannot currently pass a linear model to the stat_ellipse() function. I explore below how one might extend the function to add the functionality.

clauswilke commented 5 years ago

Yes, stat_ellipse() definitely draws a data ellipse, not a confidence ellipse. I'm not sure why it uses the multivariate t distribution as default, car::dataEllipse() does not (it uses multivariate normal by default and calls multivariate t the "robust" case). I think the simplest change would be to make the heading of the documentation page read "Compute normal data ellipses" and in the body replace car::ellipse() with car::dataEllipse(). Maybe somebody can prepare a PR that accomplishes this?

ptoche commented 5 years ago

I've had a bit more time to look into it. stat_ellipse() draws a data ellipse. The code is borrowed from the car library. It does not look very difficult to extend stat_ellipse() to draw confidence ellipses. Below I edited stat-ellipse.R to accept a formula to pass linear models and to accept type=confidence in addition to the other three types. The ellipses are produced, but don't look right. I was about to give up when I noticed Claus's message. I decided to post my efforts below and hope someone can take it the extra step.

MJochim commented 5 years ago

Cool, thanks for all your efforts. I'm willing to contribute here if I can; but now that Patrick is on to something, we should coordinate this, lest we have two competing pull requests ;).

So there are two things here, one is the documentation change that Claus outlined, and another is Patrick's approach to implement actual confidence ellipses. I can definitely do the docs PR. I can take a look tomorrow to see if I can contribute anything on Patrick's side, too. Mind you, however, I know more about coding than about statistics ;-).

ptoche commented 5 years ago

@MJochim ,

I have forked/cloned the ggplot2 repository here.

Run example:

ggplot(data = mtcars, aes(x = mpg, y = wt)) + 
    geom_point() +
    stat_ellipse_model(type = "confidence", formula = y ~ x, color = "red", size = 2) 
clauswilke commented 5 years ago

@ptoche You need to fork the repo before you can work with it/create PRs: https://guides.github.com/activities/forking/

It's also a good idea to create a branch for every issue you work on, because otherwise you can only ever work on one issue at a time: https://guides.github.com/introduction/flow/

ptoche commented 5 years ago

Thanks Claus. Got it, I missed the forking part. Total noob as my son calls me.

I've edited the instructions above and deleted the unnecessary code I had pasted into the comment boxes. I have forked/cloned the ggplot2 repository here, edited the code and pushed it. How do I proceed from here?

Do I need to "Make a Pull Request" as explained in the document you linked to?

Edit: Here are the steps I have followed: After forking/cloning/editing, I built the package, installed it and loaded it, with:

# 1. build the package
devtools::build("~/Documents/GitHub/ggplot2", path = "~/Documents/Library")

# 2. install the package in local library
install.packages("~/Documents/Library/ggplot2_3.1.0.9000.tar.gz", 
    repos = NULL, type = "source", 
    lib = "~/Documents/Library") 

# 3. run the package from the local library
library(ggplot2, lib.loc = "~/Documents/Library") 

Did I get this right? It seems to work. At first I had an error, apparently caused by a missing package: "profvis"

Thanks for your guidance Claus!

Patrick.

clauswilke commented 5 years ago

For R package development, I would strongly recommend using R Studio. It allows you to build, check, test, etc. a package with one click from the "Build" menu.

A pull request is the best way to review and discuss changes, so yes, a pull request helps even if your work isn't completed and/or is eventually deemed not to be appropriate for the main code base. You can add "WIP" to the title to indicate that it's work in progress.

ptoche commented 5 years ago

Oh I haven't used RStudio's package development feature yet. Will look into it. (I do use RStudio!). Thanks again for the guidance!

https://github.com/tidyverse/ggplot2/pull/3041

leonardblaschek commented 5 years ago

Thank you for taking charge of this! I am take note of Claus' pointers as well, and maybe next time around I can contribute myself.

ptoche commented 5 years ago

I'm learning about "confidence ellipses" from

Michael Friendly and Georges Monette and John Fox, "Elliptical Insights: Understanding Statistical Methods Through Elliptical Geometry", Statistical Science, Vol. 28 (1), 2013, pages 1-39.

https://projecteuclid.org/download/pdfview_1/euclid.ss/1359468407

Probably wise to have the stat_ellipse() function provide a single basic "confidence ellipse" and the ability to tweak it via the basic parameters, which I think are just center, shape (the variance/covariance matrix), and the radius (depending on assumptions and regression models, different center/shape/radii may apply, as shown in the article above). Probably wise to keep the computations outside of ggplot. I think Hadley said something like that about stat_smooth() when suggestions were made to extend the class of linear models to apply...

ptoche commented 5 years ago

I've discovered that a "confidence ellipse" is not what I thought it was. I assumed it would be an ellipse in the (x, y) plane for the marginal effects of a model like y ~ x + z (with z potentially multivariate). But no. Instead, it's an ellipse in the (bz, bx) plane (where bz and bx are the slope estimates), designed to illustrate the joint hypothesis bx = 0 and bz = 0. My code computes the values of the (bz, bx) ellipse, but then plots them incorrectly in the (x, y) plane...

Here Figure 12 from the article cited above:

figure12
clauswilke commented 5 years ago

Ah, so it will probably not make sense to add the calculations for a confidence ellipse to stat_ellipse(). In either case, I think it would make sense to first prepare a PR that only fixes the documentation for the current code. That is needed one way or another.

MJochim commented 5 years ago

I can offer to do that tomorrow. (that is, if Patrick has not finished it already by then - you seem very motivated :-) :+1: )

ptoche commented 5 years ago

Hi MJochim, please do go ahead with the doc PR. Thanks.

In the meantime I'll think about what to do with stat_ellipse().

I'm thinking this:

  1. Keep most of the code in stat_ellipse() unchanged, except "update" the formula to the latest formula from the car library (change from pivot = FALSE to pivot = TRUE (whatever the reasons for that change, I'll assume it is wise, as the three authors involved in writing up the code are statistics professors that have written papers on ellipses!) and add in comments to the code, which will make it easier to read and maintain. EDIT: I have pushed the suggested changes to my fork, please copy it if you like.

  2. Write a function stat_ellipse_model() with the code for the "confidence ellipse" separate from the rest. It will be more readable.

ptoche commented 5 years ago

My fork has the original code of stat_ellipse() with a change in the Cholesky formula, minor changes in notation to match library car and with small comments for clarity of reading. The default data ellipse is "normal". I did not make the change for partial matching of "norm" and "normal". I also updated the reference to the car library (current date is 2019, btw). Edit: Oh wait, from my repo, you need to remove "confidence" from the list of accepted types, left it there by mistake (re-Edit: now fixed in my repo).

Two minor points:

  1. The original code has stats::qf() instead of qf(). Is that good coding practice? (likewise with stats::cov.wt). Edit: I just read Hadley state that it's better to be explicit.

  2. Is it a good idea to make a minor adjustment so that stat_ellipse() can understand normal as well as norm? Edit: Figured it out. It's a simple as:

    type <- revalue(type, c("norm" = "normal"))

clauswilke commented 5 years ago

Is it a good idea to make a minor adjustment so that stat_ellipse() can understand normal as well as norm?

This is normally done with the match.arg() function. See the examples in its documentation for conventional use.

ptoche commented 5 years ago

Thanks Claus. That would be type <- match.arg(type, "normal")?

clauswilke commented 5 years ago

Please look at the first code example in the documentation to match.arg().

ptoche commented 5 years ago

I see thanks. I was looking at the last example!

fehtemam commented 5 years ago

Thank you so much, everyone, for working on this. Although it seems something really simple (it's just going from one dimension to two) it's surprisingly difficult to find clearly explained definitions on different types of ellipses and different people on the internet say different things. And most people tend to call any type of ellipse a confidence ellipse! I've learned so far that there are confidence, prediction, tolerance and data ellipses (also known as quantile ellipse). I still don't fully understand everything and I am not as good as you guys in math or R but I think so far I understand that what stat_ellipse is drawing is not a confidence ellipse. It should be called data or quantile ellipse. A confidence ellipse comes from the estimation of some statistics (usually mean). And confidence ellipses of estimation of the mean are much smaller than data ellipses. Before coming across this page I search a lot and found a few good Q&As. I'll leave them here so maybe it ends up being useful for other people coming to this page searching for this topic and being confused about it: https://stats.stackexchange.com/questions/217374/real-meaning-of-confidence-ellipse https://stats.stackexchange.com/questions/38117/getting-different-results-when-plotting-95-ci-ellipses-with-ggplot-or-the-ellip?rq=1 This is also a discrepancy I still don't understand but doesn't change the need for revising the stat_ellipse documentation: https://stats.stackexchange.com/questions/347864/why-is-a-confidence-ellipse-based-on-multivariate-t-tighter-than-one-base-on-mul Sorry if my post doesn't match any rules for posting here. It's my first time posting.

ptoche commented 5 years ago

@fehtemam, thanks for the links. I'm going to follow the definitions in the article by Friendly, Monette and Fox, as it's the clearest exposition I've found so far. The stackexchange discussions are difficult to use as they are partial. But if you have a published reference to "tolerance" ellipse and such, feel free to add it to your list.

fehtemam commented 5 years ago

@ptoche Oh I didn't put those links to follow for anyone who is working on stat_ellipse. Since unfortunately this page is one of the few pages where this topic is discussed for anyone using R, I thought for people who come across this page it might be helpful to learn more about it. Definitely follow the Friendly et al. article as it is a peer reviewed publication. In terms of credible sources I don't know how much SAS documentation is credible but this documentation from them discusses confidence versus prediction ellipses: https://support.sas.com/documentation/cdl/en/procstat/67528/HTML/default/viewer.htm#procstat_corr_details19.htm I'll look to see if I find anything credible on tolerance ellipses (I mostly learned from random articles on web which are not peer reviewed). Again not all of these ellipses are of interest in the chain of discussions on this page but they are relevant and it may clear up things for people like me who will end up on this page. Thanks!

ptoche commented 5 years ago

@MJochim , In the tests/testthatdirectory, the test file ought to be renamed. It's currently named test-stat-ellipsis.r. See also inside. Seems like a simple fix you can do in passing while fixing the doc.

MJochim commented 5 years ago

Will do that.

Am 21. Dezember 2018 17:29:21 MEZ schrieb Patrick Toche notifications@github.com:

@MJochim , In the tests/testthatdirectory, the test file ought to be renamed. It's currently named test-stat-ellipsis.r. See also inside. Seems like a simple fix you can do in passing while fixing the doc.

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/tidyverse/ggplot2/issues/2776#issuecomment-449434025

ptoche commented 5 years ago

@fehtemam, the SAS code and documentation was developed by the same. It looks like an earlier version of what is now in the car package. The following is an early presentation of ellipses. Very clear and "encyclopedic" in style. However, I think that the article by Friendly, Monette, and Fox is thorough enough. When I get a chance I'll think about how to make "confidence ellipses" with ggplot. It becomes clear quickly that also needed are "confidence shadows" and "regression paralellograms". But existing ggplot functions should do the trick.

Chew, Victor. “Confidence, Prediction, and Tolerance Regions for the Multivariate Normal Distribution.” Journal of the American Statistical Association, vol. 61, no. 315, 1966, pp. 605–617.

ptoche commented 5 years ago

I may not be able to put together much elliptical code until I figure out how to play with geoms and stats. I'm struggling with some basic stuff. Any suggestions on how to get started? FYI I've asked a question over at stackoverflow. I'll keep learning and hopefully I can put something together over the next few months. But don't wait for me... ;-)

clauswilke commented 5 years ago

Unfortunately there isn't a good introductory text, as far as I'm aware. I would suggest reading the source code for simple geoms and stats, such as geom_point() or stat_summary().

ptoche commented 5 years ago

Thanks Claus, that's what I've been doing. I can't seem to get passed a basic problem: accessing the parameters inside the compute_group... Perhaps it's a problem you would be able to diagnose quickly? (only if you have time!) It must be a very basic mistake. I feel that if I overcome this hurdle, the rest should be more manageable.

I have created a geom_shadows and stat_shadows, based on geom_boxplot and stat_boxplot, but with changes like boxplot --> shadows, GeomBoxplot --> GeomShadows, etc..

ggplot(data = mtcars, aes(x = mpg, y = wt)) + 
    geom_shadows() 

Warning message:
Computation failed in `stat_shadows()`:
object 'params' not found 

And indeed, if I print(params) inside compute_group, the parameters cannot be found.

In the files NAMESPACE and DESCRIPTION I have manually edited the names in.

In stat_shadows, I set geom = "shadows". And in StatShadows,

ggproto("StatShadows", Stat,
...
setup_params
...
compute_group

In geom_shadows, I have set stat = "shadows" and in GeomShadows

ggproto("GeomShadows", Geom,
...
setup_data
...
setup_params
...
draw_group

Does it look like there's something obviously wrong? I have pushed these changes to my branch https://github.com/ptoche/ggplot2.

clauswilke commented 5 years ago

It's not immediately clear to me what the problem is. One general hint, though: To develop a new geom or stat you don't need to fork ggplot2 and modify its code base. You can just develop the geom directly, in its own package or just in any .R file.

To debug your case, I suggest you start over with stat_boxplot(), just copy it and rename it, and then slowly make one change at a time and test whether it works or not.

ptoche commented 5 years ago

Thanks Claus. I've followed that strategy actually, but it's taking a long time, because I rebuild the package each time. Now I've tried without the package, but there are too many functions missing. I've copied functions like data_frame or "%||%", but I still get errros like could not find function "grobName". I don't know how to proceed without rebuilding the whole package after each edit, so I'm moving at snail's pace... Not to worry, I'll just take it slow. Besides I'm off for a couple of weeks, so I'll say thanks again and happy new year for now!

clauswilke commented 5 years ago

For missing functions, just place ggplot2::: in front, such as ggplot2:::grobName() etc.

ptoche commented 5 years ago

Thanks again. So I just spent three hours patiently taking boxplot apart. And I may have found something. When does one ever need this?

 extra_params = c("na.rm"),

Inserting my parameters in there seems to solve my immediate problem!

I found this in geom_boxplot, but there's no similar line in stat_boxplot And also, most of the parameters in geom_boxplot are not included in extra_params, without apparent problems. The proto thickens ahem I mean, the plot thickens...

clauswilke commented 5 years ago

https://github.com/tidyverse/ggplot2/blob/e9d4e5dd599b9f058cbe9230a6517f85f3587567/R/geom-.r#L131-L136

ptoche commented 5 years ago

Fantastic! I hadn't seen that one. I'm now starting to get somewhere at last: I've got the data where it's supposed to be and I can get the geom to draw. It's now mostly a matter of getting the calculations organized efficiently. I'll work on the easiest bits first: the ellipse shadows and the parallelograms. I'm not too sure I'll know how to change the plane from (x, y) to (bx, by) to draw the ellipse in parameter space, but when I get there I'll call out for help again. Progress will be slow over the next two weeks as I travel. Thanks again!

paleolimbot commented 5 years ago

There are two open PRs on this, and both appear to be abandoned. I think a statistical explanation in stat_elipse() that goes beyond referencing car::elipse() is needed, and that a more flexible Stat that uses a regression model is an excellent opportunity for an extension package.

lock[bot] commented 4 years ago

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/