Add sample quantile option(s) to calculate function?

jbourak commented 4 years ago

Hi there,

I am currently working with @ttimbers to develop a new statistical inference course for the University of British Columbia. We are planning to use the infer package in this course; however, I realized that the calculate function does not have the option to calculate a quantile for the sample statistic (using stats::quantile). This addition would be useful as it would allow us to easily calculate confidence intervals for something like an N-year flood in a similar manner as we would for a mean using infer.

Would a PR to include this feature be welcomed?

If so, I was thinking that one example use case would look something like this:

flow_maxima %>%
  specify(response = max_flow) %>%
  generate(type = "bootstrap", reps = 1000) %>%
  calculate(stat = "quantile", probs = 0.99, type = 7) %>%
  get_ci(type = "percentile", level = 0.95)

Although, as a novice, I am unsure of how one would best get around the fact that stats::quantile can return multiple quantiles, because I assume something like this wouldn't make much sense in the context of the workflow of the infer package...

... %>%
  calculate(stat = "quantile", probs = c(0.1, 0.2, 0.3), type = 7)
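
To make the concern concrete, here is a minimal base-R sketch. The vector x is made up for illustration and is not from the flow data discussed above. It shows that stats::quantile returns one value per entry in probs, so a multi-element probs would yield several statistics per bootstrap replicate rather than one.

```r
# Made-up data, for illustration only
x <- c(1, 2, 3, 4, 5)

# A single probability gives a single (named) value
one_q <- quantile(x, probs = 0.5, type = 7)

# Several probabilities give one value each, i.e. a vector of length 3
many_q <- quantile(x, probs = c(0.1, 0.2, 0.3), type = 7)

length(one_q)
length(many_q)
```

A calculate() step that expects exactly one statistic per replicate would therefore need either to reject a probs of length greater than one or to reshape the result somehow.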

I suppose one easy (but constraining) workaround would be to have several pre-set probability & type options for the stat argument such as "0.80 quantile", "0.90 quantile", "0.95 quantile", and "0.99 quantile", which could be used like so:

flow_maxima %>%
  specify(response = max_flow) %>%
  generate(type = "bootstrap", reps = 1000) %>%
  calculate(stat = "0.99 quantile") %>%
  get_ci(type = "percentile", level = 0.95)

but I think this would be sub-optimal.

cc: @vincenzocoia @Lourenzutti

echasnovski commented 4 years ago

Hello, @jbourak!

Thank you for the suggestion. Several notes:

You can compute quantiles without {infer}'s help by using ... %>% dplyr::summarise(stat = quantile(max_flow, probs = 0.9)). One downside here is that it needs to know the name of the variable you specify()ed.
We even considered supporting arbitrary functions as input to calculate(), but stumbled upon some methodological issues. #175 has a somewhat lengthy discussion. TL;DR: it doesn't play nice with hypothesize().
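
The summarise() workaround from the first note could be sketched roughly as follows. This is only an illustration under stated assumptions: flow_maxima is simulated here (the real data are not shown in this thread), the column name max_flow is taken from the example above, and the percentile interval is computed directly with stats::quantile rather than through an infer helper.

```r
library(infer)
library(dplyr)

set.seed(2020)

# Hypothetical stand-in for the flow_maxima data (simulated, not real)
flow_maxima <- tibble(max_flow = rexp(50, rate = 1 / 100))

boot_quantiles <- flow_maxima %>%
  specify(response = max_flow) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  # generate() returns data grouped by `replicate`, so summarise()
  # yields one 0.99 sample quantile per bootstrap replicate
  summarise(stat = quantile(max_flow, probs = 0.99, type = 7, names = FALSE))

# 95% percentile interval over the bootstrap distribution of the statistic
ci <- quantile(boot_quantiles$stat, probs = c(0.025, 0.975), names = FALSE)
ci
```

As the note says, the variable name max_flow has to be repeated inside summarise(), which is the main ergonomic cost compared with a calculate() option.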

In principle, this might be useful and is totally doable, but I fear possible confusion if it is used with hypothesize(). Do you think that, from a teacher's perspective, using summarize() will be a good alternative here?

jbourak commented 4 years ago

Gotcha. Adding the ability to give any function as an input was another idea we had as well, but after reading that discussion I see why you decided against it. If you think adding this specific option would cause too much confusion with hypothesize, using summarize() for the cases where we want to calculate a confidence interval for unsupported statistics should be ok!

simonpcouch commented 4 years ago

The suggestion and thoroughness are very much appreciated, @jbourak! :-)

For now, I think using summarize() will be the preferred approach. I agree that this functionality would be really nice to have, but would be better situated as a special case of some related function to calculate() (or related framework to infer, generally) that can take in arbitrary functions as input.

github-actions[bot] commented 3 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

tidymodels / infer

Add sample quantile option(s) to calculate function? #336