tidyverse / funs

Collection of low-level functions for working with vctrs
Other
34 stars 7 forks source link

rthis() #8

Open hadley opened 8 years ago

hadley commented 8 years ago

From @jennybc:

guess you could call it rthis(), in the spirit of rnorm() et al. where input is a numeric vector of observed data. Then it generates n observations from some reasonable def'n of the empirical distribution. It could literally resample or do convex bootstrap or fit a kernel density estimate, etc.

jennybc commented 7 years ago

My archaeological dig has yielded a really simple version of rthis(). It implements a smooth bootstrap with Gaussian kernel. It's described many places, including by Silverman in Density Estimation for Statistics and Data Analysis (1986) in section 6.4.1 "Simulating from density estimates". I included a bit of visual and numerical exploration as a sanity check.

library(ggplot2)
library(tibble)
library(purrr)
library(dplyr)
library(tidyr)

rthis <- function(n, data, h) {
  if (missing(n)) {
    n <- length(data)
  }
  if (missing(h)) {
    h <- bw.nrd(data) ## or use one of the other bw selectors??
  }
  xbar <- mean(data)
  sigsq <- var(data)
  ystar <- sample(data - xbar, size = n, replace = TRUE)
  epsilon <- rnorm(n)
  denom <- sqrt(1 + h^2 / sigsq)
  xbar + (ystar + h * epsilon) / denom
}

## try it with old faithful data
fdur <- faithful$eruptions

B <- 100
## can also play with h like so: h = bw.nrd(fdur) * 1.3)
df <- rerun(B, rthis(data = fdur)) %>%
  set_names(as.character(seq_len(B))) %>%
  as_tibble() %>%
  add_column(fdur, .before = 1) %>%
  gather(key = "src", value = "duration")

for_gg <- df %>%
  filter(src %in% c("fdur", as.character(1:5)))
ggplot(for_gg, aes(duration)) + geom_freqpoly() +
  geom_rug() + facet_wrap(~ src)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


summ_stats <- df %>%
  filter(src != "fdur") %>%
  group_by(src) %>%
  summarise(mean = mean(duration), var = var(duration))
tribble(
   ~stat,    ~actual,           ~sample_avg,
  "mean", mean(fdur), mean(summ_stats$mean),
   "var",  var(fdur),  mean(summ_stats$var)
)
#> # A tibble: 2 × 3
#>    stat   actual sample_avg
#>   <chr>    <dbl>      <dbl>
#> 1  mean 3.487783   3.495127
#> 2   var 1.302728   1.312251