rmcelreath / rethinking

Statistical Rethinking course and book package
2.16k stars 605 forks source link

sim() on ulam fit object: n argument not working, always returning 1000 samples #339

Open gsverhoeven opened 2 years ago

gsverhoeven commented 2 years ago

Hi there,

We are currently working our way through the 2nd edition book, and went along with the transition to cmdstanr to fit the models. So I have R 4.1 installed, together with the current Github versions of cmdstanrand rethinking.

The problem is that after fitting a model with ulam(), when calling sim() on the object it always return 1000 samples, irrespective of the value of the n argument.

I looked a bit under the hood at the package source code, and discovered that sim() on a model fit using ulam() actually calls the sim_ulam_new() function. Here comes the weird part: if I call rethinking::sim_ulam_new() the n argument works, but if I call rethinking::sim() it does not, and always gives me 1000 samples. Below is a minimal reproducible example, I took the code from the ulam() help file.

Any idea what is going on here?

# MVE sim problem

```{r}
library(rethinking)
data(chimpanzees)

# don't want any variables with NAs
# also recode condition to an index {1,0} -> {1,2}
d <- list( 
    pulled_left = chimpanzees$pulled_left ,
    prosoc_left = chimpanzees$prosoc_left ,
    condition = as.integer( 2 - chimpanzees$condition ) ,
    actor = as.integer( chimpanzees$actor ) ,
    blockid = as.integer( chimpanzees$block )
)

# simple logistic regression
m1 <- ulam(
    alist(
        pulled_left ~ bernoulli(theta),
        logit(theta) <- a + bp[condition]*prosoc_left  ,
        a ~ normal(0,4),
        bp[condition] ~ normal(0,1)
    ) ,
    data=d, chains=2, cores=1 , sample=TRUE )

Loading required package: rstan Loading required package: StanHeaders Loading required package: ggplot2 rstan (Version 2.21.3, GitRev: 2e1f913d3ca3) For execution on a local, multicore CPU with excess RAM we recommend calling options(mc.cores = parallel::detectCores()). To avoid recompilation of unchanged Stan programs, we recommend calling rstan_options(auto_write = TRUE) Loading required package: cmdstanr This is cmdstanr version 0.4.0

Attaching package: ‘rethinking’

The following object is masked from ‘package:rstan’:

stan

The following object is masked from ‘package:stats’:

rstudent

Compiling Stan program... Running MCMC with 2 sequential chains, with 1 thread(s) per chain...

Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup) Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup) Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup) Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup) Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup) Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup) Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling) Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling) Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling) Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling) Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling) Chain 1 Iteration: 1000 / 1000 [100%] (Sampling) Chain 1 finished in 1.7 seconds. Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup) Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup) Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup) Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup) Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup) Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup) Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling) Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling) Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling) Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling) Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling) Chain 2 Iteration: 1000 / 1000 [100%] (Sampling) Chain 2 finished in 0.9 seconds.

Both chains finished successfully. Mean chain execution time: 1.3 seconds. Total execution time: 3.0 seconds. Registered S3 method overwritten by 'data.table': method from print.data.table

calling sim():

dim(sim(m1, n = 10))

[1] 1000 504

dim(sim_ulam_new(m1, n = 10))

[1] 10 504

isosphere commented 2 years ago

I have the same issue. Try increasing the number of iterations in your ulam call with iter=N. I haven't dug into the code, but I think sim is silently stopping us from sampling more than we have chains to sample from (chains times iter minus warmup).

I had the same issue using the post <- extract.samples(..., n=10000); sim(..., post=post) approach, too.

Using sim_ulam_new did not address the issue for me until I increased the number of iterations in my ulam call.

isosphere commented 2 years ago

I asked Richard about this in lecture and this is intentional, gotta run your chains longer.

gsverhoeven commented 2 years ago

Hi Matthew, thanks for responding! I am afraid i did not make my issue clear enough.

Your issue is wanting to simulate more outcomes than there are draws from the posterior.

My problem is that sim() is completely unresponsive to the n = argument. If i remember correctly, i had in fact ran my chains longer (2000 samples), but it still gave me 1000 samples/outcomes. Then I tried the other way, so asking for less samples (see the example above where I set n = 10), and it still gives me 1000 samples.

So I think we are facing different issues.

Regards, Gertjan