stan-dev / rstanarm

rstanarm R package for Bayesian applied regression modeling
https://mc-stan.org/rstanarm
GNU General Public License v3.0

survival: different results with R `v3.6.3` vs `v4.2.0`, but same `rstanarm` version and seed #582

Open anddis opened 1 year ago

anddis commented 1 year ago

Summary:

Same rstanarm version from feature/survival (version 2.21.2), same seed, different results between R 3.6.3 and 4.2.0.

Description:

As per the summary, I have the same version of rstanarm from feature/survival installed on R versions 3.6.3 and 4.2.0. Even with the same seed, I get different draws from the posterior distribution.

The two versions of rstanarm were not installed at the same time. Could it be that some changes to the code potentially related to this issue were introduced without the rstanarm version number being updated? Otherwise, do you have any idea about what could be causing the different results?

But above all: do you know how I can obtain consistent results between R 3.6.3 and 4.2.0?

Output:

(reproducible code below)

R 3.6.3 + rstanarm 2.21.2

stan_surv
 baseline hazard: weibull
 formula:         Surv(futime, death) ~ mspike
 observations:    241
 events:          225 (93.4%)
 right censored:  16 (6.6%)
 delayed entry:   no
------
              Median    MAD_SD    exp(Median)
(Intercept)   -10.15258   0.62484        NA  
mspike         -0.12206   0.17364   0.88510  
weibull-shape   1.18821   0.06543        NA  

R 4.2.0 + rstanarm 2.21.2

stan_surv
 baseline hazard: weibull
 formula:         Surv(futime, death) ~ mspike
 observations:    241
 events:          225 (93.4%)
 right censored:  16 (6.6%)
 delayed entry:   no
------
              Median    MAD_SD    exp(Median)
(Intercept)   -10.14793   0.74071        NA  
mspike         -0.11830   0.18188   0.88843  
weibull-shape   1.19225   0.07253        NA 

Reproducible Code:

library(survival)
library(rstanarm)

surv <- stan_surv(Surv(futime, death) ~ mspike, 
                  data = mgus, 
                  basehaz = "weibull",
                  chains = 1, 
                  seed = 1234)

print(surv, digits = 5)

RStanARM Version:

2.21.2

R Version:

3.6.3 and 4.2.0

Operating System:

Mac OS 10.14.6

anddis commented 1 year ago

Update: with basehaz = "exp" or basehaz = "weibull-aft" I get identical results between R 3.6.3 and 4.2.0.

Were the functions for the Weibull distribution with PH parametrisation rewritten/modified at some point?

sambrilleman commented 1 year ago

Hi @anddis - sorry to hear this, esp. if the lack of reproducibility is causing you a nuisance!

Looking at the commit history, I don't think we've pushed changes to the feature/survival branch in about 9 months, so I doubt it is related to any refactoring (although I'm not certain - e.g. I'm not sure exactly how that timeline lines up with the R 3.6.3 and 4.2.0 releases).

Perhaps this is a little related to #556, which you reported (and which nobody has had time to investigate, unfortunately!).

Similar to the other issue, the fact that weibull-aft gives consistent results but weibull doesn't does feel a bit worrying. The sad truth is that I'm not sure I can find the time to truly sit down and debug this at the moment, unfortunately. I think we'd have to write a unit test to try to identify the discrepancy across the different baseline hazards. Perhaps someone else involved with the repo here might be able to assist. 🤞 😞

jburos commented 1 year ago

Hi @anddis,

This issue came up in a recent discussion with @bgoodri, who mentioned that exact reproducibility isn't guaranteed by rstan across R versions, even for the same data + model code + seed. Many things can change the results across versions of R.

Personally, if I need exact reproducibility, I package the R environment + code in a Docker image.
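For example, a minimal Dockerfile sketch along those lines, using the rocker project's version-pinned R images (the branch ref and script name below are assumptions - adjust to your setup):

```dockerfile
# Pin the exact R version via the rocker project's versioned images
FROM rocker/r-ver:4.2.0

# Install the feature/survival build of rstanarm (ref is an assumption)
RUN R -e 'install.packages("remotes")' \
 && R -e 'remotes::install_github("stan-dev/rstanarm", ref = "feature/survival")'

# analysis.R is a placeholder for your own script
COPY analysis.R /analysis.R
CMD ["Rscript", "/analysis.R"]
```

Rebuilding and rerunning from this image then gives the same R version, package versions, and toolchain every time, which removes most sources of version-to-version drift.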

However, we've seen other reproducibility issues (e.g. #556 ) that may be at play here. It's hard for me to tell from your example whether these are substantive differences, or whether they fall within ordinary chain-to-chain variability. Do you have thoughts on this, perhaps from non-reproducible examples?

Otherwise I think the next step would be to run your example here with more chains to see if the results are more similar on average.
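A minimal sketch of that comparison (assuming the feature/survival build of rstanarm is installed; the file names are placeholders): refit with more chains on each R version, save the draws, and compare the summaries.

```r
library(survival)
library(rstanarm)

# Refit with more chains so chain-to-chain variability can be judged
surv4 <- stan_surv(Surv(futime, death) ~ mspike,
                   data    = mgus,
                   basehaz = "weibull",
                   chains  = 4,
                   seed    = 1234)

# Save the posterior draws, tagged by R version, for later comparison
draws <- as.matrix(surv4)  # iterations x parameters
saveRDS(draws, paste0("draws_R", getRversion(), ".rds"))

# Then, with both files on one machine:
# d1 <- readRDS("draws_R3.6.3.rds")
# d2 <- readRDS("draws_R4.2.0.rds")
# abs(colMeans(d1) - colMeans(d2))  # per-parameter difference in posterior means
```

If the per-parameter differences are small relative to the MAD_SD values in the printed output, the discrepancy is plausibly within Monte Carlo error rather than a substantive bug.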

anddis commented 1 year ago

Hello Sam and Jacki. Thank you for your replies.

@jburos: I agree that the ideal solution is a Docker image. I also understand that exact reproducibility across R versions is not guaranteed. However, I'm surprised that results are in fact reproducible between R 3.6.3 and R 4.2.0 for some of the distributions/parametrisations but not for others (see my second post in this thread). But I know too little about the internal workings of stan_surv and rstan to be able to speculate on why this happens.

Issue https://github.com/stan-dev/rstanarm/issues/556#issue-1052153445 seems more concerning to me. The R version is the same; the only thing that changes is the order of the observations. Plus, as you can see from the plot in my second post (sorry for the low quality of the image), the posterior draws are identical up to iteration ≈500 and then start to diverge. And again, results are identical regardless of the ordering of the observations for some distributions/parametrisations but not for others.

jburos commented 1 year ago

Thanks @anddis - I agree this is concerning, and I wanted to thank you for the reproducible examples. It's on our radar.

sambrilleman commented 1 year ago

Yeah I agree the fact the behaviour differs by distribution is odd and worrying.

Although, I had missed the fact that the draws are identical during warmup and differ post-warmup. That feels really odd! I can't recall anything in stan_surv where we have differential behaviour for warmup vs post-warmup conditional on the distribution - I thought the warmup vs post-warmup behaviour was all outsourced to rstan. But maybe I'm forgetting.