stan-dev / rstanarm

rstanarm R package for Bayesian applied regression modeling
https://mc-stan.org/rstanarm
GNU General Public License v3.0
385 stars 132 forks source link

Priors don't affect result #539

Open LasWin opened 3 years ago

LasWin commented 3 years ago

This is more likely due to my incompetence rather than an actual issue with the library but I've noticed that no matter how restrictive priors I give to stan_glm (my use case is logistic regression), the resulting model doesn't differ from a regular frequentist logistic regression. If I use either core rstan or blmr, I am able to influence the resulting model using informative priors.

My use case is basically that I have data based on random sampling which is outdated and newer data which has heavy selection bias and I aim to use the data based on random sampling to create a prior for intercept that should optimally combat the selection bias (I also use propensity scores but that's another story)

Below is a code with toy data with one predictor variable. Essentially same model is built four times, one using the regular glm-function and one Bayesian version each with rstan, rstanarm and blmr.

`library(rstanarm) library(rstan) library(brms)

create prior data

random_y <- rbinom(3000, 1, prob=0.07)

define prior mean

mu <- prop.table(table(random_y))[2]

data for modeling

y <- rbinom(5000, 1, prob=0.35)

x <- rnorm(5000, mean=y, sd=2)

df <- data.frame(x, y)

new observations for prediction

newObs <- data.frame(rnorm(100, mean=0.5, sd=3))

names(newObs)[1] <- "x"

frequentist logistic regression

fit1 <- glm(y ~ x, data = df, family="binomial")

bayesian logistic regression using rstanarm

fit2 <- stan_glm(y ~ x, data=df, family="binomial", prior_intercept = student_t(location=log(mu/(1-mu)), scale=0.05), prior=student_t(location=0, scale=0.5))

with Stan

stanMod <- "data {

int N; // number of trials int N_test; // number of trials test data int y [N]; // number of hits vector [N] x; // predictive variable vector[N_test] x_test;

}

parameters {

// The (unobserved) model parameters that we want to recover real alpha; real beta;

}

model {

// A logistic regression model relating the x to y y ~ bernoulli_logit(alpha + beta * x);

// Prior models for the unobserved parameters alpha ~ normal(log(0.07/(1-0.07)), 0.005); beta ~ normal(0, 0.5);

}

generated quantities { vector[N_test] y_test; for(i in 1:N_test) { y_test[i] = alpha + beta*x_test[i]; } }"

StanDat <- list(N=nrow(df),x=df$x,y=df$y, N_test=nrow(newObs),x_test=newObs$x)

fit3 <- stan(model_code = stanMod, data=StanDat, chains=4)

with brms

brmPriors <- c(set_prior("normal(0,0.5)", class = "b", coef = "x"), set_prior("normal(log(0.07/(1-0.07)),0.05", class = "Intercept"))

fit4 <- brm(y~x, data=df, family = "bernoulli", prior=brmPriors)

results

coef(fit1) coef(fit2) mean(extractStan$alpha) mean(extractStan$beta) fixef(fit4)`

As you can see, fit1 and fit2 (glm, rstanarm) correspond to each other, while fit3 and fit4 (rstan, blmr) are also a pair. In my mind, the rstanarm result should be similar to the rstan and blmr fits but something goes awry.

As said, there likely are errors in how I set up the model. Coming from social sciences I am not a particularly accomplished statistician even in frequentist tradition and I am pretty inexperienced in Bayesian methods.

I prefer rstanarm as an interface to Stan over others so would like to use it :)

I use R 4.0.4 on Windows 10.