merliseclyde / BAS

BAS R package for Bayesian Model Averaging and Variable Selection
https://merliseclyde.github.io/BAS/
GNU General Public License v3.0
41 stars 16 forks source link

BPM coefficients don't align with explanatory variables in BPM model - bas and pred.bas objects #49

Closed sheilasw closed 3 years ago

sheilasw commented 4 years ago

Update:

I found a way to pull the coefficients for the model. I still can't pull the correct coefficients using the method I originally asked about.

Method that works, in case it helps someone else reading this:

To confirm I had pulled the correct information, I used this information to predict the first record, and it matched the fitted value from the model for that record.

There's some matching up/data prep to be done to get ready to do the math, which I'm not including here because I'm now working on a different model than the one shown below and the code wouldn't work.

For each variable, calculate: a = (coefficient shrinkage) (observation - sample mean)

Then, Y = Intercept + sum(a)

Original post:

I am using the BAS package to do Bayesian multiple regression modeling for the Coursera course from Duke University, "Bayesian Statistics."

I am experiencing what I think is a bug. When I pull the mean values for the coefficients for the BPM, I have been assuming that:

That isn't what I'm experiencing, however.

I'm using BAS version 1.5.5 and statsr version 0.2.1 in R version 4.0.2. My OS is Mac OSX Catalina 10.15.5.

Here's sample code that reproduces what I'm seeing.

library(dplyr) library(statsr) library(BAS) ` load("movies.Rdata") # data set for modeling dsmodeling <- movies %>% select(audience_score, genre, thtr_rel_year, critics_score, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win) # bas object lmmovies <- bas.lm( formula = audience_score ~ ., data = dsmodeling, prior = "BIC", modelprior = uniform()) # pred.bas object lmBPM = predict(lmmovies, estimator = "BPM", se.fit = TRUE) ##### Comparing the bas and pred.bas objects ### pred.bas object # BPM model: its model # vmodel = lmBPM$best vmodel # explanatory variable names lmBPM$best.vars # intercept & explanatory variable #s lmBPM$bestmodel ### bas object # intercept & explanatory variable #s for [vmodel]th model # these match what I get from lmBPM$bestmodel lmmovies$which[vmodel] ### get mean values of the coefficients for the [vmodel]th model coefBPM <- coef(lmmovies)$conditionalmeans[vmodel,] coefBPM <- as.data.frame(coefBPM) colnames(coefBPM) <- "Posterior Mean Coefficient" coefBPM`

I noticed the same behavior in the example of this procedure given in the textbook for this course, the version last built on 2020-06-01, on pp. 191-192. On p. 191, the list of variables for the BPM is pulled. On p. 192, the coefficients are pulled. As an example of what I mean, the variable Pop has a non-zero coefficient, however, it is not in the list of coefficients. The variable M, on the other hand, is on the list of coefficients but has a zero coefficient.

haochencoding commented 4 years ago

Hi, many thanks for your solution! I noticed the same problem and tried to extract the coefficients by refitting the model with only those variables in BPM...

katy-sadowski commented 4 years ago

Thanks so much for raising this. It's caused lots of confusion for other students in the forum and sadly no mentor or instructor appears to have been active on Coursera for more than 2 years.

I have an alternate workaround which gets the coefficients from the coef object. I confirmed the results of my workaround match your simpler one using the mle value. But I do hope the maintainers can remove the incorrect code in the textbook and update the package to support extraction of BPM coefficients!

My workaround:

audience_score_lm_coef = coef(audience_score_lm)
audience_score_lm_coef_means = as.data.frame(audience_score_lm_coef$conditionalmeans)
audience_score_lm_coef_means %>%
  filter(feature_filmyes == 0 & dramayes == 0 & mpaa_rating_Ryes == 0 & thtr_rel_year == 0 & oscar_seasonyes == 0 & summer_seasonyes == 0 & imdb_num_votes == 0 & best_pic_statusnot_nominated == 0 & best_pic_statuswon == 0 & best_actor_winyes == 0 & best_actress_winyes == 0 & best_dir_winyes == 0 & top200_boxyes == 0 & runtime != 0 & imdb_rating != 0 & critics_score != 0)

The filter's extremely clunky; I'm sure there's a more elegant way to do this (perhaps even dynamically using audience_score_bpm$best.vars). But you get the idea - the 3 variables != 0 are the ones in the BPM per the predict function.

The "intercept" in this object is also for the centered model. I'm surprised this is never mentioned in the textbook or in the Week 4 lab as a caveat for working with this package.

merliseclyde commented 3 years ago

After some digging, I have found where the issue is. The function coef returns coefficients, conditional means, se's etc for the top n.models with the output sorted by the posterior model probabilities. Therefore the order of the models in the original bas object and the coefficient object do not align.

A short term work around is to find the order of the posterior probabilities and then extract the location for the model that is the BPM, then use that as the index for the coefficients.

BPM.rank = (1:lmmovies$n.models)[topm == vmodel]
lmmovies.coef = coef(lmmovies)
BPM.coef = lmmovies.coef$conditionalmeans[BPM.rank,]
BPM.coef  
cbind(BPM.coef[BPM.coef != 0], lmmovies$mle[[vmodel]])

Working on the code to extract just the BPM coefficients and other info.

merliseclyde commented 3 years ago

Updated code in Book Chapter to show correct way to extract coefficients under HPM, MPM and BPM StatsWithR/book#80

Kosmoshuynh commented 3 years ago

Dear Merliseclyde/Bas,

Many thanks for your email. I wonder do you and others check the INTERCEPT? It is not a "true" intercept but the "mean" of observations and thus you can see that they are the same value for all models. I have found a way to extract the intercept. If you and others need it please let me know.

In addition, I deeply digged into BPM and I saw that it is weird in some cases. For example, the BPM show 7 selected variables but there is one that its coefficient is 0. And also, because BPM try to get the closest loss to of MBA (full model). I did check selected variables with normal linear model and the result showed many of them are INSIGNIFICANT. In my opinion, the strongest poitn of BMA is to handle uncertainty and also exclude autocorrelation (multicollinearity). So, if we select the one close to BMA to get a closest loss model is ... high risk. For example, in my case the HPM with 4 predators (7 with BPM) and all of them are significant. The adjusted R2 is higher than the one of BPM. RMSEs of BMA, HPM and BPM are 11.21, 11.24 and 11.235 respectively. The mean is 142.47 and thus the differences of those losses are not significant. Generally, we add three more predictors to get a weird result, high uncertainty and multicollinearity as well. In my opinion, BPM can not applied for all cases instead of BHM.

Best regards, Hien Huynh

On Fri, Oct 30, 2020, 02:36 Merlise Clyde notifications@github.com wrote:

Updated code in Book Chapter to show correct way to extract coefficients under HPM, MPM and BPM StatsWithR/book#80 https://github.com/StatsWithR/book/issues/80

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/merliseclyde/BAS/issues/49#issuecomment-719117480, or unsubscribe https://github.com/notifications/unsubscribe-auth/AQ4MIRMDWJ7F2D56UWQFYCLSNIKCTANCNFSM4O6ZCMWQ .