"Error in contrasts" when using bas.predict

muraiki commented 7 years ago

Some other students and I in the Coursera course on Bayesian Statistics that you helped teach are running into problems using bas.predict. When we try to use that function with the result of bas.lm along with new data, we get the following error:

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels

We've made sure that the new data is a dataframe with all of the required columns, including making sure that the factor levels in the new dataframe match the factor levels found in the dataframe fed to bas.lm. As such, every factor should have at least 2 levels.

We're all pretty much stuck on this, so if you could respond here I will relay the message to the rest of the class. Thank you!

merliseclyde commented 7 years ago

If you can send me a minimal reproducible example I will double check. (clyde@stat.duke.edu)

A couple of quick checks that may be helpful: save the design matrices from

model.matrix(formula, data=dataframe)

for the training and test data and verify that they have the same number of columns. There may be one that is missing from one or the other and we'll need to handle that edge case properly.
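
A minimal sketch of that check, assuming toy data: the data frames `train` and `test`, their columns, and the one-sided formula `rhs` are made up for illustration and are not from the course data. Using only the right-hand side of the formula avoids needing a response value in the test data.

```r
# Hypothetical toy data; names are illustrative only.
train <- data.frame(
  y = c(1.2, 3.4, 2.2, 4.1),
  g = factor(c("no", "yes", "no", "yes"), levels = c("no", "yes")),
  x = c(10, 20, 30, 40)
)

# Test data whose factor carries the same level set as the training data.
test <- data.frame(
  g = factor("yes", levels = levels(train$g)),
  x = 25
)

# Build both design matrices from the predictors only.
rhs <- ~ g + x
mm_train <- model.matrix(rhs, data = train)
mm_test  <- model.matrix(rhs, data = test)

# Same number of columns?
ncol(mm_train) == ncol(mm_test)

# Any column present in one design matrix but missing from the other?
setdiff(colnames(mm_train), colnames(mm_test))
setdiff(colnames(mm_test), colnames(mm_train))
```

If either `setdiff` call returns a column name, that column is the one to look at first.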

muraiki commented 7 years ago

After reading your code at https://github.com/StatsWithR/figures/blob/master/04_bayesian_statistics/week_04/5.4.4_decisions_under_model_uncertianty/R/5.4.4_decisions.md#prediction-with-a-new-data-set I think that we are not using the predict function correctly. Should newdata include enough data to exhaust all the possible factor levels for each factor column? We're all trying to use a newdata containing only a single row but with the same columns.

I'll work on a minimal reproducible example, but here's the output from model.matrix:

> mm <- model.matrix(audience_score ~ feature_film + drama + mpaa_rating_R + thtr_rel_year + oscar_season + summer_season + imdb_rating + imdb_num_votes + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box, data=na.omit(movies))

> ncol(mm)
[1] 15

> zootopia <- data.frame(
  feature_film = c("yes"),
  drama = c("no"),
  mpaa_rating_R = c("no"),
  thtr_rel_year = c(2016),
  oscar_season = c("no"),
  summer_season = c("no"),
  imdb_rating = c(8.1),
  imdb_num_votes = c(277116),
  best_pic_nom = c("no"),
  best_pic_win = c("no"),
  best_actor_win = c("yes"),
  best_actress_win = c("yes"),
  best_dir_win = c("yes"),
  top200_box = c("yes"),
  audience_score = c(NA)  # I also tried putting the correct score here
)

> levels(zootopia$feature_film) <- c("yes", "no")
> levels(zootopia$oscar_season) <- c("no", "yes")
> levels(zootopia$summer_season) <- c("no", "yes")
> levels(zootopia$best_pic_nom) <- c("no", "yes")
> levels(zootopia$best_pic_win) <- c("no", "yes")
> levels(zootopia$best_actor_win) <- c("yes", "no")
> levels(zootopia$best_actress_win) <- c("yes", "no")
> levels(zootopia$best_dir_win) <- c("yes", "no")
> levels(zootopia$top200_box) <- c("yes", "no")

> mm2 <- model.matrix(audience_score ~ feature_film + drama + mpaa_rating_R + thtr_rel_year + oscar_season + summer_season + imdb_rating + imdb_num_votes + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box, data=zootopia)
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
  contrasts can be applied only to factors with 2 or more levels
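
For anyone hitting the same message: in the transcript above, `drama` and `mpaa_rating_R` are never given a second level, and a one-row data frame only ever shows one value per column. The sketch below, on made-up toy data rather than the movies data set, shows one way to build a single-row data frame whose factor declares the full level set taken from the training data. The names `train`, `new_row`, and `bad_row` are hypothetical.

```r
# Hypothetical toy training data (not the movies data set).
train <- data.frame(
  score = c(60, 75, 80, 55),
  drama = factor(c("no", "yes", "no", "yes")),
  year  = c(2001, 2005, 2010, 2014)
)

# One-row new data: declare the full level set explicitly instead of
# letting R infer a single level from the single observed value.
new_row <- data.frame(
  drama = factor("no", levels = levels(train$drama)),
  year  = 2016
)

nlevels(new_row$drama)
mm_new <- model.matrix(~ drama + year, data = new_row)
colnames(mm_new)

# For comparison: a one-level factor triggers the contrasts error.
# tryCatch captures the message instead of stopping the script.
bad_row <- data.frame(drama = factor("no"), year = 2016)
res <- tryCatch(
  model.matrix(~ drama + year, data = bad_row),
  error = function(e) conditionMessage(e)
)
res
```

Taking the levels from the training data with `levels(train$drama)` also keeps the level order consistent, which typing the levels by hand does not guarantee.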

muraiki commented 7 years ago

Dr. Clyde figured out the problem, which of course was with my code and not her library. :) I'm posting this here since I know some other students have run into the same problem with their code.

In one of the intermediate steps I was using rbind to join the new data to a subset of the movies dataframe. In this process, rbind was converting numeric columns to factors. There's a particular way to go about creating a new dataframe and appending it to the existing movies dataframe such that all of the data types are preserved. See the example below:

library(BAS)
set.seed(42)

d <- data.frame(
  likes.cats = c(rep("yes", 60), rep("no", 40)),
  likes.dogs = c(rep("yes", 30), rep("no", 70)),
  fish.eaten.per.year = c(rnorm(100, mean=20, sd=5)),
  donated.to.cats = c(rnorm(50, mean=50, sd=10), rep(0, 50))
)

m <- bas.lm(donated.to.cats ~ likes.cats + likes.dogs + fish.eaten.per.year,
            data=d, prior="BIC", modelprior=uniform())

m
summary(m)

# The below ensures that the data types are correct.
# There are other ways of doing this, but this worked
# for Dr. Clyde and me, so I'm sticking to it!
nd <- data.frame("yes", "no", 100, 50)
colnames(nd) <- colnames(d)

dnd <- rbind(d, nd)

predict(m, dnd[101,])  # yay it works
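
A quick way to confirm that rbind has not changed any column types is to compare the column classes before and after. This is only a sketch on a small made-up data frame; `stringsAsFactors = TRUE` is set explicitly so that it behaves the same way across R versions.

```r
# Hypothetical small data frame with one factor and one numeric column.
d <- data.frame(
  likes.cats = c("yes", "no", "yes"),
  fish.eaten.per.year = c(18.5, 22.1, 20.3),
  stringsAsFactors = TRUE
)

# New row built as its own data frame, then given matching column names.
nd <- data.frame("no", 25)
colnames(nd) <- colnames(d)

dnd <- rbind(d, nd)

# Compare column classes before and after the rbind.
classes_before <- sapply(d, class)
classes_after  <- sapply(dnd, class)
identical(classes_before, classes_after)

# Inspect the combined data frame directly.
str(dnd)
```

If the comparison is not TRUE, the `str` output shows which column was converted.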

merliseclyde / BAS

"Error in contrasts" when using bas.predict #5