Hello,
Thanks for providing the library, I found out about it a few days ago and it is truly amazing!
I am trying to run a model that has one feature set as a factor; however, it errors out when the factor levels in the newdata field provided to predict differ from those in the train dataset. I think the error comes from here: https://github.com/jhelvy/logitr/blob/e478e02d76903465d28197e9f0d343a68b7ae4bd/R/inputChecks.R#L156
In my case, the values provided in newdata are a subset of the ones provided in the train dataset. This case should be fine, right, since no observation would have undetermined coefficients?
Hi, glad you're finding the package useful!
It is my understanding that the preferred predict() method behavior is to require that newdata contain columns for all model parameters. This is true for other models as well. For example, using lm(), the predict() method will error unless you include all covariates in newdata. Here's a quick example:
fit <- lm(mpg ~ hp + wt + disp, data = mtcars)
If I use predict() with all covariates, there's no error:
pred <- predict(fit, mtcars[1:5, c('hp', 'wt', 'disp')])
Now if I use predict() where I've left out the "disp" covariate, I get the following error:
pred <- predict(fit, mtcars[1:5, c('hp', 'wt')])
Error in eval(predvars, data, env) : object 'disp' not found
I think you have two options:
1. Estimate a model that only uses the variables present in newdata (omit variables that are missing). This is probably less good as you could be introducing biases by leaving out important variables.
2. Add the missing variables to newdata. You can always make the values for those variables the same if they are not important in your prediction. For example, if you had some variables like brand that you didn't want to vary in your prediction, just set the brand variable in newdata to all be the same brand (as sketched below).
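Here's a minimal sketch of option 2; train, newdata, and the brand column are hypothetical stand-ins:

newdata$brand <- "brandA"  # hold the factor at a single, arbitrary level
# re-declare it with the full level set from the (hypothetical) training data
newdata$brand <- factor(newdata$brand, levels = levels(train$brand))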
Thanks for the fast reply!
This is not about the covariates. The covariates are all there; however, one covariate, which is a factor, is missing a level in the test dataset that is present in the train dataset.
I'll try reproducing this again later.
I can't reproduce this result. Here's an example using simulated data.
I have another package called {cbcTools} that you can use to simulate choice-based conjoint survey data.
install.packages(c('logitr', 'remotes'))
remotes::install_github('jhelvy/cbcTools')
Use {cbcTools} to make a data set with random choices:
library(cbcTools)
library(logitr)
profiles <- cbc_profiles(
price = seq(1, 4, 0.5), # $ per pound
type = c('Fuji', 'Gala', 'Honeycrisp'),
freshness = c('Poor', 'Average', 'Excellent')
)
design <- cbc_design(
profiles = profiles,
n_resp = 900, # Number of respondents
n_alts = 3, # Number of alternatives per question
n_q = 6 # Number of questions per respondent
)
data <- cbc_choices(
design,
obsID = 'obsID'
)
Now estimate a logit model:
model <- logitr(
data = data,
obsID = 'obsID',
outcome = 'choice',
pars = c('price', 'type', 'freshness')
)
Make a new dataset where the "Fuji" level is not present for the type feature:
new <- data
new[which(new$type == "Fuji"),]$type <- "Gala"
unique(new$type)
[1] Gala Honeycrisp
Levels: Fuji Gala Honeycrisp
Make predictions using model and the new data:
pred <- predict(
object = model,
newdata = new,
obsID = "obsID",
returnData = TRUE
)
I get predicted probabilities without any errors.
I tried again, but I get the same issue:
sum(colnames(train) == colnames(test)) == length(colnames(train))
sum(colnames(train) == colnames(test)) == length(colnames(test))
TRUE
TRUE
Here are the two variables (obfuscated):
unique(train$info)
Levels: 'C' 'F' 'G' 'H' 'M' 'R'
unique(test$info)
Levels: 'C' 'F' 'G' 'H' 'M'
As you can see, the difference is the number of levels one has compared to the other. If I remove the covariate from pars and everything else stays the same, it works just fine.
I'll read the code a bit more to figure out what the issue is.
I can actually reproduce the error with your example above by doing the following:
new$type = factor(c(rep("Gala", 8100), rep("Fuji", 8100)))
unique(new$type)
[1] Gala Fuji
Levels: Fuji Gala
I think this is actually a valid example: if you read the data from a file, the factor won't retain any knowledge of the previous levels. In your situation, the levels are the ones that were initially set, and you only change the values in the column.
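For example (a sketch; the file name is made up):

test <- read.csv("test.csv")    # hypothetical file that happens to contain no "Fuji" rows
test$type <- factor(test$type)  # the factor only gets the levels actually present
levels(test$type)               # "Gala" "Honeycrisp" -- no "Fuji"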
Okay this makes sense and I can reproduce this error. This is the message I get:
Error in predictParCheck(object, X) :
The coefficient names for the provided model do not correspond to variables in "newdata".
Expect columns:
price, typeGala, typeHoneycrisp, freshnessAverage, freshnessExcellent
Encoded column names from provided `newdata` object:
price, typeGala, freshnessAverage, freshnessExcellent
If you have a factor variable in "newdata", check that the factor levels match those of the data used to estimate the model.
Perhaps I need to update my error messages, as some of this is not quite related. It's the last message that is most relevant to this situation: it says to check that the factor levels match those of the data used to estimate the model.
This does not appear to be an issue with the package or the model you estimated but rather the data itself. If you read in an external data file that is missing one of the factor levels used to estimate the model, I think that it should actually error here. This makes sense to me as the model is expecting factor levels that are missing.
I don't have an immediate workaround for this other than adding the missing factor level to the data you read in. Perhaps this SO post might be helpful?
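Something along these lines (a sketch, using the type column from the example above):

new <- read.csv("new_data.csv")  # hypothetical file that is missing the "Fuji" level
# re-declare the factor with the full level set used to estimate the model
new$type <- factor(new$type, levels = c("Fuji", "Gala", "Honeycrisp"))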
So my workaround is doable, but incorrect. I'll actually check what lm does for this, as I'm curious (but I'm fairly convinced it doesn't error out). From a user standpoint it does not feel correct to have to supply all the factor levels; I feel that the error in the code is checking the wrong thing (I think it was intended to check the matching of the covariates rather than the levels).
There are scenarios where the levels used to fit the model (present in the train data) are not available to the user when predict is called (one reason I can think of is when the train data is huge and you want to discard it from memory after training, with predictions made at a much later point in time).
The other logical question is: why do you need to provide factor levels that you don't need? The predictions can happen without these levels, no?
Another scenario where this would fail is if you save your model to disk and then load it on a different machine where you don't have the data (which is somewhat common if you train and use on different clusters):
saveRDS(mnl_pref, "model.rds")
my_model <- readRDS("model.rds")
Using it on the new cluster would require ensuring the levels are set, which is again incorrect in my opinion.
From a correctness standpoint I could see two ways of doing this:
In my situation I can get by with adding the factor levels; however, six months down the line I can see my predictions being based on some streamed input where I no longer have access to the training data.
Is there a library you know of that works on the principle mentioned above (where you have to alter the test data), as an example?
If you are willing to accept contributions to your library and decide on a possible solution, I am happy to get my hands dirty, read the code, and send a PR.
All good points here. For sure I should revise the input checking to separate the two different types of issues here (whether there are missing variables, or whether there are missing factor levels for a factor type variable).
Whenever something like this comes up, I try to check how other packages handle the situation. And you're right about not expecting an error. Here's an example for lm(), which doesn't error if a factor level is missing:
library(palmerpenguins)
# Fit a model
fit <- lm(body_mass_g ~ flipper_length_mm + species, data = penguins)
# Check levels for penguins$species
levels(penguins$species)
[1] "Adelie" "Chinstrap" "Gentoo"
# Make new dataset for prediction, missing the "Chinstrap" level for species
new <- penguins[c(1:5, 171:175), ]
new$species <- factor(c(rep("Adelie", 5), rep("Gentoo", 5)))
# Check levels for new$species
levels(new$species)
[1] "Adelie" "Gentoo"
# Predict
pred <- predict(fit, new)
There is in fact no error at all with lm() in this situation. So for sure the input check that you identified should keep erroring if a covariate is missing, but it should not error if a factor level is missing.
I expect that fixing this will involve more than just changing the input check functions. How I handle data encoding internally prior to estimating the models probably assumes that all levels are present (that's my guess as to why I have this check here in the first place).
In particular, I expect the recodeData() function may have to be revised. The reason is that it sets a reference level for categorical variables based on the factor levels, and it gets called when you use predict() to encode the data prior to making predictions. If a factor level is missing, the function may encode a different level as the reference level, which would cause the resulting dummy-coded coefficients to be misaligned with the names of the model parameters.
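Here's a small base R illustration of that misalignment; it uses model.matrix() for dummy coding, which is not necessarily what the package does internally:

f_full <- factor(c("Fuji", "Gala", "Honeycrisp"))
colnames(model.matrix(~ f_full))  # "(Intercept)" "f_fullGala" "f_fullHoneycrisp"
f_miss <- factor(c("Gala", "Honeycrisp"))  # "Fuji" absent entirely
colnames(model.matrix(~ f_miss))  # "(Intercept)" "f_missHoneycrisp" -- "Gala" silently became the reference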
The factor order matters for categorical variables. See for example this article in the documentation, which shows how you can change the reference level for categorical variables. I want to make sure this functionality is preserved as this is how most R packages work.
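In base R the reference level is changed with relevel(), e.g. (a generic sketch, not necessarily the exact code from that article):

data$type <- relevel(data$type, ref = "Honeycrisp")  # make "Honeycrisp" the reference level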
A possible workaround is to add some code to internally set the missing factor levels. The data are stored in the model, so for categorical variables we should be able to get their levels (and the order of those levels). If any are missing in the newdata, we should be able to add those additional levels without actually modifying the data (there must be a way to do this). That (I hope) will address the issue.
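As a rough sketch of what that could look like (assuming the estimation data is available as model$data, per the comment above):

for (var in names(newdata)) {
  if (is.factor(model$data[[var]])) {
    # re-declare the factor using the levels (and level order) from the estimation data
    newdata[[var]] <- factor(newdata[[var]], levels = levels(model$data[[var]]))
  }
}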
If you have the time to try to walk through the code, I'm happy to accept a PR. Otherwise I'll have to check this again after finals week (next week) is over (lots and lots of grading to do).
This is not a blocking issue for me, so there's no reason to try to fit it anywhere in your schedule! I will find some time to read the library and understand the code (I wanted to do it anyways mostly to get some more insights about the implementation) and see if I can get a potential fix in.
Thanks a lot for taking your time on this!
I believe I may have fixed this issue on the latest branch. Can you try installing from that branch to test your data?
remotes::install_github("jhelvy/logitr@Rc-v1.0.0")
Here's an example using the yogurt data:
library(logitr)
model <- logitr(
data = yogurt,
outcome = 'choice',
obsID = 'obsID',
pars = c('price', 'feat', 'brand')
)
Use the model to make predictions using a subset of the original data:
data <- subset(
yogurt, obsID %in% c(42, 13),
select = c('obsID', 'alt', 'price', 'feat', 'brand'))
probs <- predict(model, newdata = data, obsID = "obsID")
Now remove the 4th alternative from the data (eliminating the "yoplait" level from brand) and predict again:
data2 <- data[which(data$alt != 4),]
probs <- predict(model, newdata = data2, obsID = "obsID")
No errors (hopefully)
Sorry for coming back so late. This seems to be working now! Thanks a lot for looking into it.
Okay, great, thanks for checking it out! The changes were merged into the latest version, 1.0.0.