jhelvy / logitr

Fast estimation of multinomial (MNL) and mixed logit (MXL) models in R, with "Preference" space or "Willingness-to-pay" (WTP) space utility parameterizations
https://jhelvy.github.io/logitr/

Predicted names don't match #41

Closed erikseulean closed 1 year ago

erikseulean commented 1 year ago

Hello,

Thanks for providing the library, I found out about it a few days ago and it is truly amazing!

I am trying to run a model that has one feature set as a factor; however, it errors out when the factor levels in the newdata argument provided to predict() differ from those in the training dataset.

I think the error comes from here: https://github.com/jhelvy/logitr/blob/e478e02d76903465d28197e9f0d343a68b7ae4bd/R/inputChecks.R#L156

In my case, the values provided in newdata are a subset of the ones in the training dataset. This case should be fine, right? No observation would have undetermined coefficients.

jhelvy commented 1 year ago

Hi, glad you're finding the package useful!

It is my understanding that the preferred behavior for a predict() method is to require that newdata contain columns for all model parameters.

This is true for other models as well. For example, using lm(), the predict() method will error unless you include all covariates in newdata. Here's a quick example:

fit <- lm(mpg ~ hp + wt + disp, data = mtcars)

If I use predict() with all covariates, there's no error:

pred <- predict(fit, mtcars[1:5, c('hp', 'wt', 'disp')])

Now if I use predict() where I've left out the "disp" covariate, I get the following error:

pred <- predict(fit, mtcars[1:5, c('hp', 'wt')])
Error in eval(predvars, data, env) : object 'disp' not found

I think you have two options:

  1. Estimate a model that only includes the covariates in your newdata (omit the missing variables). This is probably the worse option, as you could introduce bias by leaving out important variables.
  2. Include the missing variables in newdata. You can always set those variables to a constant value if they are not important for your prediction. For example, if you had a variable like brand that you didn't want to vary in your prediction, just set the brand variable in newdata to the same brand everywhere (see the sketch below).
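
For example, here's a minimal sketch of option 2 (the newdata and brand names here are hypothetical):

# Hold a hypothetical brand covariate constant so it is present in
# newdata but does not vary across alternatives in the prediction:
newdata$brand <- "dannon"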
erikseulean commented 1 year ago

Thanks for the fast reply!

This is not about a missing covariate. The covariates are all there; however, one covariate, which is a factor, is missing a level in the test dataset that is present in the training dataset.

I'll try reproducing this again later.

jhelvy commented 1 year ago

I can't reproduce this result. Here's an example using simulated data.

I have another package called {cbcTools} that you can use to simulate choice-based conjoint survey data.

install.packages(c('logitr', 'remotes'))
remotes::install_github('jhelvy/cbcTools')

Use {cbcTools} to make a data set with random choices:

library(cbcTools)
library(logitr)

profiles <- cbc_profiles(
    price     = seq(1, 4, 0.5), # $ per pound
    type      = c('Fuji', 'Gala', 'Honeycrisp'),
    freshness = c('Poor', 'Average', 'Excellent')
)

design <- cbc_design(
    profiles = profiles,
    n_resp   = 900, # Number of respondents
    n_alts   = 3,   # Number of alternatives per question
    n_q      = 6    # Number of questions per respondent
)

data <- cbc_choices(
    design,
    obsID = 'obsID'
)

Now estimate a logit model:

model <- logitr(
    data = data, 
    obsID = 'obsID', 
    outcome = 'choice',
    pars = c('price', 'type', 'freshness')
)

Make a new dataset where the "Fuji" level is not present for the type feature:

new <- data
new[which(new$type == "Fuji"),]$type <- "Gala"
unique(new$type)

[1] Gala       Honeycrisp
Levels: Fuji Gala Honeycrisp

Make predictions using model and the new data:

pred <- predict(
    object = model, 
    newdata = new, 
    obsID = "obsID", 
    returnData = TRUE
)

I get predicted probabilities without any errors.

erikseulean commented 1 year ago

I tried again, but I get the same issue:

sum(colnames(train) == colnames(test)) == length(colnames(train))
[1] TRUE
sum(colnames(train) == colnames(test)) == length(colnames(test))
[1] TRUE

Here are the two variables (obfuscated)

unique(train$info)
Levels: 'C' 'F' 'G' 'H' 'M' 'R'
unique(test$info)
Levels: 'C' 'F' 'G' 'H' 'M'

As you can see, the difference is in the number of levels. If I remove this covariate from pars and everything else stays the same, it works just fine.

I'll read the code a bit more to figure out what the issue is.

erikseulean commented 1 year ago

(Quoting your earlier example, where the "Fuji" values were replaced but the factor levels were left unchanged, and the predictions ran without errors.)

I can actually reproduce the error with this case by doing the following:

new$type <- factor(c(rep("Gala", 8100), rep("Fuji", 8100)))
unique(new$type)
[1] Gala Fuji
Levels: Fuji Gala

This is actually a valid example, I think: if you read the data from a file, the factor won't retain knowledge of the previous levels. In your situation, the levels are the ones that were initially set, and you only change the values in the column.
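
For example (a minimal sketch with a made-up file name), a write/read round trip re-creates the factor from its values alone and drops unused levels:

x <- factor(c("Gala", "Honeycrisp"), levels = c("Fuji", "Gala", "Honeycrisp"))
levels(x) # the unused "Fuji" level is kept in memory

[1] "Fuji" "Gala" "Honeycrisp"

write.csv(data.frame(type = x), "tmp.csv", row.names = FALSE)
df <- read.csv("tmp.csv")
df$type <- factor(df$type) # re-created from the values alone
levels(df$type)

[1] "Gala" "Honeycrisp"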

jhelvy commented 1 year ago

Okay this makes sense and I can reproduce this error. This is the message I get:

Error in predictParCheck(object, X) : 
  The coefficient names for the provided model do not correspond to variables in "newdata".

Expect columns:
    price, typeGala, typeHoneycrisp, freshnessAverage, freshnessExcellent

Encoded column names from provided `newdata` object:
    price, typeGala, freshnessAverage, freshnessExcellent

If you have a factor variable in "newdata", check that the factor levels match those of the data used to estimate the model.

Perhaps I need to update my error messages, as some of this is not quite related. The last message is the most relevant to this situation: it says to check that the factor levels match those of the data used to estimate the model. This does not appear to be an issue with the package or the model you estimated, but rather with the data itself. If you read in an external data file that is missing one of the factor levels used to estimate the model, I think it actually should error here. This makes sense to me, as the model is expecting factor levels that are missing.

I don't have an immediate workaround for this other than adding the missing factor level to the data you read in. Perhaps this SO post might be helpful?
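
For example, something like this should work (a minimal sketch, reusing the apple type levels from the earlier example):

# Re-declare the factor with the full set of levels used in estimation
new$type <- factor(new$type, levels = c("Fuji", "Gala", "Honeycrisp"))
levels(new$type)

[1] "Fuji" "Gala" "Honeycrisp"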

erikseulean commented 1 year ago

So my workaround is doable, but incorrect. I'll check what lm() does here, as I'm curious (but I'm fairly convinced it doesn't error out). From a user standpoint it does not feel correct to have to supply all the factor levels; I feel that the error in the code is checking the wrong thing (I think it was intended to check the matching of the covariates rather than the levels).

There are scenarios where the levels used to fit the model (present in the training data) are not available to the user when predict() is called (one reason I can think of is when the training data is huge and you want to free it from memory after training, with the predictions done at a much later point in time).

The other logical question is: why do you need to provide factor levels that you don't use? The predictions can happen without these levels, no?

Another scenario where this would fail is if you save your model to disk and then load it on a different machine where you don't have the data (which is somewhat common if you train and use on different clusters):

saveRDS(mnl_pref, "model.rds")
my_model <- readRDS("model.rds")

Using it on the new cluster would require ensuring the levels match, which again seems incorrect in my opinion.

From a correctness standpoint I could see two ways of doing this:

  1. The library checks that the factor levels in the test data are a subset of the ones in the training data (a rough sketch of this check follows the list).
  2. The library sets the factor levels based on the data the model was trained on, and checks that no levels absent from the training data are present.
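
Here's a rough sketch of what I mean by option 1 (checkLevelsSubset is a made-up name, just to illustrate):

# Error only if newdata contains levels the model has never seen;
# a missing level is fine, an unseen level is not.
checkLevelsSubset <- function(newLevels, trainLevels) {
    extra <- setdiff(newLevels, trainLevels)
    if (length(extra) > 0) {
        stop("newdata contains unseen factor levels: ",
             paste(extra, collapse = ", "))
    }
}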

In my situation I can get by with adding the factor levels; however, six months down the line I can see my predictions being based on some streamed input where I no longer have access to the training data.

Is there a library you know of that works on the principle mentioned above (where you have to alter the test data), as an example?

If you are willing to accept contributions to the library and decide on a possible solution, I am happy to get my hands dirty, read the code, and send a PR.

jhelvy commented 1 year ago

All good points here. For sure I should revise the input checking to separate the two different types of issues (missing variables versus missing factor levels for a factor-type variable).

Whenever something like this comes up, I try to check how other packages handle the situation. And you're right about not expecting an error. Here's an example for lm(), which doesn't error if a factor level is missing:

library(palmerpenguins)

# Fit a model
fit <- lm(body_mass_g ~ flipper_length_mm + species, data = penguins)

# Check levels for penguins$species
levels(penguins$species)

[1] "Adelie"    "Chinstrap" "Gentoo"   

# Make new dataset for prediction, missing the "Chinstrap" level for species
new <- penguins[c(1:5, 171:175), ]
new$species <- factor(c(rep("Adelie", 5), rep("Gentoo", 5)))

# Check levels for new$species

levels(new$species)

[1] "Adelie" "Gentoo"

# Predict

pred <- predict(fit, new)

There is in fact no error at all with lm() in this situation. So the input check you identified should keep erroring if a covariate is missing, but it should not error if a factor level is missing.

I expect that fixing this will involve more than just changing the input check functions. How I handle data encoding internally prior to estimating the models probably assumes that all levels are present (that's my guess as to why I have this check here in the first place).

In particular, I expect the recodeData() function may have to be revised. The reason is that it sets a reference level for categorical variables based on the factor levels, and it gets called when you use predict() to encode the data prior to making predictions. If a factor level is missing, the function may encode a different level as the reference level, which would leave the resulting dummy-coded coefficients misaligned with the names of the model parameters.
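
You can see the misalignment with base R's dummy coding (a small illustration, not logitr's internal code):

# With all three levels declared, "Fuji" is the reference level:
f_full <- factor(c("Gala", "Honeycrisp"), levels = c("Fuji", "Gala", "Honeycrisp"))
colnames(model.matrix(~ f_full))

[1] "(Intercept)"      "f_fullGala"       "f_fullHoneycrisp"

# With the "Fuji" level missing, "Gala" silently becomes the reference:
f_missing <- factor(c("Gala", "Honeycrisp"))
colnames(model.matrix(~ f_missing))

[1] "(Intercept)"         "f_missingHoneycrisp"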

The factor order matters for categorical variables. See for example this article in the documentation, which shows how you can change the reference level for categorical variables. I want to make sure this functionality is preserved as this is how most R packages work.
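
For example, in base R the reference level follows the factor level order (a generic illustration, not logitr-specific):

f <- factor(c("Fuji", "Gala", "Honeycrisp"))
levels(f) # "Fuji" is the reference by default (alphabetical order)

[1] "Fuji" "Gala" "Honeycrisp"

f2 <- relevel(f, ref = "Gala") # make "Gala" the reference instead
levels(f2)

[1] "Gala" "Fuji" "Honeycrisp"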

A possible workaround is to add some code that internally sets the missing factor levels. The data are stored in the model, so for categorical variables we should be able to get their levels (and the order of those levels). If any are missing in newdata, we should be able to add those levels without actually modifying the data values (there must be a way to do this). That (I hope) will address the issue.
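
Something along these lines is what I have in mind (a sketch only; alignFactorLevels is a made-up name, not part of logitr):

# Align newdata's factor levels with those stored from estimation
# before the data get encoded for prediction:
alignFactorLevels <- function(newdata, modelData) {
    for (col in names(newdata)) {
        if (col %in% names(modelData) && is.factor(modelData[[col]])) {
            newdata[[col]] <- factor(newdata[[col]], levels = levels(modelData[[col]]))
        }
    }
    newdata
}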

If you have the time to try to walk through the code, I'm happy to accept a PR. Otherwise I'll have to check this again after finals week (next week) is over (lots and lots of grading to do).

erikseulean commented 1 year ago

This is not a blocking issue for me, so there's no reason to try to fit it into your schedule! I will find some time to read the library and understand the code (I wanted to do that anyway, mostly to get more insight into the implementation) and see if I can get a potential fix in.

Thanks a lot for taking the time on this!

jhelvy commented 1 year ago

I believe I may have fixed this issue on the latest branch. Can you try installing from that branch to test your data?

remotes::install_github("jhelvy/logitr@Rc-v1.0.0")

Here's an example using the yogurt data:

library(logitr)

model <- logitr(
    data    = yogurt,
    outcome = 'choice',
    obsID   = 'obsID',
    pars    = c('price', 'feat', 'brand')
)

Use the model to make predictions using a subset of the original data:

data <- subset(
    yogurt, obsID %in% c(42, 13),
    select = c('obsID', 'alt', 'price', 'feat', 'brand'))

probs <- predict(model, newdata = data, obsID = "obsID")

Now remove the 4th alternative from the data (eliminating the "yoplait" level from brand) and predict again:

data2 <- data[which(data$alt != 4),]
probs <- predict(model, newdata = data2, obsID = "obsID")

No errors (hopefully)

erikseulean commented 1 year ago

Sorry for coming back so late. This seems to be working now! Thanks a lot for looking into it.

jhelvy commented 1 year ago

Okay, great, thanks for checking it out! The changes were merged into the latest version, 1.0.0.