jknowles / merTools

Convenience functions for working with merMod objects from lme4
105 stars 15 forks source link

Nonreproducible predictInteval results mixing observed and unobserved levels. #124

Open dotPiano opened 2 years ago

dotPiano commented 2 years ago

Hi,

I have trained a LMM of the form y ~ 1 + (1|a) + (1|a:b) and now want to predict y for different inputs. All the inputs are stored in a single dataframe, which I pass to predictInterval. In my query dataframe all a are present in the training dataset, while a:b are only sometimes present.

I noticed that multiple calls to predictInterval return significantly different (more than simple sampling noise) intervals in some cases. Surprisingly, the intervals predicted calling predictInterval on the entire dataframe and on each row individually are consistently different.

I tried to debug the code of the function, and I believe that the problem lies here: https://github.com/jknowles/merTools/blob/master/R/merPredict.R#L246

 tmp$var <- names(tmp[keep])[max.col(tmp[keep])] 

With my input, tmp[keep] looks like this:

                       a1:b1                     a2:b1
2                          1                         0
3                          0                         0
4                          0                         1

Queries 2 and 4 have a corresponding levels in the training set, while query 3 doesn't. I would now expect tmp$var to take the values a1:b1, NA, a2:b1, but instead it takes either a1:b1, a1:b1, a2:b1 or a1:b1, a2:b1, a2:b1 at random. There are two problems here:

  1. max.col will return a value (and thus select a column/level) even if the entire row in zero (instead of returning NA).
  2. Since a zero row is a tie, max.col by default selects a column at random.

I'm relatively new to R, and even more to LMMs, but my understanding is that when a level is missing, predictInterval will compute the random effect for another level selected at random from the (observed) levels of other queries. A solution that seems to work for me is to change the guilty line to:

tmp2 <- tmp[keep]
tmp2[!as.logical(rowSums(tmp2 != 0)),] <- NA
tmp$var <- names(tmp[keep])[max.col(tmp2)]

Quite dirty, but the idea is to fill the all-zeros rows with NAs, so that max.col will return NA for them.

Unfortunately I cannot share my actual model/dataset at the moment, but if the issue is unclear I'm happy to try to build a small example to reproduce the problem.

jknowles commented 2 years ago

I think this is the expected behavior of predictInterval. When you request intervals for a level not present in the model for random effects, the interval covers sampling across the random effect levels. This is a conservative default because if the random effect is not observed, our safest assumption is to assume it is drawn from the same distribution of random effects, not to assume that it is the value of the mean/median effect in the distribution.

I admit that both our documentation and vignette are a bit out of date on this one, for example the vignette mistakenly states:

The prediction data set can include levels that are not in the estimation model frame. The prediction intervals for such observations only incorporate uncertainty from fixed coefficient estimates and the residual level of variation.

It looks like you'd prefer to return an NA for these observations? Or would you prefer it behave as the vignette describes?

dotPiano commented 2 years ago

I changed the title to reflect the main issue, which is the prediction of different intervals with multiple calls for the same query.

In my case I want to use the random effects to represent nesting, for example y ~ 1 + (1|class/pupil). Again, I'm new to LMMs, but my assumption is that if a pupil is not observed we can just drop (1|class:pupil) and look at 1 + (1|class) which should represent the population from which pupil is drawn. This is actually the current behavior if the query contains a single row. Does this make sense?

Regarding NAs: I expected the lines above to produce NAs because this leads to the random effect being dropped later in tmp.pred(). I still would like to have predictInterval() return an estimate even if a particular level is not observed, but I think it's important that this result is not based on randomly selected levels (both for scientific reproducibility and because this would not be an actual prediction interval).

jknowles commented 2 years ago

On the technical side I need to look more into what predictInterval is doing in the nested case very specifically before I decide what action, if any to take. It sounds like the function may be returning inconsistent results for all cases when some unobserved cases are present, which would certainly be bad! Setting the seed should handle the reproducibility problem if I have implemented the default correctly, but it is something I need to look into.

Do you happen to have a reproducible example I can execute that illustrates this?

but my assumption is that if a pupil is not observed we can just drop (1|class:pupil) and look at 1 + (1|class) which should represent the population from which pupil is drawn. This is actually the current behavior if the query contains a single row. Does this make sense?

Conceptually here you are right in thinking about the fitted value for the prediction. The mean/median pupil effect will be 0, so we could just rely on class to get the fitted value and we would be right for most pupils. The problem is the interval component. The purpose of the prediction interval is to tell us what range of values the model would see as plausible for a given set of inputs. When we are making predictions for a level we do not observe in the model, this question becomes downright philosophical.

Your approach would exclude two types of variation in the outcome that we know about and can define from the model: the variation from the mean of the effect of a given student (which can be estimated by looking at the distribution the random pupil effects are drawn from), and you would also be excluding the variance associated with the precision with which we can estimate the effect of any given pupil. This would make the prediction interval overly confident in terms of what our model can know about the student's likely outcome. By sampling across the distribution of plausible values the model has already seen, we're saying we believe the pupil was probably drawn from the same distribution as the pupils in the model, but capturing that we don't know more than that about them and that our prediction should reflect that uncertainty.

This is why I prefer the default to be for each simulation to sample across the distribution for the unobserved pupil and draw an interval encompassing the potential effect of that pupil and the variance in our precision of estimating the pupil effect. Whether that default is a) documented correctly, or b) functioning in all edge cases correctly is something you've helped raise as something I should look into.

However, in some cases, I can see where this default is undesirable. In my own work I have dealt with this by fixing unobserved groups to the group closest to the mean when I define newdata and indicating they were unobserved through an indicator variable.

dotPiano commented 2 years ago

Here is an example. For clarity: grades come from different tests that each student took during the year, and for whatever reason some grades/pupils have been "lost":

library("lme4")
library("merTools")

training_df <- data.frame(
  class = c("c1", "c1", "c1", "c1", "c1", "c1", "c1", "c1", "c2"),
  pupil = c("p1", "p1", "p1", "p2", "p3", "p3", "p4", "p5", "p7"),
  grade = c(4.5,  4.8,  4.7,  5.0,  4.1,  4.0,  3.5,  5.5,  5.5)
)

query_df <- data.frame(
  class = c("c1", "c1", "c1"),
  pupil = c("p1", "p2", "p6")
)

model <- lmer("grade ~ 1 + (1|class/pupil)", data=training_df)
print(predictInterval(model, query_df, seed=1, n.sims=10000))
print(predictInterval(model, query_df, seed=2, n.sims=10000))

In the output I see that p6, whose grades are completely missing, get different predictions depending on the seed. Additionally, if I ask for the grade of p6 alone, I get yet another estimate, which is always the same regardless of the seed:

       fit      upr      lwr
1 4.656496 5.386695 3.930568
2 4.995546 5.717711 4.246008
3 4.989409 5.715988 4.246086
       fit      upr      lwr
1 4.673843 5.392574 3.959413
2 4.989406 5.721927 4.262704
3 4.666336 5.390428 3.955993
       fit      upr      lwr
1 4.677724 5.341266 4.008364
Warning message: The following levels of pupil:class from newdata -- NA -- are not in the model data. [...]

By sampling across the distribution of plausible values the model has already seen, we're saying we believe the pupil was probably drawn from the same distribution as the pupils in the model, but capturing that we don't know more than that about them and that our prediction should reflect that uncertainty.

Maybe I start to understand the issue. Now, regarding the two sources variation that you mention:

the variation from the mean of the effect of a given student

The mean effect should be drawn from the distribution of the class. If (1|class:pupil) is dropped, isn't this captured by the residual variance of 1 + (1|class)?

the variance associated with the precision with which we can estimate the effect of any given pupil

I don't think that selecting a level at random is a good way of capturing this. What about introducing an option to specify assumptions? For example the user could assume that the variance is the same for all pupils, so that it can be estimated from the observed pupils. This may not make much sense in this specific example, but would be great if one wants to model experimental measurements taken with the same device/technique (which is actually my case). Alternatively, one could conservatively assume that this variance is at most as large as the largest observed variance. Complicating things even more, one could model the distribution of the pupil variances, and sample from there... just brainstorming a bit here.

I understand this is a difficult problem and I really appreciate your help!

dotPiano commented 2 years ago

I made quite a bit of confusion with the comments about the sources of variation.

Using the same example and assuming that tests are noisy observations of the skills of a student and we are interested in predicting the "true grade" of a student (not what the grade of a future test could be). One would then call predictInterval(..., include.resid.var=FALSE), right? In that case my argument of using the residuals does not apply and we really need to estimate the effect of an unknown pupil.

Trying to iterate on your current implementation: currently when a new pupil appears, the function would pick an observed one at random and sample n.sims model parameters for it. What about sampling n.sims pupils and one set of model parameters for each of them instead? I would also argue that it's better (or should be possible to chose) to sample pupils from the training data rather than from the query, as the training set is our best knowledge of the possible pupil effects.

jknowles commented 2 years ago

Thanks for bringing this up.

I will write some test conditions and get predictInterval to behave the way I think it should behave and align the documentation to reflect that clearly. From that default almost any alternative ways one may wish to prefer to treat the variation of unobserved levels can be handled gracefully using the other package functionality like wiggle and marginalizeRE which I will write up a brief use case to describe.

Thanks for helping me clearly seen an important defect in how we are communicating what predictInterval() is doing in the important case of producing predictions out-of-sample. It will be really helpful to make the default crystal clear, and provide end-users alternatives should they wish to obtain a different type of estimate.

dotPiano commented 2 years ago

That sounds very helpful, thanks a lot!