pymc-devs / pymc

Bayesian Modeling and Probabilistic Programming in Python
https://docs.pymc.io/

Compute correct log_likelihood for models with partially observed variables #5255

Open ricardoV94 opened 2 years ago

ricardoV94 commented 2 years ago

Masked observations are no longer associated with the observed RV, but to a separate free RV.

import numpy as np
import pymc as pm

with pm.Model() as m:
    # The nan entry is automatically imputed: the observed values go to `x_observed`
    # and the missing entry becomes a separate free RV
    x = pm.Normal('x', observed=[np.nan, 1, 2, 3])

print(m['x_observed'].tag.observations)  # TensorConstant{[1.0 2.0 3.0]}
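For context, here is roughly how the split shows up in the model and in the sampled trace (the x_missing name, the x_observed key in the log_likelihood group, and the idata_kwargs usage are assumptions on my part, not a statement of the intended API):

# Continuing from the model above
print(m.observed_RVs)  # expected: [x_observed]
print(m.free_RVs)      # expected: [x_missing]

with m:
    idata = pm.sample(idata_kwargs={"log_likelihood": True})

# Which pointwise terms end up in the log_likelihood group?
print(idata.log_likelihood)  # presumably only the 3 observed points, under "x_observed"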

As such, I am not sure whether we are computing the correct model log_likelihood for partially observed models, assuming the imputed observations should appear in the final log_likelihood:

Running all tests in test_idata_conversion.py with coverage confirmed that these lines in InferenceDataConverter.log_likelihood_vals_point are not being triggered:

https://github.com/pymc-devs/pymc/blob/0c90e8204fcf4040edca62a490e2f5aee982f96a/pymc/backends/arviz.py#L248-L257

This came up in #5245

ricardoV94 commented 2 years ago

Does anyone have a clear source (or knowledge) of whether imputed variables should be considered during model comparison (and as such be part of the InferenceData log_likelihood group)?

In V3 they were, now they are not.

twiecki commented 2 years ago

CC @junpenglao @fonnesbeck @ColCarroll

OriolAbril commented 2 years ago

My hunch is that neither option is correct, but with loo it's relatively easy to check by comparing PSIS-LOO results (via az.loo) to "brute force" cross-validation (which can be partly automated with az.reloo).
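For reference, a minimal sketch of the PSIS-LOO side of that check (idata_model_a and idata_model_b are hypothetical fitted InferenceData objects with pointwise log_likelihood groups; the az.reloo step additionally requires an az.SamplingWrapper subclass, which I left out):

import arviz as az

# Hypothetical InferenceData objects for two candidate models
loo_a = az.loo(idata_model_a, pointwise=True)
loo_b = az.loo(idata_model_b, pointwise=True)

# High pareto_k values flag observations where PSIS-LOO is unreliable and where a
# brute-force refit (e.g. automated with az.reloo) is needed to double-check
print(loo_a.pareto_k)
print(loo_b.pareto_k)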

I lean towards neither because both waic and loo are designed for model comparison based on predictive accuracy, but when performing imputation, part of the prediction task is "included" in the model and affects the logp. IMO, using waic/loo for model comparison with missing data should work like this: fit the model on the observed data only; loo/waic are then an estimate of how well the model should be able to predict the missing data, and you can use that to compare models. The first and most important question when using loo/waic is defining what the "predictive task" actually is.

Disclaimer: I might be a bit biased against imputation. From what I said above, one could also argue that the model with data imputation is taking into account some extra information (as if one used better priors) and therefore both could be correct depending on the goal.

What would make more sense to me would be to exclude the missing/imputed data from the pointwise log likelihood data (but not from the fit), and then compare the model that excludes the missing data from the fit with the model that uses imputed data (rough sketch after the next paragraph). Now both would be using the same data to estimate their predictive accuracy and can be compared.

Similarly, I guess you could include imputed data there to compare data imputation approaches. But note that here you are not estimating the predictive accuracy on missing data using the cv approach/assumption (i.e. that how well we fit observed data is a valid measure of how well we'll fit unobserved data), but by evaluating the likelihood on the imputed values and treating them as observed. So if the goal is actually to estimate this missing data, using it to calculate loo/waic might not be the best measure, and it might be preferable to simply use the log predictive density on the imputed data only to compare different imputation approaches?
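A rough sketch of the comparison described above, assuming both fits store their pointwise log likelihood under a variable "y" with observation dimension "y_dim_0", and that missing_idx marks the imputed entries (all of these names are hypothetical):

import numpy as np
import arviz as az

# Hypothetical inputs: idata_imputed (fit with data imputation) and
# idata_dropped (fit on the observed data only), plus the missing positions
missing_idx = np.array([0])
n_obs = idata_imputed.log_likelihood.sizes["y_dim_0"]
keep = np.setdiff1d(np.arange(n_obs), missing_idx)

# Drop the imputed points from the pointwise log likelihood (but not from the fit)
# so both models are scored on the same observed data points
idata_imputed_obs = idata_imputed.isel(y_dim_0=keep, groups="log_likelihood")

print(az.compare({"imputed": idata_imputed_obs, "dropped": idata_dropped}))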

fonnesbeck commented 2 years ago

There is nothing special about missing data as far as Bayesian inference is concerned. They are just latent variables, and should be treated as such during model comparison. Most of the time, the missing data will be at terminal nodes of the model graph, with nothing downstream of them, meaning they should not have much influence on model comparison. However, occasionally they are deeper in the model (e.g. imputing missing inputs in a regression). These have to be included in model comparison, because you'd want the model that imputes values that result in better estimates downstream. I can't think of any principled reason for excluding imputed data nodes, if they are part of the model.

ricardoV94 commented 2 years ago

@fonnesbeck if I understand you correctly, that's (accidentally) the current behavior of V4. We should then add a test to avoid regressions?
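Something along these lines, perhaps? A rough pytest-style sketch; the variable name in the log_likelihood group ("x_observed") and the expected shape encode an assumption about the desired behavior, which is exactly what this issue needs to settle:

import numpy as np
import pymc as pm


def test_log_likelihood_partially_observed():
    with pm.Model():
        pm.Normal("x", observed=[np.nan, 1.0, 2.0, 3.0])
        idata = pm.sample(tune=50, draws=50, chains=1, idata_kwargs={"log_likelihood": True})

    # Assumed (current V4) behavior: the pointwise log likelihood covers only the 3 observed points
    log_lik = idata.log_likelihood["x_observed"]
    assert log_lik.shape[-1] == 3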