BUG: automatic data imputation does not work when observed=pm.Data() tensors

kamicollo commented 1 year ago

Describe the issue:

Automatic imputation fails silently in pyMC if a user passes partially observed data held in pm.ConstantData() or pm.MutableData() to observed parameter of any distribution. In simple models, the user won't be able to sample (as loglik will evaluate to nan), but I have also been able to run more complex (GP) models that sampled - likely producing wrong results. (see https://discourse.pymc.io/t/issue-imputing-data-for-gaussian-process-model/11626/3 for detail).

Based on my initial review of the source code, it seems the culprit is Model.make_obs_var() method, where the check whether passed data is performed with mask = getattr(data, "mask", None), which always returns None for tensors.

In case of pm.ConstantData(), the fix appears to be quite simple (need to retrieve masked values by mask = getattr(data.value, "mask", None) instead. In case of pm.MutableData(), however, the issue seems to be that pytensor.shared() does not maintain masked values. That is very problematic on its own if masked values are represented by actual numbers and not np.nan. I'll file an issue under pytensor project about this, too.

I'd be happy to contribute a PR for pm.ConstantData() fix + possibly a NotImplemented error for pm.MutableData() if this indeed cannot be solved in other ways. I'm new to pyMC code base and may be missing the big picture!

Reproduceable code example:

import pymc as pm
import numpy as np

real_X = np.random.default_rng().normal(size=1000)
Y = np.random.default_rng().normal(loc=3 * real_X, scale=0.1)
X = real_X.copy()
X[0:10] = np.nan
masked_X = np.ma.masked_where(np.isnan(X), X)

with pm.Model() as m:

    β = pm.Normal("β", 0, 1)
    σ = pm.Exponential("σ", 1)

    # This works
    # X = pm.Normal("X", 0, 1, observed = masked_X, dims='test')   

    #T his even fails to sample 
    X = pm.Normal("X", 0, 1, observed = pm.ConstantData("masked_X", masked_X))    

    pm.Normal("Y", pm.math.dot(X, β), σ, observed=Y) 

    pm.sample()

Error message:

No response

PyMC version information:

pymc 5.1.2 pytensor 2.10.1

Context for the issue:

The fact that it fails silently on some models is particularly concerning - it means some users may be using pyMC and getting wrong inference results.

ricardoV94 commented 1 year ago

In light of https://github.com/pymc-devs/pytensor/issues/258, it seems that MaskedArrays can't be currently supported via PyMC Data (Mutable or Constant), but arrays simply containing nan should be. In this case, we could at least support the following example:

import pymc as pm
import numpy as np
real_X = np.random.default_rng().normal(size=1000)
Y = np.random.default_rng().normal(loc=3 * real_X, scale=0.1)
X = real_X.copy()
X[0:10] = np.nan

with pm.Model() as m:
    β = pm.Normal("β", 0, 1)
    σ = pm.Exponential("σ", 1)
    X = pm.Normal("X", 0, 1, observed=pm.ConstantData("X_with_nans", X))
    pm.Normal("Y", pm.math.dot(X, β), σ, observed=Y)

m.compile_logp()(m.initial_point())  # array(nan)

By introspecting the values of X_with_nans

kamicollo commented 1 year ago

@ricardoV94 - it looks like your previous message got cut off. Based on what you explained in https://github.com/pymc-devs/pytensor/issues/258, this is a bit more complicated than I thought - appreciate you explaining it.

At the same time, the automatic imputation of missing values is quite a core concept to Bayesian workflows, and currently pyMC's support to that is a bit awkward. On the one hand, it's possible to leverage it by passing masked arrays directly to observed because the automatic imputation logic under Model.make_obs_var() relies on masking attribute to split the RV into observed and missing portions. On the other hand, all workflows that want to get predictions or otherwise "swap" data need to rely on pm.Data - so it's impossible to work both with automatic imputation & more modular code.

I am not familiar enough with the pyMC / pytensor internals to have a big picture, but could an interim solution be: 1) If data passed to pm.Data() is a masked array, convert it to an array with nan values before passing to pytensor, this way ensuring that masked values do not get unmasked downstream. 2) Update the code under Model.make_obs_var() that splits the RV into observed/missing portions to do the split based on nan values (or mask, if one exists).

In that case, automatic imputation would still be backward compatible while anyone who passes nan/masked arrays to pm.Data() and then passes it on to observed could also benefit from automatic imputation.

Appreciate it may break some implicit assumptions elsewhere, though.

ricardoV94 commented 1 year ago

@kamicollo that seems reasonable. Would you mind opening a PR for that?

kamicollo commented 1 year ago

Yes, I can have a go at that - may revert if I run into issues, as I see the current code relies a lot on subtensors, and I may need some help to figure out how exactly to leverage them.

pymc-devs / pymc