Open kamicollo opened 1 year ago
In light of https://github.com/pymc-devs/pytensor/issues/258, it seems that MaskedArrays can't currently be supported via PyMC Data (Mutable or Constant), but arrays simply containing nan should be. In this case, we could at least support the following example:
import pymc as pm
import numpy as np
real_X = np.random.default_rng().normal(size=1000)
Y = np.random.default_rng().normal(loc=3 * real_X, scale=0.1)
X = real_X.copy()
X[0:10] = np.nan
with pm.Model() as m:
    β = pm.Normal("β", 0, 1)
    σ = pm.Exponential("σ", 1)
    X = pm.Normal("X", 0, 1, observed=pm.ConstantData("X_with_nans", X))
    pm.Normal("Y", pm.math.dot(X, β), σ, observed=Y)
m.compile_logp()(m.initial_point()) # array(nan)
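The nan result can be illustrated without PyMC. The following is a minimal NumPy sketch, not the library's own computation: `normal_logp` is a hypothetical helper and the data values are made up. It shows that a single nan among the observations makes the summed log-density nan, even though every other term is finite.

```python
import numpy as np


def normal_logp(x, mu, sigma):
    """Elementwise log-density of Normal(mu, sigma) evaluated at x."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)


# Hypothetical observations with one missing entry stored as nan.
x = np.array([0.5, np.nan, -1.0])
pointwise = normal_logp(x, mu=0.0, sigma=1.0)

# One nan term is enough to make the total nan.
print(np.isnan(pointwise.sum()))  # True

# The terms for the non-missing observations are all finite.
print(np.isfinite(pointwise[~np.isnan(x)]).all())  # True
```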
By introspecting the values of X_with_nans
@ricardoV94 - it looks like your previous message got cut off. Based on what you explained in https://github.com/pymc-devs/pytensor/issues/258, this is a bit more complicated than I thought - appreciate you explaining it.
At the same time, the automatic imputation of missing values is quite a core concept in Bayesian workflows, and pyMC's current support for it is a bit awkward. On the one hand, it's possible to leverage it by passing masked arrays directly to observed, because the automatic imputation logic under Model.make_obs_var() relies on the mask attribute to split the RV into observed and missing portions. On the other hand, all workflows that want to get predictions or otherwise "swap" data need to rely on pm.Data, so it's impossible to combine automatic imputation with more modular code.
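A minimal sketch of how the mask can go missing, assuming only NumPy. `find_mask` is a hypothetical helper that mirrors a `getattr`-based mask lookup, and the plain array merely stands in for data that has lost its mask on the way through a data container. The actual PyMC and pytensor code paths are not reproduced here.

```python
import numpy as np


def find_mask(data):
    """Return the `mask` attribute of `data`, or None if it has none."""
    return getattr(data, "mask", None)


# Hypothetical data: the middle value is missing and stored as a sentinel.
masked = np.ma.masked_array([1.0, -999.0, 3.0], mask=[False, True, False])

# Plain ndarray holding the same numbers, with the mask stripped.
plain = np.asarray(masked)

print(find_mask(masked))  # boolean mask with the middle entry set
print(find_mask(plain))   # None
print(plain)              # the sentinel is now an ordinary value
```

Once the mask is gone, nothing distinguishes the sentinel from real data, which is why a nan-based representation is attractive as a fallback.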
I am not familiar enough with the pyMC / pytensor internals to have a big picture, but could an interim solution be:
1) If data passed to pm.Data() is a masked array, convert it to an array with nan values before passing it to pytensor, thereby ensuring that masked values do not get unmasked downstream.
2) Update the code under Model.make_obs_var() that splits the RV into observed/missing portions to do the split based on nan values (or the mask, if one exists).
In that case, automatic imputation would still be backward compatible, while anyone who passes nan/masked arrays to pm.Data() and then passes the result on to observed could also benefit from automatic imputation.
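The two proposed steps can be sketched in plain NumPy. This is only an illustration under stated assumptions: `to_nan_array` and `split_observed_missing` are hypothetical names, the data is one-dimensional, and none of this is the code inside pm.Data() or Model.make_obs_var().

```python
import numpy as np


def to_nan_array(data):
    """Step 1: replace masked entries with nan so they cannot be unmasked later."""
    if isinstance(data, np.ma.MaskedArray):
        return data.astype(float).filled(np.nan)
    return np.asarray(data, dtype=float)


def split_observed_missing(values):
    """Step 2: find observed values and missing positions from the nan pattern."""
    missing = np.isnan(values)
    return values[~missing], np.flatnonzero(missing)


# Hypothetical input: a masked array whose masked entry holds a sentinel.
raw = np.ma.masked_array([1.0, -999.0, 3.0], mask=[False, True, False])

values = to_nan_array(raw)
observed, missing_idx = split_observed_missing(values)

print(values)       # masked entry replaced by nan
print(observed)     # entries that remain observed
print(missing_idx)  # positions that would need imputation
```

Because the split depends only on nan, it treats a converted masked array and an array that already contained nan in the same way.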
Appreciate it may break some implicit assumptions elsewhere, though.
@kamicollo that seems reasonable. Would you mind opening a PR for that?
Yes, I can have a go at that. I may come back with questions if I run into issues, as I see the current code relies a lot on subtensors, and I may need some help to figure out how exactly to leverage them.
Describe the issue:
Automatic imputation fails silently in pyMC if a user passes partially observed data held in pm.ConstantData() or pm.MutableData() to the observed parameter of any distribution. In simple models, the user won't be able to sample (as the log-likelihood will evaluate to nan), but I have also been able to run more complex (GP) models that sampled, likely producing wrong results (see https://discourse.pymc.io/t/issue-imputing-data-for-gaussian-process-model/11626/3 for details).
Based on my initial review of the source code, it seems the culprit is the Model.make_obs_var() method, where the check on the passed data is performed with mask = getattr(data, "mask", None), which always returns None for tensors.
In the case of pm.ConstantData(), the fix appears to be quite simple: retrieve the mask with mask = getattr(data.value, "mask", None) instead. In the case of pm.MutableData(), however, the issue seems to be that pytensor.shared() does not maintain masked values. That is very problematic on its own if masked values are represented by actual numbers and not np.nan. I'll file an issue under the pytensor project about this, too.
I'd be happy to contribute a PR for the pm.ConstantData() fix, plus possibly a NotImplemented error for pm.MutableData() if this indeed cannot be solved in other ways. I'm new to the pyMC code base and may be missing the big picture!
Reproduceable code example:
Error message:
No response
PyMC version information:
pymc 5.1.2 pytensor 2.10.1
Context for the issue:
The fact that it fails silently on some models is particularly concerning - it means some users may be using pyMC and getting wrong inference results.