QUESTION: Shape of the response variable is checked when getting predictions #6417

tomicapretto commented 1 year ago

Describe the issue:

I'm using pm.sample_posterior_predictive() with predictions=True to get draws from the posterior predictive distribution for out-of-sample data and it's not working. From what I understand from the traceback, there's a shape mismatch between the data for the predictors and the response. The problem is the responses are not observed for out-of-sample data, so it does not make sense to compare them.

Reproducible code example:

import numpy as np
import pandas as pd
import pymc as pm
import pytensor as pt

rng = np.random.default_rng(1234)
x = rng.normal(size=50)
g = rng.choice(list("abcd"), size=50)
y = rng.normal(size=50)

groups, groups_idx = np.unique(g, return_inverse=True)
coords = {"groups": groups}

with pm.Model(coords=coords) as model:
    x_ = pm.MutableData("x", x)
    idxs = pm.MutableData("groups_idx", groups_idx)

    b0 = pm.Normal("b0", dims="groups")
    b1 = pm.Normal("b1", dims="groups")
    sigma = pm.HalfNormal("sigma")
    mu = b0[idxs] + b1[idxs] * x_
    pm.Normal("response", mu=mu, sigma=sigma, observed=y)

    idata = pm.sample(random_seed=1234)

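with model:
    # Swap in the out-of-sample predictors (4 new rows instead of 50).
    pm.set_data(
        {
            "x": np.random.normal(size=4),
            "groups_idx": np.array([0, 0, 1, 1]),
        }
    )

with model:
    pm.sample_posterior_predictive(idata, predictions=True)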

Error message:

PyMC version information:

pt.__version__, pm.__version__
('2.8.11', '5.0.1')

It worked until PyMC 4.1.7

ricardoV94 commented 1 year ago


pm.Normal("response", mu=mu, sigma=sigma, observed=y, shape=mu.shape)

tomicapretto commented 1 year ago

Now it works! But is this how we're expected to use it from now on?

ricardoV94 commented 1 year ago

Yes. If you don't specify shape or dims explicitly, it defaults to shape=observed.shape, in which case you would need to update observed to a dummy value with the correct shape.

This is mentioned in the docstring examples of set_data.

tomicapretto commented 1 year ago

I think I understand the underlying reason, but I still find this behavior a little odd from a high-level point of view.

When you compute predictions for new observations (there's a new set of predictor values), and given the posterior draws, the original observed values do not play any role in the prediction.

Is this behavior expected to stay the same in the future? If so, we can close this issue and I'll just adapt my code so it works.

ricardoV94 commented 1 year ago

Yes I think this will likely stay.

It's not that we are explicitly checking the shape, it's that the model variable that is first created is something like pm.Normal.dist(mu, shape=(10,)) where mu has shape==(10,). Later mu is changed to have shape==(4,), but nothing is done about the shape of the distribution, which results in an invalid graph.
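As a rough NumPy analogy (this is not how PyMC builds its graph internally, and the variable names are made up for illustration), the same kind of conflict shows up when a draw is requested with a size that was frozen earlier while the mean has since changed length:

```python
import numpy as np

rng = np.random.default_rng(0)

# At "model creation" the mean has 10 entries and the size is frozen to match.
mu = np.zeros(10)
frozen_size = mu.shape  # (10,)
draws = rng.normal(loc=mu, size=frozen_size)

# Later the mean is replaced by 4 new entries, but the frozen size is unchanged.
mu = np.zeros(4)
try:
    rng.normal(loc=mu, size=frozen_size)
    mismatch_raised = False
except ValueError:
    # NumPy cannot fit a (4,) mean into a (10,) output.
    mismatch_raised = True

# Tying the size to the current mean removes the conflict.
tied_draws = rng.normal(loc=mu, size=mu.shape)
```

Passing shape=mu.shape in the PyMC model plays the role of the last line: the shape is recomputed from mu instead of being fixed at creation time.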

You must tell PyMC that the shape is tied to the shape of the mu parameter. We can't always assume this because the following is also valid:

with pm.Model() as m:
  pm.Normal("llike", mu=[0], observed=[0, 0, 0])

The only safe assumption when nothing is told about shape and dims is that the shape of an observed variable must match the shape of its observations at creation time.
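A small NumPy sketch of why the shape cannot simply be taken from mu: in the example above, a length-1 mean broadcasts against three observations, so the likelihood has one term per observation rather than one per element of mu. This only mirrors the broadcasting logic and does not use PyMC itself; the unit standard deviation is an assumption made for the illustration.

```python
import numpy as np

mu = np.array([0.0])                   # shape (1,)
observed = np.array([0.0, 0.0, 0.0])   # shape (3,)

# Broadcasting the parameter against the data gives the likelihood's shape.
likelihood_shape = np.broadcast_shapes(mu.shape, observed.shape)

# Elementwise log-density of a normal with sigma = 1, one term per observation.
logp = -0.5 * (observed - mu) ** 2 - 0.5 * np.log(2 * np.pi)
```

Here likelihood_shape is (3,) while mu.shape is (1,), so defaulting to shape=mu.shape would be wrong for this model.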

tomicapretto commented 1 year ago

Thanks @ricardoV94 !!