pymc-devs / pymc

Bayesian Modeling and Probabilistic Programming in Python
https://docs.pymc.io/
Other
8.63k stars 1.99k forks source link

QUESTION: Shape of the response variable is checked when getting predictions #6417

Closed tomicapretto closed 1 year ago

tomicapretto commented 1 year ago

Describe the issue:

I'm using pm.sample_posterior_predictive() with predictions=True to get draws from the posterior predictive distribution for out-of-sample data and it's not working. From what I understand from the traceback, there's a shape mismatch between the data for the predictors and the response. The problem is the responses are not observed for out-of-sample data, so it does not make sense to compare them.

Reproduceable code example:

import numpy as np
import pandas as pd
import pymc as pm
import pytensor as pt

rng = np.random.default_rng(1234)
x = rng.normal(size=50)
g = rng.choice(list("abcd"), size=50)
y = rng.normal(size=50)

groups, groups_idx = np.unique(g, return_inverse=True)
coords = {"groups": groups}

with pm.Model(coords=coords) as model:
    x_ = pm.MutableData("x", x)
    idxs = pm.MutableData("groups_idx", groups_idx)

    b0 = pm.Normal("b0", dims="groups")
    b1 = pm.Normal("b1", dims="groups")
    sigma = pm.HalfNormal("sigma")
    mu = b0[idxs] + b1[idxs] * x_
    pm.Normal("response", mu=mu, sigma=sigma, observed=y)

    idata = pm.sample(random_seed=1234)

with model:
    pm.set_data(
        {
            "x": np.random.normal(size=4),
            "groups_idx": np.array([0, 0, 1, 1]),
        }
    )

with model:
    pm.sample_posterior_predictive(
        idata,
        predictions=True,
        extend_inferencedata=True,
        random_seed=1234,
    )

Error message:

```shell Output exceeds the size limit. Open the full output data in a text editor --------------------------------------------------------------------------- ValueError Traceback (most recent call last) File ~/anaconda3/envs/ib_advanced_regression/lib/python3.10/site-packages/pytensor/compile/function/types.py:972, in Function.__call__(self, *args, **kwargs) 970 try: 971 outputs = ( --> 972 self.vm() 973 if output_subset is None 974 else self.vm(output_subset=output_subset) 975 ) 976 except Exception: File ~/anaconda3/envs/ib_advanced_regression/lib/python3.10/site-packages/pytensor/graph/op.py:542, in Op.make_py_thunk..rval(p, i, o, n, params) 538 @is_thunk_type 539 def rval( 540 p=p, i=node_input_storage, o=node_output_storage, n=node, params=None 541 ): --> 542 r = p(n, [x[0] for x in i], o) 543 for o in node.outputs: File ~/anaconda3/envs/ib_advanced_regression/lib/python3.10/site-packages/pytensor/tensor/random/op.py:379, in RandomVariable.perform(self, node, inputs, outputs) 377 rng_var_out[0] = rng --> 379 smpl_val = self.rng_fn(rng, *(args + [size])) 381 if ( 382 not isinstance(smpl_val, np.ndarray) 383 or str(smpl_val.dtype) != out_var.type.dtype 384 ): File ~/anaconda3/envs/ib_advanced_regression/lib/python3.10/site-packages/pytensor/tensor/random/op.py:164, in RandomVariable.rng_fn(self, rng, *args, **kwargs) 163 """Sample a numeric random variate.""" --> 164 return getattr(rng, self.name)(*args, **kwargs) File _generator.pyx:1133, in numpy.random._generator.Generator.normal() File _common.pyx:579, in numpy.random._common.cont() File _common.pyx:496, in numpy.random._common.cont_broadcast_2() File __init__.pxd:742, in numpy.PyArray_MultiIterNew3() ValueError: shape mismatch: objects cannot be broadcast to a single shape. Mismatch is between arg 0 with shape (50,) and arg 1 with shape (4,). During handling of the above exception, another exception occurred: ValueError Traceback (most recent call last) Cell In[12], line 2 1 with model: ----> 2 pm.sample_posterior_predictive( 3 idata, 4 predictions=True, 5 extend_inferencedata=True, 6 random_seed=1234, 7 ) File ~/anaconda3/envs/ib_advanced_regression/lib/python3.10/site-packages/pymc/sampling/forward.py:644, in sample_posterior_predictive(trace, model, var_names, sample_dims, random_seed, progressbar, return_inferencedata, extend_inferencedata, predictions, idata_kwargs, compile_kwargs) 639 # there's only a single chain, but the index might hit it multiple times if 640 # the number of indices is greater than the length of the trace. 641 else: 642 param = _trace[idx % len_trace] --> 644 values = sampler_fn(**param) 646 for k, v in zip(vars_, values): 647 ppc_trace_t.insert(k.name, v, idx) File ~/anaconda3/envs/ib_advanced_regression/lib/python3.10/site-packages/pymc/util.py:390, in point_wrapper..wrapped(**kwargs) 388 def wrapped(**kwargs): 389 input_point = {k: v for k, v in kwargs.items() if k in ins} --> 390 return core_function(**input_point) File ~/anaconda3/envs/ib_advanced_regression/lib/python3.10/site-packages/pytensor/compile/function/types.py:985, in Function.__call__(self, *args, **kwargs) 983 if hasattr(self.vm, "thunks"): 984 thunk = self.vm.thunks[self.vm.position_of_error] --> 985 raise_with_op( 986 self.maker.fgraph, 987 node=self.vm.nodes[self.vm.position_of_error], 988 thunk=thunk, 989 storage_map=getattr(self.vm, "storage_map", None), 990 ) 991 else: 992 # old-style linkers raise their own exceptions 993 raise File ~/anaconda3/envs/ib_advanced_regression/lib/python3.10/site-packages/pytensor/link/utils.py:536, in raise_with_op(fgraph, node, thunk, exc_info, storage_map) 531 warnings.warn( 532 f"{exc_type} error does not allow us to add an extra error message" 533 ) 534 # Some exception need extra parameter in inputs. So forget the 535 # extra long error message in that case. --> 536 raise exc_value.with_traceback(exc_trace) File ~/anaconda3/envs/ib_advanced_regression/lib/python3.10/site-packages/pytensor/compile/function/types.py:972, in Function.__call__(self, *args, **kwargs) 969 t0_fn = time.perf_counter() 970 try: 971 outputs = ( --> 972 self.vm() 973 if output_subset is None 974 else self.vm(output_subset=output_subset) ... Inputs values: [Generator(PCG64) at 0x7F6813D0E260, array([50]), array(11), array([ 0.42513669, 0.53299287, -0.34085399, -0.20724831]), array(1.1475582)] Outputs clients: [['output'], ['output']] HINT: Re-running with most PyTensor optimizations disabled could provide a back-trace showing when this node was created. This can be done by setting the PyTensor flag 'optimizer=fast_compile'. If that does not work, PyTensor optimizations can be disabled with 'optimizer=None'. HINT: Use the PyTensor flag `exception_verbosity=high` for a debug print-out and storage map footprint of this Apply node. ```

PyMC version information:

pt.__version__, pm.__version__
('2.8.11', '5.0.1')

It worked until PyMC 4.1.7

ricardoV94 commented 1 year ago

Try

pm.Normal("response", mu=mu, sigma=sigma, observed=y, shape=mu.shape)
tomicapretto commented 1 year ago

Now it works! But is this the way one expects to use this from now?

ricardoV94 commented 1 year ago

Yes, if you don't specify shape or dims explicitly it defaults to shape=observed.shape, in which case you would need to update observed to a dummy value with the correct shape.

It's mentioned in the docstring examples of set_data

tomicapretto commented 1 year ago

I think I understand the underlying reason, but I still think it's a little weird to behave like that at a high level.

When you compute predictions for new observations (there's a new set of predictor values), and given the posterior draws, the original observed values do not play any role in the prediction.

Is it expected this feature to keep the same behavior in the future? If that's the case, we could close this issue and I'll just adapt my code so it works.

ricardoV94 commented 1 year ago

Yes I think this will likely stay.

It's not that we are explicitly checking the shape, it's that the model variable that is first created is something like pm.Normal.dist(mu, shape=(10,)) where mu has shape==(10,). Later mu is changed to have shape==(4,), but nothing is done about the shape of the distribution, which results in an invalid graph.

You must tell PyMC that the shape is tied to the shape of the mu parameter. We can't always assume this because the following is also valid:

with pm.Model() as m:
  ...
  pm.Normal("llike", mu=[0], observed=[0, 0, 0])

The only safe assumption when nothing is told about shape and dims is that the shape of an observed variable must match the shape of its observations at creation time.

tomicapretto commented 1 year ago

Thanks @ricardoV94 !!