pymc-devs / pymc-bart

https://www.pymc.io/projects/bart

BART model save, reload and new predictions #123

Open twj8CDC opened 1 year ago

twj8CDC commented 1 year ago

Hi, I have been trying to save, reload and generate new predictions with a model that includes a BARTRV.

I am able to save the trace as a pickle (NetCDF works too), then instantiate a new model and get posterior predictions on the training data, but when I swap in new data I get shape errors. The shape errors are odd, since with the originally trained model I can update the data and generate predictions without any issues. It is only with the newly instantiated model that I am unable to update the input data.

Below is a minimal example:

from pathlib import Path

import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
import pymc_bart as pmb

import cloudpickle as cpkl
import dill 
print(f"Running on PyMC v{pm.__version__}")

print(f"Running on PyMC-BART v{pmb.__version__}")
try:
    bikes = pd.read_csv(Path("..", "data", "bikes.csv"))
except FileNotFoundError:
    bikes = pd.read_csv(pm.get_data("bikes.csv"))

features = ["hour", "temperature", "humidity", "workingday"]

X = bikes[features]
Y = bikes["count"]

xt = X[0:10]
yt = Y[0:10]
with pm.Model() as model_bikes:
    xdata = pm.MutableData("xdata", X)
    a = pm.Exponential("a", 1)
    mu_ = pmb.BART("mu_", xdata, np.log(Y), m=20)
    mu = pm.Deterministic("mu", pm.math.exp(mu_))
    y = pm.NegativeBinomial("y", mu=mu, alpha=a, observed=Y, shape=xdata.shape[0])
    idata_bikes = pm.sample(random_seed=99, draws=100, tune=100, compute_convergence_checks=False)
idata_bikes

Pickle the idata instead of NetCDF; this seems to work fine:

# pickle the InferenceData with cloudpickle
with open('test4.pkl', mode='wb') as file:
    cpkl.dump(idata_bikes, file)

with open("test4.pkl", mode="rb") as file:
    idata4 = cpkl.load(file)
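
For reference, the equivalent NetCDF round trip (a sketch, not from the original post) would be:

# save/load via ArviZ's NetCDF support instead of pickle
idata_bikes.to_netcdf("test4.nc")
idata4 = az.from_netcdf("test4.nc")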

Posterior predictions on updated data work with the original model, using both the original idata and the saved-and-reloaded idata:

with model_bikes:
    pm.set_data({"xdata": xt})
    post1 = pm.sample_posterior_predictive(idata_bikes, var_names=["mu", "y"])

with model_bikes:
    pm.set_data({"xdata": xt})
    post2 = pm.sample_posterior_predictive(idata4, var_names=["mu", "y"])
print(post1.posterior_predictive["mu"].values.mean((0,1)))
print(post2.posterior_predictive["mu"].values.mean((0,1)))

Restart the session to test loading from a clean slate, and reload the data from above:

from pathlib import Path

import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
import pymc_bart as pmb

import cloudpickle as cpkl
import dill 
print(f"Running on PyMC v{pm.__version__}")

print(f"Running on PyMC-BART v{pmb.__version__}")
try:
    bikes = pd.read_csv(Path("..", "data", "bikes.csv"))
except FileNotFoundError:
    bikes = pd.read_csv(pm.get_data("bikes.csv"))

features = ["hour", "temperature", "humidity", "workingday"]

X = bikes[features]
Y = bikes["count"]

xt = X[0:10]
yt = Y[0:10]

Specify the new model. The only difference is the variable names:

with pm.Model() as model2:
    xdata2 = pm.MutableData("xdata", X)
    a2 = pm.Exponential("a", 1)
    mu_2 = pmb.BART("mu_", xdata2, np.log(Y), m=50)
    mu2 = pm.Deterministic("mu", pm.math.exp(mu_2))
    y2 = pm.NegativeBinomial("y", mu=mu2, alpha=a2, observed=Y, shape=xdata2.shape[0])

Load the saved idata:

with open("test4.pkl", mode="rb") as file:
    idata4 = cpkl.load(file)

Get posterior predictions on the training data:

with model2:
    post5 = pm.sample_posterior_predictive(idata4, var_names=["mu", "y"])

This works, apart from a slight difference in predicted values, possibly due to a difference in random state? post5 compares well with post1 and post2 above.
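
As a sanity check (not part of the original post), one way to rule out the random state would be to pass the same explicit seed to both predictive calls, e.g.:

with model2:
    post5_seeded = pm.sample_posterior_predictive(
        idata4, var_names=["mu", "y"], random_seed=99  # same seed as the original sampling run
    )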

Get the posterior predictions with new data:

with model2:
    pm.set_data({"xdata": xt})
    post4 = pm.sample_posterior_predictive(idata4, var_names=["mu", "y"])

This fails with the following error

"name": "ValueError",
    "message": "size does not match the broadcast shape of the parameters. (10,), (10,), (348,)\nApply node that caused the error: nbinom_rv{0, (0, 0), int64, True}(RandomGeneratorSharedVariable(<Generator(PCG64) at 0x7F1CCA3F2A40>), MakeVector{dtype='int64'}.0, 4, a, Composite{...}.1)\nToposort index: 5\nInputs types: [RandomGeneratorType, TensorType(int64, shape=(1,)), TensorType(int64, shape=()), TensorType(float64, shape=()), TensorType(float64, shape=(None,))]\nInputs shapes: ['No shapes', (1,), (), (), (348,)]\nInputs strides: ['No strides', (8,), (), (), (8,)]\nInputs values: [Generator(PCG64) at 0x7F1CCA3F2A40, array([10]), array(4), array(1.50162583), 'not shown']\nOutputs clients: [['output'], ['output']]\n\nHINT: Re-running with most PyTensor optimizations disabled could provide a back-trace showing when this node was created. This can be done by setting the PyTensor flag 'optimizer=fast_compile'. If that does not work, PyTensor optimizations can be disabled with 'optimizer=None'.\nHINT: Use the PyTensor flag `exception_verbosity=high` for a debug print-out and storage map footprint of this Apply node.",
...

I can't figure out where this shape error comes from. The trained model specified at the top allows updating the data without issues, so I am not sure whether there is a general issue with the way the model is specified.

Is there a different process to saving and reloading a model with a BARTRV?

I also tried pickling the whole model, but that doesn't work because of the multiprocessing components in the BART object. I get a socket error when trying to reload the pickled object.

I have also posted this on the PyMC Discourse, as I was not sure where it makes the most sense to discuss this issue.

Thanks!

twj8CDC commented 1 year ago

Hi, just to follow up: I think I found a workaround for this. I was able to convert the all_trees object from the RV to a list and pickle it. That object does not have any of the multiprocessing/connection issues I see when the full BARTRV or the Model is pickled and loaded. I can then use the pmb.utils._sample_posterior(...) function to get draws from the tree structure, which appears to be how the partial dependence plot utility works.

all_trees = list(model_bikes.mu_.owner.op.all_trees)
with open('test_all_tree.pkl', mode='wb') as file:
    cpkl.dump(all_trees, file)

Restart session and reload modules

# assumed imports for the fresh session (pt and pmbu were not defined in the original snippet)
import pytensor.tensor as pt
import pymc_bart.utils as pmbu

with open('test_all_tree.pkl', mode='rb') as file:
    all_tree2 = cpkl.load(file)

rng = np.random.default_rng()
xt_2 = pt.constant(xt)
smpl = pmbu._sample_posterior(all_tree2, xt_2, rng, size=400, shape=1)
mu_smpl = smpl.mean(0)                # average over posterior draws
ex_mu = pm.math.exp(mu_smpl).eval()   # back-transform from the log scale

ex_mu appears to appropriately match the output of pm.sample_posterior_predictive(...) from the trained model with the updated covariate matrix.

For my actual project, I only care about the mu value output from the BARTRV, so I think this is an appropriate method for predicting mu with a new covariate matrix. If I needed the y value, I believe the approach could be extended by sampling from whatever the final distribution is with the mu value plugged in, as I have seen in other examples.
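
A minimal sketch of that extension (not from the original post): it assumes the likelihood is the NegativeBinomial from the model above and simply plugs the predicted mu and a posterior-mean alpha into it.

# draw y samples from the NegativeBinomial likelihood, using the predicted mu
# and the posterior mean of alpha from the reloaded idata (assumed available as idata4)
alpha_hat = idata4.posterior["a"].mean().item()
y_draws = pm.draw(pm.NegativeBinomial.dist(mu=ex_mu, alpha=alpha_hat), draws=100)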

If anyone can confirm that this is an appropriate way to save/load/predict from the BARTRV or provide me with an alternative method that would be great!

Thanks!

aloctavodia commented 1 year ago

Sorry for the late reply, I was very busy with other stuff (still am, actually), so I am just answering to say that the workaround seems fine to me. I will try to inspect this issue with more time next week and see if I can provide a better answer or a more general solution.

twj8CDC commented 1 year ago

Great! Thank you!

edoardoscarpaci commented 9 months ago

Are there any updates? I am using the library with my team and we are having a problem loading the model. We don't know how to reinsert the trees into the model; we tried model.owner.op.all_trees and BARTRV.all_trees. The solution from @twj8CDC is not applicable to us, because we are using other distributions on top of BART.
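
A speculative sketch of what reinserting the trees might look like (not confirmed anywhere in this thread; it assumes the BART op's all_trees attribute, read above via model_bikes.mu_.owner.op.all_trees, can simply be reassigned on a freshly specified model):

# re-specify the model exactly as at training time (names and settings must match)
with pm.Model() as model_new:
    xdata = pm.MutableData("xdata", X)
    a = pm.Exponential("a", 1)
    mu_ = pmb.BART("mu_", xdata, np.log(Y), m=20)
    mu = pm.Deterministic("mu", pm.math.exp(mu_))
    y = pm.NegativeBinomial("y", mu=mu, alpha=a, observed=Y, shape=xdata.shape[0])

# assumption: reassigning the unpickled tree list onto the op makes the
# tree-based utilities (e.g. partial dependence plots) usable again
model_new.mu_.owner.op.all_trees = all_tree2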

Thanks

jfhawkin commented 4 months ago

I have a similar issue. If I want to generate partial dependence plots (or anything else that relies on the all_trees object), I get ValueError: high <= 0 because the size of all_trees isn't being stored. I can confirm this is the issue because I ran the test model from the PyMC-BART GitHub repo in the same notebook without any trouble, since there the model object is generated in the same notebook (whereas in my case I am loading an nc trace file).

jdtuck commented 1 month ago

I am having the same issue as above; any solution would be appreciated, as we are building on top of the BART model as well.