pymc-devs / pymc

Bayesian Modeling and Probabilistic Programming in Python
https://docs.pymc.io/

A request for a use case: PyMC3 making a system call to an external code for its generative model #1925

Closed iantmiller closed 5 years ago

iantmiller commented 7 years ago

In my research, my generative model is encapsulated in an external software code (I do not have access to the source). This code reads an input file and writes an output file. I have two choices: I can either run a design of experiments with this code, fit a surrogate model, and encode that surrogate model inside PyMC3's model definition, or, preferably, I can make a system call to this software code from within PyMC3.

To test whether this mechanism is possible, I have attempted to rework the simplest example from the PyMC3 tutorial so that the calculation of mu (a linear regression fit) is performed in an external function call.

The following code block is a modified version of the first example from the PyMC3 website. Instead of computing mu = alpha + beta[0]*X1 + beta[1]*X2 inside the model block, mu is the return value of a subroutine. This subroutine takes the currently sampled values of alpha and beta[i], builds an input file, makes a system call to dummy.py, and finally reads the output file created by dummy.py and returns the value of mu.

from pymc3 import *
import numpy as np
import matplotlib.pyplot as plt
import theano
import json
from subprocess import call

# X1, X2, and Y were generated from the first example on the PyMC3 tutorial website
X1 = [-1.0856306,   0.99734545,  0.2829785,  -1.50629471, -0.57860025,  1.65143654, -2.42667924, -0.42891263,  1.26593626, -0.8667404,  -0.67888615, -0.09470897,   1.49138963, -0.638902,   -0.44398196, -0.43435128,  2.20593008,  2.18678609,   1.0040539,   0.3861864,   0.73736858,  1.49073203, -0.93583387,  1.17582904,  -1.25388067, -0.6377515,   0.9071052,  -1.4286807,  -0.14006872, -0.8617549, -0.25561937, -2.79858911, -1.7715331,  -0.69987723,  0.92746243, -0.17363568,  0.00284592,  0.68822271, -0.87953634,  0.28362732, -0.80536652, -1.72766949,  -0.39089979,  0.57380586,  0.33858905, -0.01183049,  2.39236527,  0.41291216,  0.97873601,  2.23814334, -1.29408532, -1.03878821,  1.74371223, -0.79806274,  0.02968323,  1.06931597,  0.89070639,  1.75488618,  1.49564414,  1.06939267, -0.77270871,  0.79486267,  0.31427199, -1.32626546,  1.41729905,  0.80723653,  0.04549008, -0.23309206, -1.19830114,  0.19952407,  0.46843912, -0.83115498,  1.16220405, -1.09720305, -2.12310035,  1.03972709, -0.40336604, -0.12602959, -0.83751672, -1.60596276,  1.25523737, -0.68886898,  1.66095249,  0.80730819, -0.31475815, -1.0859024,  -0.73246199, -1.21252313,  2.08711336,  0.16444123,  1.15020554, -1.26735205,  0.18103513,  1.17786194, -0.33501076,  1.03111446, -1.08456791, -1.36347154,  0.37940061, -0.37917643];
X2 = [0.12841094, -0.39557759,  0.14245293,  0.51966079, -0.0049252, 0.00682843,  0.0359099,  -0.37239514,  0.08522933, -0.32108195, -0.08553592,  0.24857391, -0.14704339,  0.1002498,   0.20254781,  0.05574817, -0.27418969, -0.06649506,  0.39188227, -0.40500915, -0.0551572,  -0.11042161,  0.02414947,  0.14964312,  0.32173819, -0.05404648,  0.16246827,  0.09994803,  0.09486946, -0.11278479, -0.19946429, -0.22000862, -0.15128744,  0.06433732,  0.15218988,  0.06469377, -0.10979102,  0.36119402,  0.30377312, -0.07080002, -0.16468628,  0.02604299,  0.25345973,  0.066553,    0.11130974, -0.04241602,  0.09125418,  0.30890889, -0.04793376,  0.02866155,  0.0507633,   0.05674507, -0.28237778, -0.37537373, -0.20393101,  0.03358846,  0.11077123, -0.10613491,  0.2754515,  -0.02863519,  0.0040632, -0.03879277,  0.02680536,  0.14089481,  0.13313069, -0.17968459,  0.30473276, -0.21900529,  0.0158454,  -0.05487931, -0.20979834, -0.01502412, -0.14816275,  0.01458145,  0.08061719,  0.29438587,  0.06147684, -0.12224507, -0.07832396,  0.02799562,  0.01869217,  0.29191785,  0.27907059, -0.07178719, -0.10972843, -0.51141092, -0.10978408, -0.19561154, -0.07096489,  0.07831685,  0.03543847, -0.0059936,   0.03991642, -0.02522355,  0.03940379, -0.646211, -0.0538587,  -0.02217014, -0.06825234, -0.04358925];
Y  = np.array([9.38706859e-01,   4.10296149e-01,3.83981292e+00,   1.48115418e+00, 4.02779506e-01,   2.46184530e+00,  -1.42342679e+00,  -1.27520755e+00,   2.38380704e+00,  -3.90761758e-01,   6.86815665e-01,   2.10641559e+00,   1.84890360e+00,  -8.04359754e-01,   3.93284941e-01,   2.31721220e+00,   3.41651416e+00,   3.39016804e+00,   2.22246532e+00,   3.77308673e-01,  3.43806883e-01,   1.66274112e+00,  -1.20663529e-01,   2.18829692e+00,   1.50706675e+00,  -1.19159361e+00,   1.44784359e+00,  -1.55349860e+00,  -1.40248284e-01,  -1.96609652e-02,  -1.35472064e+00,  -1.59474188e+00, -1.39656749e+00,   5.29754386e-01,   2.63051387e+00,   5.53932221e-01,   1.76084808e+00,   2.39686504e+00,   1.47396672e+00,   9.07514885e-01,   7.37921664e-02,  -3.82899347e-01,   1.49271947e+00,   7.65880501e-01,   2.05273917e+00,   5.63172455e-01,   4.25098874e+00,   3.26909416e-02,   3.93785393e-01,   3.67324277e+00,   1.69575050e+00,   9.38133214e-01,   1.35531685e+00,  -2.42854948e+00,   1.26254192e+00,   2.07270390e+00,   2.75833869e+00,   2.60484762e+00,   3.21391580e+00,   4.95643013e+00,   2.31319324e-01,   1.53863552e+00,   1.25983672e+00,  -5.57565140e-01,   3.74025866e+00,   1.00427073e+00,   2.44326467e+00,   5.03997740e-01,   1.06029822e+00,   1.48250538e+00,  -2.69441500e-01,  -1.19520306e+00, 3.20016631e+00, -6.69460228e-01, -2.24215995e+00,   2.10607318e+00,   2.01495136e+00,  -8.51855254e-01,  -8.99821832e-01,  -1.20278122e+00,   1.05077792e+00,  -1.43401688e-01,   1.84052098e+00,   1.16665281e+00,   5.60119574e-02,  -2.04696786e+00,  -1.66062003e+00,   5.51783963e-01,   1.58062230e+00,   1.63826706e+00,   1.16403511e+00,   3.85980814e-01,   2.23665854e+00,   1.23718947e+00,  -1.16021703e+00,   1.11137427e+00,   1.65658589e+00,  -3.20236525e-03,   1.36931418e+00,   1.33161104e+00]);

def proc_test(alpha, beta1, beta2, X1, X2):

    #### write an input file
    params = {}
    params['X1'] = X1
    params['X2'] = X2

    # These are the lines of code that I need help with.
    # As written below, objects of the Normal class are being stored in the dictionary,
    # but I need to store the current sampled value of each distribution.

    # Do I need to pull these values from a Backend or Trace object?

    params['alpha'] = alpha
    params['beta1'] = beta1
    params['beta2'] = beta2

    ###

    with open("input.txt", 'w') as outfile:
        json.dump(params, outfile)

    #### make a system call (the command must be a list, or shell=True must be set)
    call(["python", "dummy.py", "input.txt"])

    #### read the output file and return the value
    mu = np.loadtxt("output.txt")

    return mu

# the model
with Model() as model:  # model specifications in PyMC3 are wrapped in a with-statement

    # Priors for unknown model parameters
    alpha = Normal('alpha', mu=0, sd=10)
    beta1 = Normal('beta1', mu=0, sd=10)
    beta2 = Normal('beta2', mu=0, sd=10)
    sigma = HalfNormal('sigma', sd=1)

    # Expected value of outcome
    mu = proc_test(alpha, beta1, beta2, X1, X2)

    # Likelihood (sampling distribution) of observations
    Y_obs = Normal('Y_obs', mu=mu, sd=sigma, observed=Y)

    # Inference!
    start = find_MAP()          # Find starting value by optimization
    step = NUTS(scaling=start)  # Instantiate MCMC sampling algorithm
    trace = sample(2000, step, start=start)  # draw 2000 posterior samples using NUTS

# plot results
plt.figure(figsize=(12, 12))
summary(trace)
traceplot(trace)

The next code block is the file dummy.py. This code reads its input file, calculates mu = alpha + beta[0]*X1 + beta[1]*X2, and writes mu to an output file.

# file: dummy.py
import sys
import numpy as np
import json

def main(argv):

    # the input file is provided on the command line
    input_file = argv[0]

    # read in the input file
    with open(input_file) as json_data:
        params = json.load(json_data)

    X1 = np.asarray(params["X1"], dtype=float)
    X2 = np.asarray(params["X2"], dtype=float)

    # the generative model
    mu = float(params["alpha"]) + float(params["beta1"])*X1 + float(params["beta2"])*X2

    # write the results to an output file
    np.savetxt("output.txt", mu)

if __name__ == "__main__":
    main(sys.argv[1:])

My issue is that I don't know enough about the core data structures PyMC3 uses, so I don't know how to extract the current iteration's sampled values of alpha and beta[i] so that they can be written to an input file.

thanks much, -ian

twiecki commented 7 years ago

@iantmiller https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet#code

hvasbath commented 7 years ago

You have to write a Theano wrapper Op around the program you want to call. Then you can simply use the random variables as-is.

You can find an example of how to do that in my beat repository (https://github.com/hvasbath/beat/tree/master/src): theanof.py l.79-166 defines such an operator; it is then called in models.py l.343 and yields a function that takes the random variables from pymc3 as inputs (models.py l.414).

I am sorry for the complexity of the example, but unfortunately I don't think it can be made much simpler. You don't need to know what goes on inside the functions, just the principle. I also recommend reading the Theano tutorial section on how to extend Theano.

I hope this helps to some extent. A skeleton of such an Op is sketched below.
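In outline, such a wrapper Op looks roughly like this: a minimal sketch of the class-based approach, not the actual beat code. run_external is a hypothetical helper standing in for "write the input file, call the program, read the output back".

import numpy as np
import theano
import theano.tensor as tt

def run_external(alpha, beta1, beta2):
    # hypothetical stand-in for the external call: in real use this would
    # write input.txt, invoke the external program, and load output.txt;
    # here it just evaluates dummy.py's formula so the sketch is self-contained
    X1 = np.linspace(-1.0, 1.0, 100)
    X2 = np.linspace(0.0, 0.5, 100)
    return alpha + beta1 * X1 + beta2 * X2

class ExternalModelOp(theano.Op):
    """Wraps the external code so it can sit inside a Theano graph."""
    itypes = [tt.dscalar, tt.dscalar, tt.dscalar]  # alpha, beta1, beta2
    otypes = [tt.dvector]                          # mu

    def perform(self, node, inputs, output_storage):
        alpha, beta1, beta2 = inputs
        mu = run_external(float(alpha), float(beta1), float(beta2))
        output_storage[0][0] = np.asarray(mu, dtype=np.float64)

Inside the model block this would be used as mu = ExternalModelOp()(alpha, beta1, beta2); the PyMC3 random variables pass through as ordinary Theano inputs.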

Argantonio65 commented 7 years ago

I have exactly the same issue: proprietary software that I can only run through a command call. I have been trying to work around it, and here is a working version of your code with a Theano op:

from pymc3 import *
import numpy as np
import matplotlib.pyplot as plt
import theano.tensor as t
import theano
import json
import os
# X1, X2, and Y were generated from the first example on the PyMC3 tutorial website
X1 = [-1.0856306,   0.99734545,  0.2829785,  -1.50629471, -0.57860025,  1.65143654, -2.42667924, -0.42891263,  1.26593626, -0.8667404,  -0.67888615, -0.09470897,   1.49138963, -0.638902,   -0.44398196, -0.43435128,  2.20593008,  2.18678609,   1.0040539,   0.3861864,   0.73736858,  1.49073203, -0.93583387,  1.17582904,  -1.25388067, -0.6377515,   0.9071052,  -1.4286807,  -0.14006872, -0.8617549, -0.25561937, -2.79858911, -1.7715331,  -0.69987723,  0.92746243, -0.17363568,  0.00284592,  0.68822271, -0.87953634,  0.28362732, -0.80536652, -1.72766949,  -0.39089979,  0.57380586,  0.33858905, -0.01183049,  2.39236527,  0.41291216,  0.97873601,  2.23814334, -1.29408532, -1.03878821,  1.74371223, -0.79806274,  0.02968323,  1.06931597,  0.89070639,  1.75488618,  1.49564414,  1.06939267, -0.77270871,  0.79486267,  0.31427199, -1.32626546,  1.41729905,  0.80723653,  0.04549008, -0.23309206, -1.19830114,  0.19952407,  0.46843912, -0.83115498,  1.16220405, -1.09720305, -2.12310035,  1.03972709, -0.40336604, -0.12602959, -0.83751672, -1.60596276,  1.25523737, -0.68886898,  1.66095249,  0.80730819, -0.31475815, -1.0859024,  -0.73246199, -1.21252313,  2.08711336,  0.16444123,  1.15020554, -1.26735205,  0.18103513,  1.17786194, -0.33501076,  1.03111446, -1.08456791, -1.36347154,  0.37940061, -0.37917643];
X2 = [0.12841094, -0.39557759,  0.14245293,  0.51966079, -0.0049252, 0.00682843,  0.0359099,  -0.37239514,  0.08522933, -0.32108195, -0.08553592,  0.24857391, -0.14704339,  0.1002498,   0.20254781,  0.05574817, -0.27418969, -0.06649506,  0.39188227, -0.40500915, -0.0551572,  -0.11042161,  0.02414947,  0.14964312,  0.32173819, -0.05404648,  0.16246827,  0.09994803,  0.09486946, -0.11278479, -0.19946429, -0.22000862, -0.15128744,  0.06433732,  0.15218988,  0.06469377, -0.10979102,  0.36119402,  0.30377312, -0.07080002, -0.16468628,  0.02604299,  0.25345973,  0.066553,    0.11130974, -0.04241602,  0.09125418,  0.30890889, -0.04793376,  0.02866155,  0.0507633,   0.05674507, -0.28237778, -0.37537373, -0.20393101,  0.03358846,  0.11077123, -0.10613491,  0.2754515,  -0.02863519,  0.0040632, -0.03879277,  0.02680536,  0.14089481,  0.13313069, -0.17968459,  0.30473276, -0.21900529,  0.0158454,  -0.05487931, -0.20979834, -0.01502412, -0.14816275,  0.01458145,  0.08061719,  0.29438587,  0.06147684, -0.12224507, -0.07832396,  0.02799562,  0.01869217,  0.29191785,  0.27907059, -0.07178719, -0.10972843, -0.51141092, -0.10978408, -0.19561154, -0.07096489,  0.07831685,  0.03543847, -0.0059936,   0.03991642, -0.02522355,  0.03940379, -0.646211, -0.0538587,  -0.02217014, -0.06825234, -0.04358925];
Y  = np.array([9.38706859e-01,   4.10296149e-01,3.83981292e+00,   1.48115418e+00, 4.02779506e-01,   2.46184530e+00,  -1.42342679e+00,  -1.27520755e+00,   2.38380704e+00,  -3.90761758e-01,   6.86815665e-01,   2.10641559e+00,   1.84890360e+00,  -8.04359754e-01,   3.93284941e-01,   2.31721220e+00,   3.41651416e+00,   3.39016804e+00,   2.22246532e+00,   3.77308673e-01,  3.43806883e-01,   1.66274112e+00,  -1.20663529e-01,   2.18829692e+00,   1.50706675e+00,  -1.19159361e+00,   1.44784359e+00,  -1.55349860e+00,  -1.40248284e-01,  -1.96609652e-02,  -1.35472064e+00,  -1.59474188e+00, -1.39656749e+00,   5.29754386e-01,   2.63051387e+00,   5.53932221e-01,   1.76084808e+00,   2.39686504e+00,   1.47396672e+00,   9.07514885e-01,   7.37921664e-02,  -3.82899347e-01,   1.49271947e+00,   7.65880501e-01,   2.05273917e+00,   5.63172455e-01,   4.25098874e+00,   3.26909416e-02,   3.93785393e-01,   3.67324277e+00,   1.69575050e+00,   9.38133214e-01,   1.35531685e+00,  -2.42854948e+00,   1.26254192e+00,   2.07270390e+00,   2.75833869e+00,   2.60484762e+00,   3.21391580e+00,   4.95643013e+00,   2.31319324e-01,   1.53863552e+00,   1.25983672e+00,  -5.57565140e-01,   3.74025866e+00,   1.00427073e+00,   2.44326467e+00,   5.03997740e-01,   1.06029822e+00,   1.48250538e+00,  -2.69441500e-01,  -1.19520306e+00, 3.20016631e+00, -6.69460228e-01, -2.24215995e+00,   2.10607318e+00,   2.01495136e+00,  -8.51855254e-01,  -8.99821832e-01,  -1.20278122e+00,   1.05077792e+00,  -1.43401688e-01,   1.84052098e+00,   1.16665281e+00,   5.60119574e-02,  -2.04696786e+00,  -1.66062003e+00,   5.51783963e-01,   1.58062230e+00,   1.63826706e+00,   1.16403511e+00,   3.85980814e-01,   2.23665854e+00,   1.23718947e+00,  -1.16021703e+00,   1.11137427e+00,   1.65658589e+00,  -3.20236525e-03,   1.36931418e+00,   1.33161104e+00]);

i = 0

@theano.compile.ops.as_op(itypes=[t.dscalar, t.dscalar, t.dscalar], otypes=[t.dvector])
def proc_test(alpha, beta1, beta2):
    #### write an input file
    global X1, X2, i

    params = {}
    params['X1'] = X1
    params['X2'] = X2

    params['alpha'] = float(alpha)
    params['beta1'] = float(beta1)
    params['beta2'] = float(beta2)

    ###
    with open("input.txt", 'w') as outfile:
        json.dump(params, outfile)

    #### make a system call
    os.system("python dummy.py input.txt")

    # Report the number of function evaluations every 100 calls (not necessary, just checking)
    i += 1
    if i % 100 == 0:
        print(" number of evaluations {}".format(i))

    #### read the output file and return the value
    mu = np.loadtxt("output.txt")

    return np.array(mu)

# the model
with Model() as model:  # model specifications in PyMC3 are wrapped in a with-statement

    # Priors for unknown model parameters
    alpha = Normal('alpha', mu=0, sd=10)
    beta1 = Normal('beta1', mu=0, sd=10)
    beta2 = Normal('beta2', mu=0, sd=10)
    sigma = HalfNormal('sigma', sd=1)

    # Expected value of outcome
    mu = proc_test(alpha, beta1, beta2)

    # Likelihood (sampling distribution) of observations
    Y_obs = Normal('Y_obs', mu=mu, sd=sigma, observed=Y)

    # Inference!
    # start = find_MAP()  # disabled: MAP optimization needs gradients, which this op lacks
    step = Metropolis()
    trace = sample(100, step)

# plot results
plt.figure(figsize=(12, 12))
summary(trace)
traceplot(trace)

This implementation has the drawback of not providing a gradient. Thus you cannot use MAP estimation or samplers that require gradients (e.g. NUTS); I use Metropolis here.

Perhaps someone with deeper knowledge of pymc3 could help: if your function is called as an external command and its outputs are read from a file, is there a way to implement a gradient calculation for the Theano op?

I wonder why this is not a recurring issue in pymc3 (I have seen practically no questions or answers about it). I imagine some people would want to do inference on external models they do not own, which was not an issue in pymc2, but migrating to pymc3 is not trivial.

junpenglao commented 7 years ago

Perhaps someone with deeper knowledge of pymc3 could help: if your function is called as an external command and its outputs are read from a file, is there a way to implement a gradient calculation for the Theano op?

If the generation procedure is a complete black box, I guess the only option may be a stochastic gradient estimate, but that would still be very inefficient.

I wonder why this is not a recurring issue in pymc3 (I have seen practically no questions or answers about it). I imagine some people would want to do inference on external models they do not own, which was not an issue in pymc2, but migrating to pymc3 is not trivial.

I think any large model (with lots of parameters) is unlikely to work in PyMC2 either, as Metropolis will not be able to explore the full parameter space. One approach worth trying is the newly implemented SMC (thanks to @hvasbath); otherwise you can try an Approximate Bayesian Computation approach such as elfi.

hvasbath commented 7 years ago

I can only repeat what @junpenglao said. I had the same problem as you, @Argantonio65, which is why I started implementing that code. However, if your external program is fast and the forward model is not expensive, you can still implement the grad method of your Op and get gradients: repeatedly call your external program, keeping the other parameters fixed while perturbing the current one slightly, i.e. the typical numerical gradient calculation (a sketch follows below). But you would need to use the proper Op class instead of the as_op above, as as_op does not support gradients. However, if your external program takes a long time, it is far more efficient to simply use SMC, especially if you have many model parameters.
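To make the numerical-gradient idea concrete, here is a minimal sketch (my illustration, not beat's actual code): central finite differences around each parameter, wrapped in a second Op so that grad can return symbolic expressions. run_external is again a hypothetical stand-in for writing the input file, calling the program, and reading the output back.

import numpy as np
import theano
import theano.tensor as tt

def run_external(alpha, beta1, beta2):
    # hypothetical stand-in for the external call (see the earlier sketch)
    X1 = np.linspace(-1.0, 1.0, 100)
    X2 = np.linspace(0.0, 0.5, 100)
    return alpha + beta1 * X1 + beta2 * X2

class ExternalModelGradOp(theano.Op):
    """d(cost)/d(param) by central finite differences: two extra
    external calls per parameter for every gradient evaluation."""
    itypes = [tt.dscalar, tt.dscalar, tt.dscalar, tt.dvector]  # params + upstream grad
    otypes = [tt.dscalar, tt.dscalar, tt.dscalar]
    eps = 1e-5

    def perform(self, node, inputs, output_storage):
        params, g = [float(p) for p in inputs[:-1]], inputs[-1]
        for k in range(len(params)):
            up = list(params); up[k] += self.eps
            dn = list(params); dn[k] -= self.eps
            dmu_dpk = (run_external(*up) - run_external(*dn)) / (2.0 * self.eps)
            # chain rule: contract the Jacobian column with the upstream gradient
            output_storage[k][0] = np.asarray(np.sum(g * dmu_dpk))

class ExternalModelOp(theano.Op):
    """Forward pass plus a numerically estimated gradient."""
    itypes = [tt.dscalar, tt.dscalar, tt.dscalar]
    otypes = [tt.dvector]

    def perform(self, node, inputs, output_storage):
        mu = run_external(*[float(p) for p in inputs])
        output_storage[0][0] = np.asarray(mu, dtype=np.float64)

    def grad(self, inputs, output_grads):
        # returns one scalar gradient expression per input
        return ExternalModelGradOp()(*(list(inputs) + [output_grads[0]]))

With a grad method in place, gradient-based machinery such as find_MAP or NUTS can at least run, though every leapfrog step multiplies the number of external calls, which is why SMC is usually the better choice for slow forward models.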

Argantonio65 commented 7 years ago

Thanks @hvasbath @junpenglao. I will try your implementation of SMC (my likelihood evaluates quite slowly, so there is no point in trying the gradient version).

hvasbath commented 7 years ago

Hey @Argantonio65, any updates on this? We are curious how SMC performs on many different models, as it is still a little experimental. Although we now have a few reports that it seems to work well, we would like to know how it is working for your case.

Argantonio65 commented 7 years ago

Hello @hvasbath, my likelihood was quite slow to evaluate (~2-3 min for a simple version of the model), so I decided to emulate the dynamic model (purely data-based) and then run inference on the surrogate. My dynamics are quite smooth in the parameters, so for dimension < 10 the emulator gave very good results. With the surrogate evaluating in 1-2 seconds, the sampling-efficiency problem was practically gone, so I stopped experimenting with more efficient samplers.

It is nice that SMC performs well. If I manage to get some free time with the model (I cannot access it for at least 2 weeks), I might try SMC.

I see there is a usage example in SMC2_gaussians. Do you have an example in which the likelihood is based on observations? I understood that the likelihood should be defined as a random variable, as in llk = pm.Potential('llk', two_gaussians(X)).

junpenglao commented 7 years ago

I have an example here fitting a mixed-effects model with different samplers: https://github.com/junpenglao/GLMM-in-Python/blob/3f1068dce88be6a1eeeda2654b8e0c7c1638d611/pymc3_different_sampling.py#L87-L107
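In sketch form, an observation-based likelihood expressed as a Potential might look like this (a minimal hypothetical illustration, not code from the linked file; mu would come from the wrapped external model and Y is the observed data):

import numpy as np
import pymc3 as pm
import theano.tensor as tt

Y = np.random.randn(100)  # stand-in for the real observations

with pm.Model() as model:
    alpha = pm.Normal('alpha', mu=0, sd=10)
    sigma = pm.HalfNormal('sigma', sd=1)
    mu = alpha  # stand-in for the external model's output
    # Gaussian log-likelihood of Y, written out with constants dropped and
    # attached to the model as a Potential instead of an observed RV
    llk = pm.Potential('llk',
                       -tt.sum(tt.log(sigma) + 0.5 * ((Y - mu) / sigma) ** 2))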

dreamqin68 commented 7 years ago

What is 'SMC'?

hvasbath commented 7 years ago

Sequential Monte Carlo, a sampler among the step methods that you can try for your model. A minimal usage sketch follows.
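The entry point has moved between PyMC3 releases (in the 3.1-era it lived under pymc3.step_methods.smc, while later 3.x releases expose pm.sample_smc), so check the docs for your version. A minimal sketch with the later API:

import pymc3 as pm

with pm.Model() as model:
    # priors, the external-model Op, and the likelihood would go here
    alpha = pm.Normal('alpha', mu=0, sd=10)
    pm.Normal('obs', mu=alpha, sd=1, observed=[0.1, -0.3, 0.2])
    # SMC is gradient-free, so it works with a black-box Op that has no grad
    trace = pm.sample_smc(2000)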