pymc-devs / pymc

Bayesian Modeling and Probabilistic Programming in Python
https://docs.pymc.io/

PyMC3 Modeling Show-stoppers #840

Closed fonnesbeck closed 6 years ago

fonnesbeck commented 9 years ago

In my attempts to use PyMC3 as a production environment for modeling, I frequently run into show-stopper issues that prevent me from seeing the project through to completion. While the API is becoming more stable and usable for models of low to moderate complexity, more specialized models are still troublesome (at least, for me). Before PyMC3 leaves beta, it would be nice to be able to overcome some of these. For example, not being able to specify vectors of multivariate variables (#535) is a common roadblock. I'm opening this issue mostly as a clearing house for modeling issues in PyMC3.

To seed this effort, here is a pattern that pops up repeatedly: mixing variables and constants (data). For example, arrays in which some elements are fixed and others are random, or multiplying a tensor by constants (shared variables).
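As a minimal sketch of what I mean (made-up shapes, nothing from the model below), a vector whose first element is fixed at zero while the rest are random might be built like this:

import theano.tensor as tt
from pymc3 import Model, Normal, Deterministic

with Model() as sketch:
    # two free elements; the first element of the full vector is fixed at 0
    free = Normal('free', mu=0, sd=1, shape=2)
    full = Deterministic('full', tt.concatenate([tt.zeros(1), free]))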

Here is a disease dynamics model that illustrates the problem:

from pymc3 import *
from pymc3.distributions import dist_math
import numpy as np
import theano.tensor as tt
from theano import shared
invlogit = tt.nnet.sigmoid

# data arrays (I_obs_t, age_index, y, FOI_mat, sia_susc, vacc_susc,
# age_slices, N_age) are assumed to be loaded elsewhere

with Model() as model:

    n_periods, n_groups = I_obs_t.shape

    mu = Normal('mu', mu=0, tau=0.0001, shape=n_groups)
    sig = HalfCauchy('sig', 25, shape=n_groups, testval=np.ones(n_groups))

    Tau = tt.diag(tt.pow(sig, -2)) 

    # Age-specific probabilities of confirmation as multivariate normal 
    # random variables
    beta_age = MvNormal('beta_age', mu=mu, tau=Tau, 
                        shape=n_groups)

    p_age = Deterministic('p_age', invlogit(beta_age))

    p_confirm = invlogit(beta_age[age_index])

    # Confirmation likelihood
    lab_confirmed = Bernoulli('lab_confirmed', p=p_confirm, observed=y)

    # Confirmed infecteds by age
    I_conf = Binomial('I_conf', I_obs_t, p_age, shape=I_obs_t.shape)

    I = I_conf.sum(1)

    ##################
    A = Normal('A', 4, 1)
    natural_susc = tt.exp((-1/A) * np.tril(FOI_mat).sum(0))[::-1]

    total_susc = shared(sia_susc) * shared(vacc_susc) * natural_susc

    p_susc_age = shared([total_susc[s].mean() for s in age_slices])
    ##################

    # Estimated total initial susceptibles
    S_0 = Binomial('S_0', n=shared(N_age.values), p=p_susc_age, shape=N_age.shape)

    # Susceptibles at each time step (subtract cumulative cases from initial S)
    S = S_0.sum() - I.cumsum()

    # Transmission parameter
    beta = Uniform('beta', 1, 100)         

    # Force of infection
    lam = beta * I * S / N_age.sum()

    # Likelihood of observed cases as function of FOI of prevous period
    like = Potential('like', 
            dist_math.logpow(lam[:-1], I[1:]) - dist_math.factln(I[1:]) - lam[:-1])

In the section highlighted by hashes, I'm combining data with deterministic transforms of tensors, then using the result as the binomial probability for S_0. This results in an issue with the Binomial stochastic:

/Users/fonnescj/Github/pymc3/pymc3/distributions/discrete.py in __init__(self, n, p, *args, **kwargs)
     37         self.n = n
     38         self.p = p
---> 39         self.mode = cast(round(n * p), self.dtype)
     40 
     41     def random(self, point=None, size=None, repeat=None):

TypeError: unsupported operand type(s) for *: 'TensorSharedVariable' and 'SharedVariable'

It does not like having to perform operations between a TensorSharedVariable and a SharedVariable. It's not clear why this is the case, or how to work around it.

More generally, it is never quite clear when to use shared variables and when to stick with NumPy ndarrays. For example, in the GHME example, the interpolate function happily operates on both ndarrays and PyMC3 variables. So, some of the problems that crop up (perhaps this one) are more an issue of providing adequate documentation and examples (or appropriate syntactic sugar to hide the Theano idiosyncrasies) than a lack of functionality.
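To make the ambiguity concrete, here is a minimal sketch (made-up data) of the two options, both of which typically work:

import numpy as np
import theano
import theano.tensor as tt
from pymc3 import Model, Normal

X = np.random.randn(100, 3)

with Model() as m:
    beta = Normal('beta', mu=0, sd=1, shape=3)

    # option 1: use the ndarray directly; Theano wraps it as a constant
    mu1 = tt.dot(X, beta)

    # option 2: wrap it in a shared variable, which can be swapped out later
    mu2 = tt.dot(theano.shared(X), beta)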

datnamer commented 9 years ago

+1 on numpy vs theano variables.

springcoil commented 8 years ago

I agree about the need for syntactic sugar and documentation. I also think we need things like this for the conversion of CamDavidsonPilon's book on Bayesian Methods from PyMC2 to PyMC3.

fonnesbeck commented 8 years ago

@springcoil that's probably a good litmus test on when we are ready to come out of beta -- when we can replicate Cam's book in PyMC 3.

springcoil commented 8 years ago

There is ongoing work but I think the project has stalled a bit.

fonnesbeck commented 8 years ago

Here is another show-stopper.

I have attempted to create a Bayes network using PyMC 3, but have hit a wall. Much of it comes down to the inscrutability of Theano and the awkwardness of having to express everything as Theano objects. This makes it impossible for non-expert users to flexibly build Bayesian models (even relatively simple ones like this). It is another project that I have been forced to drop back to PyMC 2.3 to complete.

datnamer commented 8 years ago

@fonnesbeck is obviating these difficulties even technically possible or feasible?

fonnesbeck commented 8 years ago

@datnamer This issue is mainly intended as a clearing house for examples of where modeling with PyMC3 hits roadblocks. Some of the problems will not be surmountable, to be sure, but where possible we need to wrap as much of the Theano that is user-facing in Python functions and classes. One of the major advantages of PyMC over other Bayesian modeling packages is that we get to build our models in Python; if it becomes necessary to be conversant in a domain-specific language in order to build non-trivial models, then we lose this advantage.

datnamer commented 8 years ago

@fonnesbeck, I completely agree. I just wasn't sure of the feasibility of doing so.

ghost commented 7 years ago

All my attempts to use PyMC3 for Bayesian logistic regression with a large number of features (i.e., independent variables) have failed. Theano always crashes with an out-of-memory error, even when limiting the number of features (no interactions) to 300 on a machine with 128GB of memory, not to mention the extremely long running times. I can provide sample code if anyone is interested.

fonnesbeck commented 7 years ago

@QuantScientist An example would be helpful, yes.

I have a toy example here that also dies, with the "nesting level exceeded maximum of 256" error that we have seen before (#624). Given that Theano is a framework used for deep learning, I'm confused by issues like this. It must be something that Patsy is doing, as this only happens with the glm module.

ghost commented 7 years ago

@fonnesbeck Thanks for the prompt reply. Indeed I already reported the nesting error which you reproduced in the Jupyter notebook. Even after decreasing the number of features to 255 (to overcome the nesting issue), I get an out of memory error thrown from Theano.

This is the code I am using:


import numpy as np
import pymc3 as pm
from scipy.special import expit

# s (model formula), X_df_train_SINGLE (training DataFrame), and last_cols
# (coefficient column names) are defined elsewhere

def fastPredict(new_observation, theta, intercept):
    # linear predictor pushed through the inverse logit
    v = intercept + np.einsum('j,j->', new_observation, theta)
    return expit(v)

def plot_traces(traces, retain=1000):
    summary = pm.df_summary(traces[-retain:])
    ax = pm.traceplot(traces[-retain:], figsize=(12, len(traces.varnames) * 1.5),
                      lines={k: v['mean'] for k, v in summary.iterrows()})
    for i, mn in enumerate(summary['mean']):
        ax[i, 0].annotate('{:.2f}'.format(mn), xy=(mn, 0), xycoords='data',
                          xytext=(5, 10), textcoords='offset points', rotation=90,
                          va='bottom', fontsize='large', color='#AA0022')

niter = 2000
with pm.Model() as logistic_model:
    pm.glm.glm(s, X_df_train_SINGLE, family=pm.glm.families.Binomial())
    step = pm.NUTS()
    trace_logistic_model = pm.sample(niter, step=step, progressbar=True)

# discard the first half of the trace as burn-in
df_trace_logistic_model = pm.trace_to_dataframe(trace_logistic_model[niter // 2:])
w_theta = df_trace_logistic_model[last_cols].mean(0)
w_intercept = df_trace_logistic_model['Intercept'].mean(0)
df_trace_logistic_model.to_csv("df_trace_logistic_model.csv")
w_theta.to_csv("w_theta.csv")

print(w_theta)
print(w_intercept)
plot_traces(trace_logistic_model, retain=100)

I am considering using PyStan, though I was unable to find a good example of logistic regression/binary classification.

Best,

fonnesbeck commented 7 years ago

Here is a toy example of a linear regression with 1000 parameters and 100 observations that runs easily, as it does not use the glm module.

In general, I would only use glm as a convenience function for small problems, and not for something large like this.
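The toy example is roughly along these lines (a sketch with simulated data, not the gist verbatim):

import numpy as np
import pymc3 as pm
import theano.tensor as tt

n, k = 100, 1000
X = np.random.randn(n, k)
y = X.dot(np.random.randn(k)) + np.random.randn(n)

with pm.Model() as big_regression:
    beta = pm.Normal('beta', mu=0, sd=10, shape=k)
    sigma = pm.HalfCauchy('sigma', 5)
    pm.Normal('y_obs', mu=tt.dot(X, beta), sd=sigma, observed=y)
    trace = pm.sample(1000)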

fonnesbeck commented 7 years ago

@twiecki we should perhaps consider warning users about this limitation in the docs until we sort it out.

ghost commented 7 years ago

@fonnesbeck thanks for the example. Maybe I am missing something, but what should I use instead of pymc3.glm.glm(s, X_df_train_SINGLE, family=pymc3.glm.families.Binomial())? I am doing logistic regression with a binary response variable, not linear regression. Thanks,

fonnesbeck commented 7 years ago

@QuantScientist I've added a logistic regression example to the gist here. You just need to use a binomial likelihood instead of Gaussian, and transform the mean to the probability scale.
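The change is small; here is a sketch of the idea (simulated stand-in data):

import numpy as np
import pymc3 as pm
import theano.tensor as tt

n, k = 100, 50
X = np.random.randn(n, k)
y = np.random.binomial(1, 0.5, size=n)   # stand-in binary response

with pm.Model() as logistic:
    intercept = pm.Normal('intercept', mu=0, sd=10)
    beta = pm.Normal('beta', mu=0, sd=10, shape=k)
    # inverse-logit transform puts the linear predictor on the probability scale
    p = tt.nnet.sigmoid(intercept + tt.dot(X, beta))
    pm.Bernoulli('y_obs', p=p, observed=y)
    trace = pm.sample(1000)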

ghost commented 7 years ago

@fonnesbeck This is really appreciated! I made progress thanks to your gist.

One question: in your example the coefficient names are b_1, b_2, ..., b_k, while I am interested in retaining the original feature names (e.g. the first row of the training-set DataFrame) so that I can later predict using those names. Refer to the attached picture. Is that possible? Also, why is k equal to the number of features in your example?

[attached image: big_logistic_regression]

fonnesbeck commented 7 years ago

I think of the data matrix as being n x k, where n is the number of observations and k the number of predictor variables (or features, as you ML folks call them). I just simulated 1000 fake variables to use in the model to illustrate that PyMC can fit really big regressions without choking.

ghost commented 7 years ago

@fonnesbeck thanks, I have resolved all the issues I had. The only problem is that the model performs poorly compared with classical logistic regression. I will post the full code for this example. I am participating in a Kaggle contest; my classical model currently has me around 30th place, but the score of the PyMC-based model is far below that. Once again, thanks.

fonnesbeck commented 7 years ago

I would be interested in seeing it. You should be able to replicate a classical analysis almost exactly using Bayesian methods, by using the appropriate priors.
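For example, with flat priors the posterior is proportional to the likelihood, so the MAP estimate should essentially reproduce the classical coefficients. A sketch (simulated stand-in data):

import numpy as np
import pymc3 as pm
import theano.tensor as tt

X = np.random.randn(100, 5)
y = np.random.binomial(1, 0.5, size=100)

with pm.Model() as flat_model:
    # flat priors make the posterior mode coincide with the MLE
    intercept = pm.Flat('intercept')
    beta = pm.Flat('beta', shape=5)
    p = tt.nnet.sigmoid(intercept + tt.dot(X, beta))
    pm.Bernoulli('y_obs', p=p, observed=y)
    map_est = pm.find_MAP()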

plandrem commented 7 years ago

Has there been any resolution to mixing constants and variables, as mentioned in the original post? I'm trying to construct a lower-triangular matrix with ones on the diagonal and random variables in the lower entries, but I can't find documentation or issues anywhere on this. Is it possible?
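What I have in mind is something like this sketch (I'm not sure set_subtensor is the right tool):

import numpy as np
import theano.tensor as tt
import pymc3 as pm

k = 3
with pm.Model() as tri_model:
    # random variables for the strictly lower-triangular entries
    vals = pm.Normal('vals', mu=0, sd=1, shape=k * (k - 1) // 2)
    rows, cols = np.tril_indices(k, -1)
    # ones on the diagonal, random variables below it
    L = tt.set_subtensor(tt.eye(k)[rows, cols], vals)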

fonnesbeck commented 6 years ago

No longer applicable, I think.