ijmbarr / causalgraphicalmodels

Causal Graphical Models in Python
MIT License

Compatibility with pgmpy #1

Closed mrklees closed 5 years ago

mrklees commented 5 years ago

Hi @ijmbarr,

I love the work that you've been doing in this repo! I've been working with the folks over at pgmpy to try to implement Causal Models on top of their implementation of Bayesian Networks. pgmpy provides rich support for many types of probabilistic graphical models on top of networkx, but has been lacking this crucial application of PGMs. That said, seeing how well you've implemented these features, I would love to get your thoughts on migrating them into pgmpy's API! I'm happy to take the lead on this, but I want to make sure that you get proper credit, and I hope to continue collaborating with you to add more features.

ijmbarr commented 5 years ago

Hi @mrklees

That sounds like an interesting idea. I'd be happy to help where I can.

I'm not too familiar with pgmpy, but I'll take a look into it.

Do you have a rough outline of the goals of the project, or any work in progress I could take a look at?

Iain

mrklees commented 5 years ago

Awesome! The relevant code is available in my fork of the repository. You can see some examples of its use in this Colab Notebook. It actually didn't take me too long to fold in your code for backdoor adjustment, and I imagine it won't take too long for me to get frontdoor in there as well and confirm it's working. You'll see a few differences in structure, because pgmpy separates models from inference. I added a few optimizations, which I had found in my previous implementation. I haven't been diligent about adding issues, but I will try to get those up to be a little more organized.

My primary goal at this point is to round out the API so that given a query it will compute the average treatment effect. At this point we've got the estimand, but I want to get a step beyond that. At that point, I'm hopeful that we will have something that could go into a release. After that, I think we have a bunch of interesting options including:

ijmbarr commented 5 years ago

Thanks - I'll take a look into it and let you know what I think.

ijmbarr commented 5 years ago

Hey

I've only had a quick look at things, and I'm still not entirely sure I have a good enough overview of how pgmpy is designed - so please disregard these comments if they don't make sense. I'm also happy to move them over to the discussion at the pgmpy repo if that helps.

Comments/Thoughts:

  1. I find that it helps my understanding of the causal inference process to separate out the kinds of questions being asked. A nice separation is what Pearl presents as his "inference engine" in http://ftp.cs.ucla.edu/pub/stat_ser/r481.pdf. Specifically, the separation between:
    • Representing assumptions
    • Causal Identification/Calculating the Causal Estimand
    • Using the Estimand + Data to actually compute the causal estimates

My CausalGraphicalModels package and, I think, your current extension of pgmpy aim to tackle the first two of these points. As such, I'm not sure these count as "inference" and would probably move them into their own model class - maybe as a subclass of DAG? This would allow you to reuse a lot of the machinery in that class for calculating things like conditional independences. Or maybe just as a separate class which can be created from any DAG. Either way, I don't think this fits pgmpy's notion of an Inference class, which seems to be tied to creating a factor graph.

Specifically the kind of class I'm thinking about would allow users to answer the question: "Given {assumed structure, described as a model} can I calculate {causal query of the form P(X|do(Y))}?". These questions can be asked and answered independently of what data is available or what is known about the conditional probability distributions. All you need is the assumed structure and some notion of what is observed/unobserved. This class would allow natural extensions to more general identification questions.
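
As a rough illustration of that kind of structure-only question, here is a minimal sketch in plain Python. It is not the API of CausalGraphicalModels or pgmpy; the function names (`d_separated`, `is_backdoor_set`, `find_backdoor_sets`) and the two toy graphs are made up for this example. It checks the backdoor criterion using nothing but an edge list and a set of observed variables: a candidate set may not contain descendants of the treatment, and it must d-separate treatment and outcome once the treatment's outgoing edges are removed.

```python
from itertools import combinations


def _reach(edges, start, reverse=False):
    """Nodes reachable from `start` along edges (against them if reverse)."""
    step = {}
    for u, v in edges:
        a, b = (v, u) if reverse else (u, v)
        step.setdefault(a, set()).add(b)
    seen, stack = set(start), list(start)
    while stack:
        for n in step.get(stack.pop(), ()):
            if n not in seen:
                seen.add(n)
                stack.append(n)
    return seen


def d_separated(edges, x, y, z):
    """True if x and y are d-separated by z (moralised ancestral graph test)."""
    z = set(z)
    keep = _reach(edges, {x, y} | z, reverse=True)  # the nodes and their ancestors
    adj = {n: set() for n in keep}
    parents = {n: set() for n in keep}
    for u, v in edges:
        if u in keep and v in keep:
            adj[u].add(v)
            adj[v].add(u)
            parents[v].add(u)
    for ps in parents.values():  # "marry" parents that share a child
        for a, b in combinations(ps, 2):
            adj[a].add(b)
            adj[b].add(a)
    seen, stack = {x}, [x]
    while stack:  # undirected search that is not allowed to pass through z
        for n in adj[stack.pop()]:
            if n in z or n in seen:
                continue
            if n == y:
                return False
            seen.add(n)
            stack.append(n)
    return True


def is_backdoor_set(edges, treatment, outcome, z):
    """Backdoor criterion, using only the assumed graph structure."""
    descendants = _reach(edges, {treatment}) - {treatment}
    if set(z) & descendants:
        return False
    no_outgoing = [(u, v) for u, v in edges if u != treatment]
    return d_separated(no_outgoing, treatment, outcome, z)


def find_backdoor_sets(edges, treatment, outcome, observed):
    """All valid adjustment sets that use observed variables only."""
    candidates = sorted(set(observed) - {treatment, outcome})
    return [
        set(z)
        for r in range(len(candidates) + 1)
        for z in combinations(candidates, r)
        if is_backdoor_set(edges, treatment, outcome, z)
    ]


# Toy graphs (assumptions for illustration only).
confounded = [("z", "x"), ("z", "y"), ("x", "y")]
front_door = [("u", "x"), ("u", "y"), ("x", "m"), ("m", "y")]  # u is unobserved

print(find_backdoor_sets(confounded, "x", "y", observed={"x", "y", "z"}))
print(find_backdoor_sets(front_door, "x", "y", observed={"x", "m", "y"}))
```

The first graph has one valid adjustment set, `{"z"}`; the second has none, because the only confounder is unobserved and `m` is a descendant of the treatment. No data or CPDs are involved at any point.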

  2. To move from calculating an estimand to actually calculating an estimate like ATE, you would need to combine this model with estimates of the relevant conditional probability distributions and apply the estimators. This would depend on the specifics of the model you were working from:
    • for a BayesianModel with known discrete conditional probability distributions, the estimate could be calculated directly from the front door/back door/more general formulas.
    • I don't understand the current SEM proposed, but it looks like it assumes linear relationships between variables and Gaussian noise? If so, it should be possible to analytically estimate the ATEs.
    • for more general "estimate directly from data" schemes I guess you could create a set of causal estimators that can be applied to data from the CausalInference class. Extensions to more general identification questions, beyond the simple discrete CPD question, are not trivial.

This one's difficult - I don't know what the best interface is. A good approach would be to set up a set of problems that we would like such a class to solve, and try a few out to understand the relative trade-offs.
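
For the simplest of these cases, discrete variables with known CPDs, the backdoor formula can be evaluated directly. The sketch below uses plain dictionaries rather than pgmpy objects, and all the probabilities are made-up numbers for a three-node model with edges z -> x, z -> y and x -> y.

```python
# Hypothetical CPDs (all variables binary); the numbers are illustrative only.
p_z = {0: 0.5, 1: 0.5}                # P(z)
p_x1_given_z = {0: 0.2, 1: 0.8}       # P(x=1 | z)
p_y1_given_xz = {                     # P(y=1 | x, z), keyed by (x, z)
    (0, 0): 0.1, (1, 0): 0.3,
    (0, 1): 0.5, (1, 1): 0.7,
}


def p_y1_do_x(x):
    """Backdoor adjustment: P(y=1 | do(x)) = sum_z P(y=1 | x, z) P(z)."""
    return sum(p_y1_given_xz[(x, z)] * p_z[z] for z in p_z)


def p_y1_given_x(x):
    """Observational P(y=1 | x) = sum_z P(y=1 | x, z) P(z | x)."""
    p_x_given_z = {
        z: p_x1_given_z[z] if x == 1 else 1 - p_x1_given_z[z] for z in p_z
    }
    p_x = sum(p_x_given_z[z] * p_z[z] for z in p_z)
    return sum(
        p_y1_given_xz[(x, z)] * p_x_given_z[z] * p_z[z] / p_x for z in p_z
    )


ate = p_y1_do_x(1) - p_y1_do_x(0)
naive = p_y1_given_x(1) - p_y1_given_x(0)
print(f"ATE via backdoor adjustment:  {ate:.2f}")
print(f"Naive conditional difference: {naive:.2f}")
```

With these numbers the adjusted ATE is 0.20 while the naive difference in conditional probabilities is 0.44, which is the gap the adjustment is meant to remove.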

  3. Extending this to counterfactual queries would be interesting! To do this you would need a way to describe a system in terms of general Structural Causal Models, perform inference on the latent variables, and then model the system after the intervention. One approach I've considered would be to describe the system in terms of a general Bayesian modelling system like STAN/Edward/Pyro and leverage their inference systems. It would not be simple though, and I'm not sure how easy this approach would be to fit into pgmpy.
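
To make those three steps concrete, here is a toy sketch for a hypothetical two-variable linear SCM with additive noise. The model, the coefficient and the function name are all assumptions made for illustration. In this special case the latent noise can be recovered exactly from one observation; in a general SCM that step needs posterior inference over the latents, which is where a probabilistic programming system would come in.

```python
# Hypothetical linear SCM with additive noise (coefficient is made up):
#   x = u_x
#   y = SLOPE * x + u_y
SLOPE = 2.0


def counterfactual_y(observed_x, observed_y, new_x):
    """What y would have been for this unit had x been `new_x`."""
    # 1. Abduction: recover the latent noise consistent with the observation.
    u_y = observed_y - SLOPE * observed_x
    # 2. Action: replace the equation for x with the intervention x = new_x.
    x = new_x
    # 3. Prediction: push the same noise through the modified model.
    return SLOPE * x + u_y


print(counterfactual_y(observed_x=1.0, observed_y=3.0, new_x=0.0))  # 1.0
```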

mrklees commented 5 years ago

Thanks for the thoughtful response! I definitely agree with your first point! Several of the methods were redundant with what was written in DAG, but I just hadn't spent the time to think about how I could get one to function with the other. It didn't end up being too bad though :) I also went the easy route and just leveraged statsmodels to estimate the ATE given the estimands. So far the ATE from the backdoor adjustment seems to be working :D

I've considered similar ideas for counterfactuals. Pyro even has a do operator built in, though it's pretty poorly documented. I'm likewise not quite sure how the SEM model will work in pgmpy, but it seems likely that you could adapt a lot of that code to work with counterfactuals if you can just apply the do operator.

ijmbarr commented 5 years ago

One final note on making causal inferences: be as explicit as possible about the assumptions your estimator is making. It looks like you are using linear regression and including the confounders as covariates - the assumption behind it is that the relationships between the variables in the structural causal model are linear. In situations where this assumption fails, the results can be misleading.
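
A small simulation can show how this goes wrong. The sketch below is an assumption-laden illustration, not the code from the fork: it uses hand-rolled least squares in plain Python instead of statsmodels, and the data-generating process is invented. The confounder z affects both treatment and outcome through z squared, so adjusting for z linearly leaves most of the confounding in place, while adjusting for z squared recovers the true effect.

```python
import random


def residualize(values, covariate):
    """Residuals from a simple least-squares fit of `values` on `covariate`."""
    n = len(values)
    mv = sum(values) / n
    mc = sum(covariate) / n
    slope = sum((c - mc) * (v - mv) for c, v in zip(covariate, values)) / sum(
        (c - mc) ** 2 for c in covariate
    )
    return [(v - mv) - slope * (c - mc) for c, v in zip(covariate, values)]


def adjusted_effect(y, x, covariate):
    """Coefficient on x in the least-squares fit y ~ 1 + x + covariate.

    Uses the Frisch-Waugh-Lovell result: regress the residualised outcome
    on the residualised treatment.
    """
    rx = residualize(x, covariate)
    ry = residualize(y, covariate)
    return sum(a * b for a, b in zip(rx, ry)) / sum(a * a for a in rx)


# Simulated data (made up): z confounds x and y through z**2, not through z.
rng = random.Random(0)
TRUE_EFFECT = 1.0
z, x, y = [], [], []
for _ in range(20000):
    zi = rng.uniform(-2, 2)
    xi = 1.0 if zi ** 2 + rng.gauss(0, 0.5) > 4 / 3 else 0.0
    yi = TRUE_EFFECT * xi + 3 * zi ** 2 + rng.gauss(0, 0.5)
    z.append(zi)
    x.append(xi)
    y.append(yi)

linear_in_z = adjusted_effect(y, x, z)
correct_form = adjusted_effect(y, x, [zi ** 2 for zi in z])
print(f"true effect:            {TRUE_EFFECT:.2f}")
print(f"adjusting for z:        {linear_in_z:.2f}")
print(f"adjusting for z ** 2:   {correct_form:.2f}")
```

Both fits "include the confounder as a covariate", but only the one whose functional form matches the structural model lands near the true effect of 1.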

munichpavel commented 4 years ago

Just a quick thank-you for having this conversation in an issue, so others like myself can profit from it.