Opening this issue for any general questions we have for each other.
@timothyb0912 I have a question about refactoring, specifically for our project. I feel that, in our case, given that we have to encode a causal graph and our assumptions about the data generating process, it wouldn't make sense to create functions that ask for user input about the causal graph, as well as the relationships between each of the nodes in that graph, before said functions can run any simulation.
I think this means that we will always have a notebook that doesn't contain any input data or parameters, but which interactively generates that data / those parameters (in this case, a causal graph and any assumptions encoded in it). The output of this notebook would be a serialized causal graph object (not sure if we can do that) that we save and then load into the other notebooks we use for simulation. Any thoughts about my logic here?
@bouzaghrane and @hassanobeid1994 In my opinion, three problems have been encountered in the project so far.
I'll start with the first two in this comment and come to the third in a later comment on this issue.
Please reply to this comment and let me know what you all think of my proposal at the bottom.
Okay, the first two problems I've seen are that:
I don't bring this up to point fingers. Far from it. I think these two steps were skipped because (a) it's not immediately clear how the two of you should / could perform the steps, and / or (b) performing these steps seemed like more effort than the benefit that would be derived from them.
Moreover, I think that the VAST majority of papers (i.e. basically all parametric-model-based applied papers I've ever read) fail to do a thorough job of checking their causal graphs and fitted models.
Of course, my opinion on this topic is clear.
I think checking our causal and (estimated/final) computational graphs is the most important part of the project... In my opinion, it makes little sense to use a model or causal graph whose assumptions may be grossly violated by the data at hand.
To this end, I've added a still-in-progress literature review on falsification of one's causal and post-estimation computational graphs to the repo in Pull Request #30. Based on the papers I've seen so far (and based on work I've done on this for a past project), we already know enough to immediately make use of multiple techniques for falsifying our graphs.
Moreover, I fully believe that we can do a decent, maybe even good, job of trying to falsify our models, and I believe we can do so very easily. That is, I believe we could make respectable and beneficial use of at least three ways of falsifying and checking our causal graphs within a week of standard amounts of work (10 - 15 hrs).
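To make the kind of check I have in mind concrete, here's a purely illustrative sketch: testing one conditional independence that a candidate graph implies. The variable names and the simple residual-based partial correlation test are placeholders for whatever data and test we actually end up using.

# Purely illustrative falsification check: if the graph
# HH_Size -> Travel_Distance -> Travel_Time is right, then Travel_Time should be
# independent of HH_Size given Travel_Distance. The toy data below is simulated
# just to show the mechanics.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=12)
n = 5_000
hh_size = rng.normal(size=n)
travel_distance = 0.8 * hh_size + rng.normal(size=n)
travel_time = 1.5 * travel_distance + rng.normal(size=n)

# Partial correlation via residuals from simple linear regressions on Travel_Distance.
resid_time = travel_time - np.polyval(np.polyfit(travel_distance, travel_time, 1), travel_distance)
resid_size = hh_size - np.polyval(np.polyfit(travel_distance, hh_size, 1), travel_distance)
corr, p_value = stats.pearsonr(resid_time, resid_size)
print(f"partial correlation: {corr:.3f} (p = {p_value:.3f})")
# A large partial correlation would be evidence against the assumed graph.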
Accordingly, given all of the above, I think we should pivot (once and for all) in terms of what we emphasize as our contributions and work for the conference.
I think we should focus on making the following points / contributions / demonstrations:
Let me know how you all feel about the proposed change in focus / emphasis.
I think showing a computational workflow for checking one's causal graph is the largest contribution we'll make, because no one else is doing so, or even doing remotely similar work, on what is, again, the most important aspect of the causal inference problem.
For all other aspects (e.g. model interpretation, fitting latent variable models to account for unobserved confounding, etc.), others have already done lots of work in the area.
@bouzaghrane and @hassanobeid1994 here's the final problem that I've seen so far.
Hassan, you had an issue when trying to first apply the deconfounder:
It was not clear how one should specify the causal relations between the unobserved confounder / substitute confounder and the observed covariates / causes.
What you actually did was specify a set of relations that were heavily informed by domain area expertise, i.e., by considerations of econometric principles such as avoiding mother logit problems.
Importantly, the model that you estimated has a causal graph that was (a) never explicitly drawn and (b) different from the one you initially wrote down.
So to state the issue clearly, we never specified the causal graph that corresponds to the identification strategy that we were using, which is a parametric model of latent mediation with unobserved confounding of the latent mediators.
Essentially, letting Z be an unobserved confounder, Z --> X --> Utility_per_alternative --> mode_choice and Z --> Utility_per_alternative.
In my opinion, we absolutely NEED to explicitly draw the causal graph such that it shows our identification strategy. Put another way, our causal graph and the computational graph of our model need to be compatible, i.e., encode and express the same causal relationships.
Drawing this graph should help us specify the final model in a defensible way, since we'll check its causal graph first. Indeed, specifying the final model amounts to drawing the graph.
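For concreteness, here's a hedged sketch of how we might encode and draw that graph in code. I'm assuming the causalgraphicalmodels package (its CausalGraphicalModel class); the node names are just shorthand for the ones above.

# Hypothetical sketch: encode Z --> X --> Utility_per_alternative --> Mode_Choice
# plus Z --> Utility_per_alternative, with Z standing in for the unobserved confounder.
# Assumes the causalgraphicalmodels package; swap in whatever graph library we settle on.
from causalgraphicalmodels import CausalGraphicalModel

identification_graph = CausalGraphicalModel(
    nodes=["Z", "X", "Utility_per_alternative", "Mode_Choice"],
    edges=[
        ("Z", "X"),
        ("X", "Utility_per_alternative"),
        ("Z", "Utility_per_alternative"),
        ("Utility_per_alternative", "Mode_Choice"),
    ],
)

identification_graph.draw()  # returns a graphviz object we can render in a notebook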
In general I think having our causal and computational graphs agree is a fantastic thing. I'll list some reasons why (aka some ways we can benefit from this correspondence) below:
Hopefully all of that makes sense and is as incredibly exciting to you all as it is to me.
Lastly, note that my insistence on us drawing our causal graphs to represent our identification strategy, aka drawing our causal graphs to include our models, essentially closes a loop with the graphical models literature. First there were graphical models for the following identification strategies: unobserved confounding, selection-on-observables, and instrumental variables. See Pearl's original 1995 Biometrika paper: http://bayes.cs.ucla.edu/R218-B.pdf. Then, more recently, graphs were finally drawn for econometric identification strategies such as reliance on monotonicity of effects, regression discontinuity, and propensity score designs. See for example https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6117124/.
As of yet, I've not explicitly seen anyone draw graphs for identification via parametric modeling assumptions about unobservables. That's essentially what we're doing: repurposing the neural network diagram for causal inference, haha. Note, however, that this is somewhat related to work in computer science on graphical models of preferences and utilities: https://arxiv.org/pdf/1302.4928.pdf
Alright, let me know your questions and thoughts!
Note to myself (and @bouzaghrane and @hassanobeid1994): I was clearly very excited about the idea (and still am): so excited I forgot to check to make sure I wasn't reinventing the wheel.
Choice modelers (aka Joan herself!!!) have long shown causal diagrams with utilities clearly depicted as latent mediators. See Figure 1 of the Ben-Akiva, Walker, et al. chapter from 1997. Oye.
We should show our causal graphs similarly, perhaps using plate notation to separately show the X_j and Utility_j for each alternative j.
That will be important when we show Hassan's latent confounders with one per (driving?) alternative, included in the plate rather than outside the plate as Wang and Blei might do.
@bouzaghrane I actually disagree. I think it's very reasonable to keep the specification of the causal graph (the input) separate from the functionality that uses that graph for simulation and anything else. You would then specify how the input object will look.
Ah got it!
@timothyb0912 to keep things simple, I think the function to initialize the causal model should look like this:
def InitiateModel(x=None, y=None):
    """Combine the variable and choice assignment dictionaries into one StructuralCausalModel."""
    # Merge the two input dictionaries and hand them to StructuralCausalModel,
    # which we import elsewhere in our .py file.
    modelDictionary = {**x, **y}
    return StructuralCausalModel(modelDictionary)
This function obviously depends on the StructuralCausalModel class that we import in our .py file. We can build additional functions that use functions from StructuralCausalModel to simulate or do whatever else we need to do, or we can use its internal methods (the drawback is that we won't have detailed docstrings for them). Here is an example of how this would look in practice:
import numpy as np  # the placeholder distributions below use np.random

## Specify the input of the causal graph in dictionary format
variables = {"HH_Size":
                 lambda n_samples: np.random.normal(size=n_samples),
             "num_of_kids_household":
                 lambda n_samples: np.random.normal(size=n_samples),
             "Autos_per_licensed_drivers":
                 lambda n_samples: np.random.normal(size=n_samples),
             "Gender":
                 lambda n_samples: np.random.normal(size=n_samples),
             "Travel_Distance":
                 lambda HH_Size, n_samples: np.random.normal(size=n_samples),
             "Cross_Bay_Bridge":
                 lambda Travel_Distance, n_samples: np.random.normal(size=n_samples),
             "Travel_Time":
                 lambda Travel_Distance, n_samples: np.random.normal(size=n_samples),
             "Travel_Cost":
                 lambda Travel_Distance, n_samples: np.random.normal(size=n_samples)
             }

choice = {"Mode_Choice":
              lambda num_of_kids_household, Cross_Bay_Bridge, Travel_Time,
                     Travel_Cost, Travel_Distance, HH_Size, Autos_per_licensed_drivers,
                     Gender, n_samples: np.random.normal(size=n_samples)
          }
## Initiate the model based on the specified structural form of each of the variables:
bike_model_ = InitiateModel(variables, choice)
## Sample data using StructuralCausalModel's internal methods (We can write our own
## methods to do the same thing if we want to add explanatory docstrings within them
## or do any additional things?)
bike_model_.sample(n_samples=100)
The output will be a dataframe of simulated data. I think that simplifying the structure of the input (the causal graph) for the user beyond this point would require considerably more code (though, of course, more robust code). Let me know what you think @timothyb0912 @hassanobeid1994
@bouzaghrane I'll come back to your questions after the following diversion:
@bouzaghrane and @hassanobeid1994 I think I finally have a concise answer to Vij's (paraphrased) question of "what's the big difference between causal models and econometric models that account for factors like endogeneity and omitted variables?"
So far, the differences that I've seen are that:
@bouzaghrane , your work addresses the first point. @hassanobeid1994, your work addresses the second point.
Soon enough you'll both have work on checking your causal graphs, and at that point, I think we'll be in a pretty good position to put things together for June!
@bouzaghrane, here are my suggestions: the InitiateModel function is probably not needed, as it's just a pass-through function with no real operations inside. I would instead use the models folder in src to store various causal models. I would store each model in a separate file, each model being its own class. The various causal models that we (i.e. users) specify should all subclass some abstract base class that we define (e.g., let's call it InputCausalModel). This abstract class will declare what input causal graphs should be / do. For example, we can specify that they all need to subclass StructuralCausalModel. The init methods of the specific causal graphs can essentially take no arguments and then just use super to call the init method of StructuralCausalModel and pass in the needed input dictionary declaring how the variables are created.
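A minimal sketch of the layout I mean, assuming StructuralCausalModel comes from the causalgraphicalmodels package (or wherever we actually import it from); the file and class names below are placeholders, not decisions:

# Hypothetical layout sketch; package, file, and class names are assumptions.
from abc import ABC

import numpy as np
from causalgraphicalmodels import StructuralCausalModel  # or wherever we import it from


# src/models/input_causal_model.py
class InputCausalModel(StructuralCausalModel, ABC):
    """Abstract base class that every input causal graph in src/models subclasses."""


# src/models/bike_model.py
class BikeModel(InputCausalModel):
    """One concrete causal model, living in its own file."""

    def __init__(self):
        # Takes no arguments; just hands the assignment dictionary to
        # StructuralCausalModel's init via super().
        assignments = {
            "HH_Size": lambda n_samples: np.random.normal(size=n_samples),
            "Travel_Distance":
                lambda HH_Size, n_samples: np.random.normal(size=n_samples),
        }
        super().__init__(assignments)


# Usage mirrors the earlier example:
# bike_model = BikeModel()
# simulated_df = bike_model.sample(n_samples=100)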
@timothyb0912 We never got to this during the call, but I have a few questions about the calc_probabilities function in choice_calcs. If it won't take much of your time, do you mind maybe sharing an example of how it would work? I am trying to use it to replace the functions we have in the notebook so far, but I just want to make sure I am understanding all the needed parameters correctly. If an example would take lots of time, please feel free to explain in whatever way would take the least amount of time. I understand some of the parameters, but not all. Thanks brother!
Ah, for sure. @bouzaghrane, check https://github.com/timothyb0912/pylogit/blob/master/tests/test_choice_calcs.py#L307 and the accompanying attributes added to self in the setUp method for the test (https://github.com/timothyb0912/pylogit/blob/master/tests/test_choice_calcs.py#L30).
To see how pylogit uses the calc_probabilities function "from outside", trace the usage back through https://github.com/timothyb0912/pylogit/blob/master/pylogit/base_multinomial_cm_v2.py#L1779 and see how I set the parameters when using the MNL model. That's probably the easiest way to see it.
Also see https://github.com/timothyb0912/pylogit/blob/master/tests/test_base_cm_predict.py#L336
Let me know if things are still unclear!
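In the meantime, here's a rough, hedged sketch of that "from outside" route via pylogit's public MNL interface, whose predict method ends up calling calc_probabilities internally. The toy data, column names, and one-variable specification are placeholders, not our project's data.

# Hedged sketch only: toy long-format data plus a one-variable MNL specification.
from collections import OrderedDict

import numpy as np
import pandas as pd
import pylogit as pl

# Build a tiny long-format dataset: 100 observations x 3 alternatives.
rng = np.random.default_rng(0)
rows = []
for obs_id in range(1, 101):
    chosen_alt = rng.integers(1, 4)
    for alt_id in (1, 2, 3):
        rows.append({"obs_id": obs_id,
                     "alt_id": alt_id,
                     "choice": int(alt_id == chosen_alt),
                     "travel_time": rng.uniform(5, 60)})
long_df = pd.DataFrame(rows)

# One generic travel time coefficient shared across the three alternatives.
spec = OrderedDict()
names = OrderedDict()
spec["travel_time"] = [[1, 2, 3]]
names["travel_time"] = ["travel_time_coefficient"]

mnl = pl.create_choice_model(data=long_df,
                             alt_id_col="alt_id",
                             obs_id_col="obs_id",
                             choice_col="choice",
                             specification=spec,
                             model_type="MNL",
                             names=names)
mnl.fit_mle(np.zeros(1))              # estimate the single coefficient
probabilities = mnl.predict(long_df)  # this route goes through calc_probabilities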
Also, this is totally another reason to write tests for everything (after we prototype all the way through to the end). Tests provide examples and "living documentation" for all the tested functions!
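For instance, something like the following (purely illustrative; the import path and the expected column are assumptions about where the simulation object ends up living):

# Illustrative pytest-style test that doubles as documentation for the simulator.
# Assumes the bike_model_ object from the snippet above becomes importable; the
# module path here is hypothetical.
from src.models.bike_model import bike_model_


def test_sample_returns_one_row_per_requested_sample():
    simulated = bike_model_.sample(n_samples=100)
    assert len(simulated) == 100
    assert "Mode_Choice" in simulated.columns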
@timothyb0912 @hassanobeid1994 I put together a little workflow that I want to discuss today. Here it is:
Here I try to illustrate the procedure for simulating data based on a causal graph (a rough code sketch of the repeated steps follows the list below).
1. Produce graphics of the causal models of each alternative, based on the specification of the model from the asymmetric paper (8 total), to help in visualizing how each variable will be simulated.
2. Find distributions of demographic variables and alternative-specific variables that have no parent in the causal graphs.
3. Run distributional regressions on each alternative-specific variable.
4. Simulate data for nodes with no parents in the causal graphs.
5. Simulate variables based on the causal graph of each alternative.
6. Simulate availability of alternatives based on shares from the original dataset.
7. Convert the data from wide to long format.
8. Simulate choices based on the specified utilities.
9. Estimate the model and recover the parameters.
10. Repeat steps 4-10 a number of times, N_simulations, and store the outputs of each model.
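Roughly, the repeated portion could look like the sketch below; every helper function is a placeholder standing in for its step (not actual project code), and bike_model_ is the object from the snippet earlier in this thread.

# Hypothetical sketch of the simulate-estimate loop; the helpers are placeholders.
import numpy as np
import pandas as pd

N_SIMULATIONS = 100  # placeholder repetition count


def simulate_availability(wide_df, observed_shares):
    """Step 6 placeholder: draw availability of each alternative from its observed share."""
    for alt, share in observed_shares.items():
        wide_df[f"availability_{alt}"] = np.random.binomial(1, share, size=len(wide_df))
    return wide_df


def wide_to_long(wide_df):
    """Step 7 placeholder: convert the simulated data from wide to long format."""
    return wide_df


def simulate_choices_and_estimate(long_df):
    """Steps 8-9 placeholder: simulate choices from the specified utilities,
    estimate the model, and return the recovered parameters."""
    return pd.Series(dtype=float)


results = []
for _ in range(N_SIMULATIONS):
    # Steps 4-5: sample root nodes and their descendants from the causal graph.
    wide_data = bike_model_.sample(n_samples=4000)
    wide_data = simulate_availability(wide_data,
                                      observed_shares={"drive_alone": 0.6, "bike": 0.05})
    long_data = wide_to_long(wide_data)
    results.append(simulate_choices_and_estimate(long_data))  # store each run's output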
@hassanobeid1994 whenever you're done with the exams, can you edit the workflow to reflect what you wanted to see?
Closing due to lack of use.