hassan-obeid / tr_b_causal_2020

Causal inference conference - June 2020
MIT License

[All] Quick questions & discussions #12

Closed bouzaghrane closed 4 years ago

bouzaghrane commented 4 years ago

Opening this issue for any general questions we have for each other.

bouzaghrane commented 4 years ago

@timothyb0912 I have a question about refactoring, specifically for our project, given that we have to encode a causal graph and our assumptions about the data generating process. In my opinion, it wouldn't make sense to create functions that ask for user input about the causal graph and the relationships between each of the nodes in our causal graph before those functions can run any simulation.

I think this means that we will always have a notebook that doesn't take any input data or parameters, but which interactively generates this data / these parameters (in this case, a causal graph and the assumptions encoded in it). The output of this notebook would be a serialized causal graph object (not sure if we can do that) that we save and then load into other notebooks for simulation. Any thoughts about my logic here?

timothyb0912 commented 4 years ago

@bouzaghrane I actually disagree. I think it's very reasonable to keep the specification of the causal graph (the input) separate from the functionality that uses that graph for simulation and anything else. You would then specify how the input object will look.

timothyb0912 commented 4 years ago

@bouzaghrane and @hassanobeid1994 In my opinion, three problems have been encountered in the project so far.

I'll start with the first two in this comment and come to the third in a later comment on this issue.

Please reply to this comment and let me know what you all think of my proposal at the bottom.

Okay, the first two problems I've seen are that:

  1. Falsification of our initially proposed causal graph was skipped.
  2. Checking of our initially estimated model (aka the computational graph that corresponds to our causal graph) was skipped.

I don't bring this up to point fingers. Far from it. I think these two steps were skipped because (a) it's not immediately clear how the two of you should / could perform them, and / or (b) performing them seemed like more effort than the benefit that would be derived from them.

Moreover, I think that the VAST majority of papers (i.e. basically all parametric-model-based applied papers I've ever read) fail to do a thorough job of checking their causal graphs and fitted models.

Of course, my opinion on this topic is clear.

I think checking our causal and (estimated / final) computational graphs is the most important part of the project... In my opinion, it makes little sense to use a model or causal graph whose assumptions may be grossly violated by the data at hand.

To this end, I've added a still-in-progress literature review on falsification of one's causal and post-estimation computational graphs to the repo in Pull Request #30. Based on the papers I've seen so far (and on work I've done on this for a past project), we already know enough to immediately make use of multiple techniques for falsifying our graphs.

Moreover, I fully believe that we can do a decent, and maybe even good, job of trying to falsify our models, and that we can do so very easily. That is, I believe we could make respectable and beneficial use of at least three ways of falsifying and checking our causal graphs within a week of a standard amount of work (10 - 15 hrs).

Accordingly, given all of the above, I think we should pivot (once and for all) in terms of what we emphasize as our contributions and work for the conference.

I think we should focus on making the following points / contributions / demonstrations:

  1. the fact that without a causal graph, we can't causally interpret any econometric models we have estimated.
  2. that accounting for the unobserved confounding that is present in our datasets is possible and it does materially change our substantive conclusions / causal-effect estimates.
  3. that coming up with a plausible causal graph is not trivial, and should be done with the aid of tons of computational and visual tools, some of which are very easy to use.

Let me know how you all feel about the proposed change in focus / emphasis.

I think showing a computational workflow for checking one's causal graph is the largest contribution we'll make because no one else is doing so or even doing remotely similar work, again, on the most important aspect of the causal inference problem.

For all other aspects (e.g. model interpretation, fitting latent variable models to account for unobserved confounding, etc.), others have already done lots of work in the area.

timothyb0912 commented 4 years ago

@bouzaghrane and @hassanobeid1994 here's the final problem that I've seen so far.

Hassan, you had an issue when trying to first apply the deconfounder:

It was not clear how one should specify the causal relations between the unobserved confounder / substitute confounder and the observed covariates / causes.

What you actually did was specify a set of relations that were heavily informed by domain area expertise---considerations of econometric principles such as avoiding mother logit problems.

Importantly, the model that you estimated has a causal graph that was (a) never explicitly drawn and (b) different from the one you initially wrote down.

So to state the issue clearly, we never specified the causal graph that corresponds to the identification strategy that we were using, which is a parametric model of latent mediation with unobserved confounding of the latent mediators.

Essentially, letting Z be an unobserved confounder, Z --> X --> Utility_per_alternative --> mode_choice and Z --> Utility_per_alternative.

In my opinion we absolutely NEED to explicitly draw the causal graph such that it shows our identification strategy. Put another way, our causal graph and the computational graph of our model need to be compatible / encode-and-express the same causal relationships.

Drawing this graph should help us specify the final model in a defensible way as we'll check its causal graph first. Indeed the specification of the final model is the drawing of the graph.
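
To make this concrete, here's a minimal sketch of what drawing that graph in code might look like. I'm assuming the causalgraphicalmodels package here, and the node names just mirror the shorthand above; treat this as illustrative, not as the final graph.

from causalgraphicalmodels import CausalGraphicalModel  # assumed package

# Z is the unobserved / substitute confounder; drawing it explicitly makes the
# identification strategy visible:
# Z --> X --> Utility_per_alternative --> mode_choice, and Z --> Utility_per_alternative.
identification_graph = CausalGraphicalModel(
    nodes=["Z", "X", "Utility_per_alternative", "mode_choice"],
    edges=[
        ("Z", "X"),
        ("Z", "Utility_per_alternative"),
        ("X", "Utility_per_alternative"),
        ("Utility_per_alternative", "mode_choice"),
    ],
)

# .draw() returns a graphviz object that renders inline in a notebook.
identification_graph.draw()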

In general I think having our causal and computational graphs agree is a fantastic thing. I'll list some reasons why (aka some ways we can benefit from this correspondence) below:

  1. It shows how parametric, model-based causal inference explicitly "models the science" of a problem, that is, how model-based causal inference explicitly models the mechanisms of the system in question (aka the causal pathway between X and Y, the mode_choice in our case). Essentially, this shows how the models of Rubin, Pearl, and Heckman are all related.
  2. It allows and invites us to make use of all the deep learning methods for:
    • estimating / inferring / optimizing the number and structure of mediating latent variables (aka the number of hidden layers and neurons in those layers, aka the underlying utility / GEV network structure?)
    • diagnosing and measuring over-parameterization of our latent variable model
    • detecting / testing for unobserved confounding between sets of variables by way of methods for estimating the "intrinsic dimension" of those variables.
    • and more.
  3. It allows us to make connections with a bunch of different literatures. By designing the causal graph and then combining it with parametric models of all parts of this graph we're:
    • designing a hybrid choice model and paying much closer than usual attention to model validation / checking
    • designing a (potentially deep? and definitely causal) generative model of our dataset
  4. It clarifies how we should causally interpret our choice models, since we can now rely on all the graphical model tools from the causal inference literature for interpreting mediation models.

Hopefully all of that makes sense and is as incredibly exciting to you all as it is to me.

Lastly, note that my insistence on drawing our causal graphs to represent our identification strategy, aka drawing our causal graphs to include our models, essentially closes a loop with the graphical models literature. First there were graphical models for the following identification strategies: unobserved confounding, selection-on-observables, and instrumental variables. See Pearl's original 1995 Biometrika paper: http://bayes.cs.ucla.edu/R218-B.pdf. Then, more recently, graphs were finally drawn for econometric identification strategies such as reliance on monotonicity of effects, regression discontinuity, and propensity score designs. See for example https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6117124/.

As of yet, I've not explicitly seen anyone draw graphs for identification via parametric modeling assumptions about unobservables. That's essentially what we're doing: repurposing the neural network diagram for causal inference, haha. Note, however, that this is somewhat related to work in computer science on graphical models of preferences and utilities: https://arxiv.org/pdf/1302.4928.pdf

Alright, let me know your questions and thoughts!

timothyb0912 commented 4 years ago

Note to myself (and @bouzaghrane and @hassanobeid1994): I was clearly very excited about the idea (and still am): so excited I forgot to check to make sure I wasn't reinventing the wheel.

Choice modelers (aka Joan herself!!!) have long shown causal diagrams with utilities clearly shown as latent mediators. See Figure 1 of the Ben-Akiva, Walker et al chapter from 1997. Oye.

We should show our causal graphs similarly, perhaps using plate notation to separately show the X_j and Utility_j for each alternative j.

That will be important when we show Hassan's latent confounders with one per (driving?) alternative, included in the plate rather than outside the plate as Wang and Blei might do.

bouzaghrane commented 4 years ago

> @bouzaghrane I actually disagree. I think it's very reasonable to keep the specification of the causal graph (the input) separate from the functionality that uses that graph for simulation and anything else. You would then specify how the input object will look.

Ah got it!

bouzaghrane commented 4 years ago

@timothyb0912 to keep things simple, I think the function to initialize the causal model should look this way:

from causalgraphicalmodels import StructuralCausalModel  # assuming this is the package providing StructuralCausalModel

def InitiateModel(x=None, y=None):
    """Combine the variable and choice specifications into one structural causal model."""
    modelDictionary = {**(x or {}), **(y or {})}
    return StructuralCausalModel(modelDictionary)

This function obviously depends on the StructuralCausalModel class that we import in our .py file. We can build additional functions that use functions from StructuralCausalModel to simulate or do whatever else we need to do, or we can use its internal methods (the drawback is that we won't have detailed docstrings for them). Here is an example of how this would work:

import numpy as np

## Specify the input of the causal graph in dictionary format
variables = {"HH_Size":
           lambda n_samples: np.random.normal(size=n_samples),
           "num_of_kids_household":
           lambda n_samples: np.random.normal(size=n_samples),
           "Autos_per_licensed_drivers":
           lambda n_samples: np.random.normal(size=n_samples),
           "Gender":
           lambda n_samples: np.random.normal(size=n_samples),
           "Travel_Distance":
           lambda HH_Size, n_samples: np.random.normal(size=n_samples),
           "Cross_Bay_Bridge":
           lambda Travel_Distance, n_samples: np.random.normal(size=n_samples),
           "Travel_Time":
           lambda Travel_Distance, n_samples: np.random.normal(size=n_samples),
           "Travel_Cost":
           lambda Travel_Distance, n_samples: np.random.normal(size=n_samples)
}
choice = {"Mode_Choice":
           lambda num_of_kids_household, Cross_Bay_Bridge, Travel_Time,
                  Travel_Cost, Travel_Distance, HH_Size, Autos_per_licensed_drivers,
                  Gender, n_samples: np.random.normal(size=n_samples)
}

## Initiate the model based on the specified structural form of each of the variables:

bike_model_ = InitiateModel(variables, choice)

## Sample data using StructuralCausalModel's internal methods (We can write our own 
## methods to do the same thing if we want to add explanatory docstrings within them
## or do any additional things?)

bike_model_.sample(n_samples=100)

The output will be a dataframe of simulated data. I think that simplifying the structure of the input (the causal graph) beyond this will really result in much longer code (though of course more robust code). Let me know what you think @timothyb0912 @hassanobeid1994

timothyb0912 commented 4 years ago

@bouzaghrane I'll come back to your questions after the following diversion:

@bouzaghrane and @hassanobeid1994 I think I finally have a concise answer to Vij's (paraphrased) question of "what's the big difference between causal models and econometric models that account for factors like endogeneity and omitted variables?"

So far, the differences that I've seen are that:

  1. Causal graphs are fundamental to causal models in all cases, even if there is no unobserved confounding. The causal interpretation of functionals of a given model parameter (e.g. changes in predicted probabilities under a policy) is dependent on the causal graph.
  2. When dealing with observational data and unobserved confounding, causal inference methods emphasize (indicator-free) latent variable models to much greater extents than traditional structural methods in econometrics (including discrete choice methods). These models (e.g. the deconfounder) can substantively change one's results relative to standard models that ignore such unobserved confounding.
  3. With causal inference methods, the causal graph is treated as more than a pictorial representation of one's assumptions. Testable (i.e. statistical) implications are derived from the graph itself (without parametric assumptions), and they are used to falsify / rule out candidate causal graphs that appear to be inconsistent with one's data (see the sketch right after this list).
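
To illustrate the third point, here's a minimal sketch of deriving a graph's testable implications. I'm again assuming the causalgraphicalmodels package, and the three-node graph is purely illustrative.

from causalgraphicalmodels import CausalGraphicalModel  # assumed package

# A toy graph: Travel_Distance -> Travel_Time -> Mode_Choice.
toy_graph = CausalGraphicalModel(
    nodes=["Travel_Distance", "Travel_Time", "Mode_Choice"],
    edges=[
        ("Travel_Distance", "Travel_Time"),
        ("Travel_Time", "Mode_Choice"),
    ],
)

# Each returned tuple (x, y, z) says "x is independent of y given z" under the
# graph; these are the implications we can test against the data without any
# parametric assumptions.
print(toy_graph.get_all_independence_relationships())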

@bouzaghrane , your work addresses the first point. @hassanobeid1994, your work addresses the second point.

Soon enough you'll both have work on checking your causal graphs, and at that point, I think we'll be in a pretty good position to put things together for June!

timothyb0912 commented 4 years ago

@bouzaghrane, here are my suggestions:

  1. Open a pull request (and branch) and commit your work-in-progress code. It'll be easier for me to comment there, as the refactoring comments will all be kept separate from other discussions.
  2. The InitiateModel function is probably not needed, as it's just a pass-through function with no real operations inside. I would instead use the models folder in src to store the various causal models, one model per file, each model being its own class. The various causal models that we (i.e. users) specify should all subclass some abstract base class that we define (e.g. let's call it InputCausalModel). This abstract class will declare what input causal graphs should be / do; for example, we can specify that they all need to subclass StructuralCausalModel. The init methods of the specific causal graphs can essentially take no arguments and then just use super to call the init method of StructuralCausalModel, passing in the input dictionary that declares how the variables are created (see the sketch below).
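
Here's a minimal sketch of that layout, just to make the suggestion concrete; the file names, DriveAloneModel, and the toy variable dictionary are all hypothetical, and I'm assuming StructuralCausalModel comes from the causalgraphicalmodels package.

# src/models/model_base.py  (hypothetical file name)
from causalgraphicalmodels import StructuralCausalModel  # assumed source package

class InputCausalModel(StructuralCausalModel):
    """Base class declaring what our input causal models should be / do.

    Could also inherit abc.ABC if we want to formally enforce an interface.
    """

# src/models/drive_alone_model.py  (hypothetical concrete model)
import numpy as np

class DriveAloneModel(InputCausalModel):
    """One user-specified causal model; note that its init takes no arguments."""

    def __init__(self):
        # Toy assignment functions; the real ones would encode our causal graph.
        variable_dict = {
            "Travel_Distance":
                lambda n_samples: np.random.normal(size=n_samples),
            "Travel_Time":
                lambda Travel_Distance, n_samples:
                    2.0 * Travel_Distance + np.random.normal(size=n_samples),
        }
        super().__init__(variable_dict)

# Usage: DriveAloneModel().sample(n_samples=100) returns a dataframe of simulated data.
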
bouzaghrane commented 4 years ago

@timothyb0912 We never got to this during the call, but I have a few questions about the calc_probabilities function in choice_calcs. If it won't take much of your time, do you mind maybe sharing an example of how it would work? I am trying to use it to replace the functions we have in the notebook so far, but I just want to make sure I am understanding all the needed parameters correctly. If an example would take lots of time, please feel free to explain in whatever way would take the least amount of time. I understand some of the parameters, but not all. Thanks brother!

timothyb0912 commented 4 years ago

Ah, for sure. @bouzaghrane, check https://github.com/timothyb0912/pylogit/blob/master/tests/test_choice_calcs.py#L307 and the accompanying attributes added to self in the setUp method for the test (https://github.com/timothyb0912/pylogit/blob/master/tests/test_choice_calcs.py#L30).

To see how pylogit uses the calc_probabilities function "from outside", trace the usage back through https://github.com/timothyb0912/pylogit/blob/master/pylogit/base_multinomial_cm_v2.py#L1779 and see how I set the parameters when using the MNL model. That's probably the easiest way to see it.

Also see https://github.com/timothyb0912/pylogit/blob/master/tests/test_base_cm_predict.py#L336

Let me know if things are still unclear!

Also, this is totally another reason to write tests for everything (after we prototype all the way through to the end). Tests provide examples and "living documentation" for all the tested functions!

bouzaghrane commented 4 years ago

@timothyb0912 @hassanobeid1994 I put together a little workflow that I want to discuss today. Here it is:

Here I try to illustrate the procedure for simulating data based on a causal graph.

  1. Produce graphics of the causal models of each alternative based on the specification of the model from the asymmetric paper (8 total) to help in visualizing how each variable will be simulated.

  2. Find distributions of demographic variables and alternative specific variables that have no parent in the causal graphs:

    • Input: a vector of values for each variable separately for each alternative, and a vector of values for the demographic variables
    • Output: a nested dictionary, each nest representing an alternative and the final nest holding the demographic variables. Each of the nests will be a dictionary of distributions and parameters for the alternative specific variables, so we will have a distribution per variable per alternative.
  3. Run distributional regression on each alternative specific variables:

    • This regression will be done for each alternative specific variable, per alternative
    • Input: a vector of X and Y depending on the causal graph (these Xs and Ys, for example travel distance and travel time, will come from the original dataset); see the toy sketch after this list
    • Output: a nested dictionary, similar to the nested dictionary from step 2, storing each of the found distributions/parameters.
  4. Simulate data for nodes with no parents in the causal graphs:

    • Input: distribution of the demographic variables and distribution of each of the alternative specific variables, for each alternative
    • Output: dataframe with length N containing all demographic variables and alternative specific variables that do not have any parents
    • Each row in this dataframe will represent an observed individual, assuming that each individual has all alternatives available
  5. Simulation of variables based on causal graph of each alternative:

    • For each alternative, simulate data for the nodes in the causal graph of its utility, except for its utility?
    • Input: distribution and parameters for nodes with no parents AND relationships based on regression from 3
    • Output: a simulated dataset of alternative specific variables and demographic variables; this step takes the size-N parents (from step 4) and results in size-N dataframes for each alternative
  6. Simulation of availability of alternatives based on shares from original dataset

    • Output: dataset, in wide format, with availability of each alternative.
  7. Convert data from wide to long

    • Input: Wide dataset
    • Output: Long format dataset
  8. Simulate choices based on specified utilities

    • Input: Long format dataset without simulated choices
    • Output: Long format dataset with simulated choices
  9. Estimate the model and recover parameters

    • Input: long format dataset with simulated choices
    • Output: Estimated models
  10. Repeat steps 4-9 N_simulations times and store the outputs of each model
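
As a toy, concrete illustration of steps 2-5 (referenced in step 3 above): fit a distribution to a parentless node, fit a simple distributional regression for a child node, and then simulate both. Everything below is a sketch: the variable names, the normal / linear-Gaussian choices, and the numpy-only approach are placeholders for whatever we actually settle on.

import numpy as np

# Toy "original dataset" standing in for the real data.
rng = np.random.default_rng(42)
obs_distance = rng.gamma(shape=2.0, scale=5.0, size=500)
obs_time = 2.0 * obs_distance + rng.normal(scale=3.0, size=500)

# Step 2: fit a distribution to a parentless node (normal, for illustration).
distance_params = {"mean": obs_distance.mean(), "std": obs_distance.std()}

# Step 3: distributional regression of Travel_Time on Travel_Distance
# (here, a simple linear-Gaussian fit via least squares).
slope, intercept = np.polyfit(obs_distance, obs_time, deg=1)
residual_std = np.std(obs_time - (slope * obs_distance + intercept))
time_params = {"slope": slope, "intercept": intercept, "std": residual_std}

# Steps 4-5: simulate parentless nodes first, then their children.
n_samples = 1000
sim_distance = rng.normal(distance_params["mean"], distance_params["std"], n_samples)
sim_time = (time_params["intercept"]
            + time_params["slope"] * sim_distance
            + rng.normal(scale=time_params["std"], size=n_samples))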

bouzaghrane commented 4 years ago

@hassanobeid1994 whenever you're done with the exams, can you edit the workflow to reflect what you wanted to see?

timothyb0912 commented 4 years ago

Closing due to lack of use.