py-why / dowhy

DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
https://www.pywhy.org/dowhy
MIT License

Multi-dimensional confounders for CausalModel #998

Closed markov24 closed 1 year ago

markov24 commented 1 year ago

Question: How exactly does DoWhy handle multi-dimensional confounders? In all of the examples I have seen, the nodes in a graph are one-dimensional, and when multiple variables act as confounders, the examples simply define separate nodes. I am working on a case involving textual confounding, so I want to use DoWhy where my confounder is a bag-of-words representation of text. This is a matrix with 50,000 columns, so I can't enter 50,000 nodes. When I used a single node holding the matrix, I kept getting hashing errors. I tried passing it in as a list of tuples, which does compute causal effects, but I'm not sure everything under the hood is working correctly. Is there any way to use DoWhy where a single node/variable is a matrix?

amit-sharma commented 1 year ago

Representing a graph node with multi-dimensional data is not supported.

But here's a way to achieve your task without using the graph explicitly. You can create a dataframe with 50,000 columns containing your text data. Then you can simply include all those column names as confounders.

m = CausalModel(df,
                treatment="v0",
                outcome="y",
                common_causes=["w" + str(i) for i in range(50000)])

See this notebook for a full example.
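To make the suggestion above concrete, here is a minimal sketch of turning a bag-of-words matrix into a dataframe whose columns can be listed as confounders. The column prefix `w` and the treatment/outcome names `v0`/`y` follow the snippet above; the toy data and sizes are placeholders (the issue's real matrix has 50,000 columns), and the CausalModel call is shown in a comment so the sketch stands alone.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a bag-of-words matrix: n documents x V vocabulary counts.
# In practice this comes from a text vectorizer; V = 50,000 in the issue.
n, V = 200, 100
rng = np.random.default_rng(0)
bow = rng.integers(0, 3, size=(n, V))

# Name each vocabulary column w0..w{V-1} so the names can be passed as confounders.
confounder_cols = ["w" + str(i) for i in range(V)]
df = pd.DataFrame(bow, columns=confounder_cols)
df["v0"] = rng.integers(0, 2, size=n)        # binary treatment
df["y"] = df["v0"] + bow.sum(axis=1) * 0.01  # toy outcome

# The CausalModel call then takes the column names directly, as above:
# from dowhy import CausalModel
# m = CausalModel(df, treatment="v0", outcome="y", common_causes=confounder_cols)
print(df.shape)  # (200, 102)
```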

markov24 commented 1 year ago

Thank you for the reply! I implemented what you said, but the following takes way too long to compute:

identified_estimand = model.identify_effect(proceed_when_unidentifiable=True, method_name="maximal-adjustment")

Even after changing method_name away from the default, the call above still takes over 2 hours to run. Is there any way around this? When I previously passed the text in as a tuple, it still computed causal effects, so I'm not sure whether the default learners handled the tuple correctly. Is there a way to use a custom logistic regression for the propensity estimators, one that I can adjust to handle tuples correctly? Or is there a way to use CausalModel without having to run model.identify_effect?

amit-sharma commented 1 year ago

Oh, I see. The defaults in DoWhy might not be tuned for such a large graph. Can you share a minimal working example where identify_effect takes 2 hours? Fixing that would be one solution.

If you are already sure of which backdoor variables you want to use, another trick is to simply create an IdentifiedEstimand object of your own. As long as you provide the treatment_name, outcome_name, and backdoor variables, you should be able to use it downstream. The backdoor variables need to be a dict d where d["backdoor"] = list(your_backdoor_variables).

See the IdentifiedEstimand constructor here. Actually, the simplest method may be to obtain an IdentifiedEstimand object using a smaller version of your graph and then simply change its backdoor variables.
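The dict shape described above can be sketched in plain Python; the `w0..w49999` names mirror the issue's setup, and the assignment line is shown as a comment since it assumes an existing IdentifiedEstimand object.

```python
# Backdoor variables in the shape DoWhy expects on an IdentifiedEstimand:
# a dict mapping "backdoor" to the list of adjustment-variable names.
backdoor_vars = ["w" + str(i) for i in range(50000)]
d = {"backdoor": backdoor_vars}

# Downstream this would be assigned as, e.g.:
# identified_estimand.backdoor_variables = d
print(len(d["backdoor"]))  # 50000
```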

markov24 commented 1 year ago

Thank you for the help! Just to be sure, are you saying that doing the following

model = CausalModel(df, 
                    treatment="a", 
                    outcome="y", 
                    common_causes = ["w" + str(i) for i in range(500)]
                    )
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True, method_name="maximal-adjustment")

identified_estimand.backdoor_variables = {'backdoor': ["w" + str(i) for i in range(50000)]}

where I initially set it to 500 so it computes quickly, and then change it back to the original size, would take care of everything? If I then compute the ATE using IPW as

causal_estimate_ipw = model.estimate_effect(identified_estimand,
                                            method_name="backdoor.propensity_score_weighting",
                                            target_units = "ate",
                                            method_params={"weighting_scheme":"ips_weight"}) 

would this use all 50,000 columns (i.e., is identified_estimand.backdoor_variables the only thing the estimator looks at)? Should I be worried that after it finishes running, it prints out only the original 500?
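On the earlier question about a custom logistic regression for the propensity step: the core of propensity_score_weighting can also be hand-rolled with scikit-learn, which makes it easy to swap in any classifier. This is an illustrative sketch of the IPW idea on synthetic data, not DoWhy's actual implementation; whether DoWhy's estimator accepts a custom propensity model via method_params should be checked against the estimator's documentation for your version.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, k = 1000, 5
W = rng.normal(size=(n, k))                 # confounders
p_a = 1 / (1 + np.exp(-W[:, 0]))            # treatment probability depends on w0
a = rng.binomial(1, p_a)                    # binary treatment
y = 2.0 * a + W[:, 0] + rng.normal(size=n)  # true ATE = 2.0

# Fit any classifier you like as the propensity model.
ps_model = LogisticRegression(max_iter=1000).fit(W, a)
ps = ps_model.predict_proba(W)[:, 1]

# Inverse-propensity weights: 1/ps for treated, 1/(1-ps) for control.
w_ipw = np.where(a == 1, 1.0 / ps, 1.0 / (1.0 - ps))

# Weighted difference of means estimates the ATE (should be close to 2.0 here).
ate = (np.sum(w_ipw * a * y) / np.sum(w_ipw * a)
       - np.sum(w_ipw * (1 - a) * y) / np.sum(w_ipw * (1 - a)))
print(ate)
```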

### Estimand : 1
Estimand name: backdoor
Estimand expression:
 d                                                                            
────(E[y|w335,w317,w135,w282,w391,w20,w485,w381,w58,w347,w138,w424,w443,w15,w2
d[a]                                                                          

39,w41,w177,w327,w71,w101,w483,w389,w392,w139,w136,w156,w494,w208,w437,w210,w4

87,w42,w275,w290,w331,w209,w449,w236,w294,w16,w324,w435,w455,w265,w214,w225,w4

44,w143,w21,w359,w120,w82,w168,w161,w274,w221,w295,w103,w297,w199,w354,w414,w3

...
## Estimate
Mean value: 0.14533443433985832

amit-sharma commented 1 year ago

Not exactly. You need to create the estimand from the smaller dataset and then update the backdoor variables, because estimate_effect should use the new dataset.

Here's a sample code.

from dowhy import CausalModel
import numpy as np
import pandas as pd

# full dataset
bigN = 5000
arr = np.random.random((1000,bigN+2))
df = pd.DataFrame(data=arr, columns=["a", "y"]+ ["w" + str(i) for i in range(bigN)])

# dataset with only 20 confounders
df_small = pd.DataFrame(data=arr[:, :22], columns=["a", "y"]+ ["w" + str(i) for i in range(20)])
model = CausalModel(df_small, 
                    treatment="a", 
                    outcome="y", 
                    common_causes = ["w" + str(i) for i in range(20)]
                    )
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True, method_name="maximal-adjustment")
print(identified_estimand)
# Now create a new CausalModel on the full data and assign the backdoor variables
model = CausalModel(data=df,
                    treatment="a",
                    outcome="y",
                    common_causes=["w" + str(i) for i in range(bigN)]
                    )
identified_estimand.backdoor_variables = {"backdoor": ["w" + str(i) for i in range(bigN)]}
causal_estimate = model.estimate_effect(identified_estimand,
                                        method_name="backdoor.linear_regression")
print(causal_estimate)

In the output, look for the Realized Estimand section; that one should be correct. Note that the Estimand expression printed under backdoor will still reflect the smaller graph and hence be incorrect, and that is okay.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 14 days with no activity.

amit-sharma commented 1 year ago

@markov24 were you able to run your code with multiple confounders?

markov24 commented 1 year ago

Yes, everything works perfectly, thanks for the help!