py-why / dowhy

DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
https://www.pywhy.org/dowhy
MIT License

Auto assign_causal_mechanisms is taking so much time in gcm #1214

Open Abu-thahir opened 5 days ago

Abu-thahir commented 5 days ago

@bloebp @amit-sharma I tried to run the Online Sales Shop example, which is available here: https://www.pywhy.org/dowhy/v0.11.1/example_notebooks/gcm_online_shop.html.

auto_assignment_summary = gcm.auto.assign_causal_mechanisms(scm, data_2021, override_models=True, quality=gcm.auto.AssignmentQuality.GOOD); print(auto_assignment_summary)

This method has been running for hours with no output. Why is that? Is this the intended behaviour? Also, the gcm.evaluate_causal_model method isn't working for me.

I also have a question: if I have a causal graph, should I explicitly assign causal mechanisms to each node before using it in gcm? If so, what are all the possible distributions that I can set? Is there a reference for understanding causal mechanisms?

Version information:

bloebp commented 4 days ago

Hi, I think someone else has reported a similar issue. It was due to using Python 3.12 (DoWhy only supports Python versions below 3.12, e.g., 3.11) and the installed scikit-learn version. Can you double-check that you have DoWhy 0.11.1 installed? (With Python 3.12, it will fall back to 0.8, I think.)

Generally, auto_assignment_summary = gcm.auto.assign_causal_mechanisms(scm, data_2021, override_models=True, quality=gcm.auto.AssignmentQuality.GOOD) should be quite fast in this example (probably under 20 seconds). Can you try to uninstall and reinstall scikit-learn (or upgrade it)?

Also, the gcm.evaluate_causal_model method isn't working for me.

Do you have an error message? If the method was not found, then it is most likely due to having an older DoWhy version installed.

I also have a question: if I have a causal graph, should I explicitly assign causal mechanisms to each node before using it in gcm? If so, what are all the possible distributions that I can set? Is there a reference for understanding causal mechanisms?

Normally, each node requires a causal mechanism to describe its data generation process. The assign_causal_mechanisms function aims to automate this process with some "heuristics", so you don't have to do it manually. You can check the documentation for more information about customizing them if you want to assign them manually. Generally, you can either prepare your own model or use an existing wrapper to, e.g., assign any SciPy distribution to root nodes or regression/classification models for (additive noise models in) non-root nodes. The example notebook https://www.pywhy.org/dowhy/v0.11.1/example_notebooks/gcm_rca_microservice_architecture.html shows some of the reasoning behind selecting the models manually.
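
For illustration, here is a minimal sketch of assigning mechanisms manually; the graph X -> Y -> Z, the synthetic data, and the concrete model choices are just assumptions for this example, not something taken from the online shop notebook:

import networkx as nx
import numpy as np
import pandas as pd
from scipy import stats
from dowhy import gcm

# Hypothetical data and graph X -> Y -> Z, purely for illustration
X = np.random.normal(size=1000)
Y = 2 * X + np.random.normal(size=1000)
Z = np.where(Y > 0, "high", "low")
data = pd.DataFrame({"X": X, "Y": Y, "Z": Z})
scm = gcm.StructuralCausalModel(nx.DiGraph([("X", "Y"), ("Y", "Z")]))

# Root node: e.g., any SciPy distribution (or gcm.EmpiricalDistribution())
scm.set_causal_mechanism("X", gcm.ScipyDistribution(stats.norm))
# Non-root continuous node: additive noise model with a regression model of your choice
scm.set_causal_mechanism("Y", gcm.AdditiveNoiseModel(gcm.ml.create_linear_regressor()))
# Non-root categorical node: functional causal model based on a classifier
scm.set_causal_mechanism("Z", gcm.ClassifierFCM(gcm.ml.create_logistic_regression_classifier()))

gcm.fit(scm, data)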

Abu-thahir commented 1 day ago

@bloebp Thank you for helping me out. I resolved the difficulties by downgrading my Python version. However, there is another issue with OneHotEncoder in dowhy's util package.

Issue 1:

45     if drop_first:
     46         drop = "first"
---> 47     encoder = OneHotEncoder(drop=drop, sparse=False)  # NB sparse renamed to sparse_output in sklearn 1.2+
     49     encoded_data = encoder.fit_transform(data_to_encode)
     51 else:  # Use existing encoder

TypeError: OneHotEncoder.__init__() got an unexpected keyword argument 'sparse'

The sparse parameter of OneHotEncoder in scikit-learn has been renamed to sparse_output. This must be updated; otherwise the encoding fails with the TypeError above.

I attempted to downgrade scikit-learn below 1.2.0, but encountered other dependency/wheel issues, so I think this change is necessary!
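
For reference, a version-tolerant construction could look like the sketch below; this is just an illustration of the idea, not the actual change made in DoWhy:

import inspect
from sklearn.preprocessing import OneHotEncoder

kwargs = {"drop": "first"}
# scikit-learn 1.2+ renamed `sparse` to `sparse_output`; use whichever parameter exists
if "sparse_output" in inspect.signature(OneHotEncoder.__init__).parameters:
    kwargs["sparse_output"] = False
else:
    kwargs["sparse"] = False
encoder = OneHotEncoder(**kwargs)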

bloebp commented 21 hours ago

I think I found the code piece related to this. I am wondering, since this is not part of the GCM package: do you manually encode the categorical variables? If you use gcm.fit, it would already encode categorical data automatically, so there is no need to do this manually.

bloebp commented 21 hours ago

Opened a fix PR: https://github.com/py-why/dowhy/pull/1219

Abu-thahir commented 20 hours ago

I am wondering, since this is not part of the GCM package: do you manually encode the categorical variables? If you use gcm.fit, it would already encode categorical data automatically, so there is no need to do this manually.

@bloebp, I haven't used gcm, but I've worked extensively with the dowhy API. In my use case, I want to perform causal analysis on many treatments against the same outcome variables, and the system should be scalable enough to handle concurrent queries. For example, I'm looking for the causal effect of a treatment variable named Campaign Name, which has 6 to 7 campaigns, on the outcome variable "Sales". So I one-hot encode the data manually and limit the number of unique values per variable to ten, because one-hot encoding leads to a curse of dimensionality.

I am curious whether it is possible to serve requests concurrently if I pass the full categorical data with 50-100 unique values directly to fit. Also, would this be scalable, or will it cause high memory consumption as the dimensionality increases?

bloebp commented 20 hours ago

Ah ok, got it. In this case, since you have a particular target variable in mind, maybe you can check alternative encoding methods, such as CatBoostEncoder. We have an implementation here: https://github.com/py-why/dowhy/blob/main/dowhy/gcm/util/catboost_encoder.py#L37

Basically what you can try is:

# import path inferred from the file location linked above
from dowhy.gcm.util.catboost_encoder import CatBoostEncoder

my_encoder = CatBoostEncoder()
df['MyCategoricalColumn'] = my_encoder.fit_transform(
    X=df['MyCategoricalColumn'].to_numpy().reshape(-1),
    Y=df['MyTargetVariable'].to_numpy().reshape(-1))

Abu-thahir commented 20 hours ago

However, CatBoostEncoder will encode the Campaign_Name column, which is a categorical treatment variable, in place within the same column, but what I need here is the ATE value for each campaign on the outcome variable "Sales".

If I use one-hot encoding, I can retrieve an ATE value for each campaign, because each campaign is treated as its own treatment variable against the outcome "Sales".

Is it possible to obtain the ATE values for each campaign in the Campaign_Name variable when using the CatBoost encoder in GCM?

bloebp commented 20 hours ago

Ah ok, yeah, the intervention value becomes rather abstract if you do a CatBoost encoding, while you still have a clear interpretation with one-hot encodings.

So, in the case of GCM, you can explicitly set the campaign value and see the effect; the transformation (e.g., CatBoost) would then happen internally. You can check this: https://www.pywhy.org/dowhy/v0.11.1/user_guide/causal_tasks/estimating_causal_effects/effect_estimation_with_gcm.html

Basically, you need to compare two reference treatments, like

gcm.average_causal_effect(causal_model,
                         'Sales',
                         interventions_alternative={'Campaigns': lambda x: 'MyFirstCampaign'},
                         interventions_reference={'Campaigns': lambda x: 'MySecondCampaign'},
                         num_samples_to_draw=1000)

Generally, using DML (double machine learning) for effect estimation might be more robust than a GCM, but you can give it a shot.
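
For context, an end-to-end version of the GCM approach for this use case might look roughly like the sketch below; the column names, the tiny synthetic dataset, and the single-edge graph are assumptions for illustration only, not the actual online shop data:

import networkx as nx
import numpy as np
import pandas as pd
from dowhy import gcm

# Hypothetical data: a categorical campaign column and a numeric sales column
rng = np.random.default_rng(0)
campaigns = rng.choice(["MyFirstCampaign", "MySecondCampaign", "MyThirdCampaign"], size=2000)
sales = np.where(campaigns == "MyFirstCampaign", 120.0, 100.0) + rng.normal(0, 10, size=2000)
data = pd.DataFrame({"Campaigns": campaigns, "Sales": sales})

causal_model = gcm.StructuralCausalModel(nx.DiGraph([("Campaigns", "Sales")]))
gcm.auto.assign_causal_mechanisms(causal_model, data)  # categorical data is encoded internally
gcm.fit(causal_model, data)

# ATE of running MyFirstCampaign instead of MySecondCampaign on Sales
ate = gcm.average_causal_effect(causal_model,
                                'Sales',
                                interventions_alternative={'Campaigns': lambda x: 'MyFirstCampaign'},
                                interventions_reference={'Campaigns': lambda x: 'MySecondCampaign'},
                                num_samples_to_draw=1000)
print(ate)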