Abu-thahir opened this issue 5 days ago
Hi, I think someone else has reported a similar issue. It was due to using Python 3.12 (DoWhy only supports versions below 3.12, e.g., 3.11) and the installed scikit-learn version. Can you double-check that you have DoWhy 0.11.1 installed? (With Python 3.12, the install will fall back to 0.8, I think.)
Generally,

auto_assignment_summary = gcm.auto.assign_causal_mechanisms(scm, data_2021, override_models=True, quality=gcm.auto.AssignmentQuality.GOOD)

in this example should be quite fast (probably under 20 seconds). Can you try uninstalling scikit-learn and re-installing it (or upgrading it)?
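To double-check which versions are actually active in your environment, a quick standard-library check works (a sketch; package names are the PyPI distribution names):

```python
from importlib.metadata import version, PackageNotFoundError

def installed_version(package):
    """Return the installed version string for a distribution, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

print(installed_version("dowhy"))         # should be 0.11.1, not 0.8
print(installed_version("scikit-learn"))
```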
> Also, the gcm.evaluate_causal_model method isn't working for me.
Do you have an error message? If the method was not found, then it is most likely due to having an older DoWhy version installed.
> I also have a question: if I have a causal graph, should I explicitly apply causal mechanisms to each node before using it in gcm? If so, what are all the possible distributions that I should be able to set? Is there a reference for understanding causal mechanisms?
Normally, each node requires a causal mechanism describing its data-generation process. The assign_causal_mechanisms function aims to automate this with some heuristics, so you don't have to do it manually. You can check the documentation for more information about customizing mechanisms if you want to assign them manually. Generally, you can either implement your own model or use an existing wrapper to, e.g., assign any SciPy distribution to root nodes, or regression/classification models for (additive noise models in) non-root nodes. The example notebook https://www.pywhy.org/dowhy/v0.11.1/example_notebooks/gcm_rca_microservice_architecture.html walks through the reasoning for selecting models manually.
@bloebp Thank you for helping me out. I resolved the difficulties by downgrading my Python version. However, there is another issue with OneHotEncoder in dowhy's util package.
Issue 1:
45 if drop_first:
46 drop = "first"
---> 47 encoder = OneHotEncoder(drop=drop, sparse=False) # NB sparse renamed to sparse_output in sklearn 1.2+
49 encoded_data = encoder.fit_transform(data_to_encode)
51 else: # Use existing encoder
TypeError: OneHotEncoder.__init__() got an unexpected keyword argument 'sparse'
The OneHotEncoder parameter sparse was renamed to sparse_output in scikit-learn 1.2 and removed entirely in 1.4, so passing sparse=False now raises a TypeError. This must be updated in DoWhy; otherwise encoding fails.
I attempted to downgrade scikit-learn below 1.2.0, but encountered other dependency/wheel problems, so I think this change is necessary!
I think I found the relevant code. I am wondering, since this is not part of the GCM package, do you manually encode the categorical variables? If you use gcm.fit, it already encodes categorical data automatically, so there is no need to do this manually.
Opened a fix PR: https://github.com/py-why/dowhy/pull/1219
@bloebp, I haven't used gcm, but I've worked extensively with the DoWhy API. In my use case, I want to perform causal analysis for many treatments over the same outcome variables, and the system should be scalable enough to handle concurrent queries. For example, I'm looking for the causal-effect values of a treatment variable named Campaign Name, which has 6 to 7 campaigns, on the outcome variable "Sales". So I one-hot encode the data manually and limit the number of unique values per variable to ten, because one-hot encoding causes the curse of dimensionality.
I am curious whether it is possible to serve requests concurrently if I pass the whole categorical column with 50-100 unique values directly to fit. Also, would this be scalable, or will it cause large memory-consumption problems as dimensionality increases?
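As a rough back-of-the-envelope check on the memory concern (the function name here is made up for illustration): a dense one-hot encoding of a column with k unique values adds k columns, so memory grows linearly in k:

```python
def dense_onehot_bytes(n_rows, n_unique, bytes_per_value=8):
    """Approximate memory of a dense float64 one-hot matrix."""
    return n_rows * n_unique * bytes_per_value

# 1,000,000 rows with 100 unique values at float64:
print(dense_onehot_bytes(1_000_000, 100) / 1e9, "GB")  # 0.8 GB
```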
Ah ok, got it. In this case, since you have a particular target variable in mind, maybe you can check alternative encoding methods, such as CatBoostEncoder. We have an implementation here: https://github.com/py-why/dowhy/blob/main/dowhy/gcm/util/catboost_encoder.py#L37
Basically what you can try is:
my_encoder = CatBoostEncoder()
df['MyCategoricalColumn'] = my_encoder.fit_transform(X=df['MyCategoricalColumn'].to_numpy().reshape(-1), Y=df['MyTargetVariable'].to_numpy().reshape(-1))
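For intuition, CatBoost-style encoding replaces each category with a target statistic computed from earlier rows plus a global prior. A minimal pure-Python sketch of the idea (an illustration only, not DoWhy's CatBoostEncoder implementation):

```python
from collections import defaultdict

def catboost_style_encode(categories, targets, prior_weight=1.0):
    """Ordered target encoding: each row is encoded using target
    statistics from *earlier* rows of the same category only,
    smoothed toward the global target mean to avoid leakage."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    encoded = []
    for cat, y in zip(categories, targets):
        encoded.append((sums[cat] + prior_weight * global_mean)
                       / (counts[cat] + prior_weight))
        sums[cat] += y
        counts[cat] += 1
    return encoded

print(catboost_style_encode(["a", "a"], [1, 0]))  # [0.5, 0.75]
```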
However, CatBoostEncoder will encode the Campaign_Name column, which is a categorical treatment variable, in place within the same column, but what I need here is the ATE value for each campaign with respect to the outcome variable "Sales".
If I one-hot encode, I can retrieve the ATE value for each campaign, because each campaign becomes its own treatment variable against the outcome "Sales".
Is it possible to obtain the ATE values for each campaign in the Campaign_Name variable using CatBoost in GCM?
Ah ok, yeah, the intervention value becomes rather abstract with a CatBoost encoding, while you still have a clear interpretation with one-hot encodings.
So, in the case of GCM, you can explicitly set the campaign value and see the effect; the transformation (e.g., CatBoost) then happens internally. You can check this: https://www.pywhy.org/dowhy/v0.11.1/user_guide/causal_tasks/estimating_causal_effects/effect_estimation_with_gcm.html
Basically, you need to compare two reference treatments, like
gcm.average_causal_effect(causal_model,
'Sales',
interventions_alternative={'Campaigns': lambda x: 'MyFirstCampaign'},
interventions_reference={'Campaigns': lambda x: 'MySecondCampaign'},
num_samples_to_draw=1000)
Generally, using DML for effect estimation might be more robust than a GCM, but you can give it a shot.
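To get one ATE per campaign, you can loop each campaign against a fixed baseline. A small helper sketch (per_campaign_ates and the wiring are hypothetical; with DoWhy you would pass a closure around gcm.average_causal_effect as ate_fn):

```python
def per_campaign_ates(ate_fn, campaigns, baseline):
    """ate_fn(alternative, reference) -> ATE of intervening with
    `alternative` vs. `reference` on the outcome (e.g., 'Sales').
    Returns one ATE per non-baseline campaign."""
    return {c: ate_fn(c, baseline) for c in campaigns if c != baseline}

# Assumed DoWhy wiring, following the gcm.average_causal_effect call above:
# ate_fn = lambda alt, ref: gcm.average_causal_effect(
#     causal_model, 'Sales',
#     interventions_alternative={'Campaigns': lambda x: alt},
#     interventions_reference={'Campaigns': lambda x: ref},
#     num_samples_to_draw=1000)
```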
@bloebp @amit-sharma I tried to run the Online Sales Shop example, which is available here: https://www.pywhy.org/dowhy/v0.11.1/example_notebooks/gcm_online_shop.html.
auto_assignment_summary = gcm.auto.assign_causal_mechanisms(scm, data_2021, override_models=True, quality=gcm.auto.AssignmentQuality.GOOD); print(auto_assignment_summary)
This call has been running for hours with no output. Why is this? Is this the intended behaviour? Also, the gcm.evaluate_causal_model method isn't working for me.
I also have a question: if I have a causal graph, should I explicitly apply causal mechanisms to each node before using it in gcm? If so, what are all the possible distributions that I should be able to set? Is there a reference for understanding causal mechanisms?
Version information: