Feature relevance/Influence

py-why / dowhy

DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.

https://www.pywhy.org/dowhy

MIT License

7.15k stars 935 forks source link

Feature relevance/Influence #1169

Closed ankur-tutlani closed 4 months ago

ankur-tutlani commented 7 months ago

I have a dataset with some X and Y features. Y is target node and X represents nodes. I have got the assessment from evaluate_causal_model that the KL divergence is low and DAG is not rejected. I want to know the contribution of X nodes to Y. X represents couple of nodes for which I want to know the contribution to Y. I am using fit_and_compute along with confidence_intervals to get some range of values. I have a dataset which contains all categorical features (X) and categorical target (Y) (not binary). But I can think about creating the dataset with numeric features X and numeric target in case that helps in addressing what I am looking for below.

I am evaluating 3 possible functions which I think can answer. arrow_strength, intrinsic_causal_influence and parent_relevance. I understand the underlying logic is different what each can capture. I am trying to make them as comparable as possible using gcm.divergence.auto_estimate_kl_divergence inside difference_estimation_func (arrow_strength) or attribution_func (intrinsic_causal_influence ) or subset_scoring_func (parent_relevance). This is inspired by response on this issue Quantifying Arrow Strength - What Values Are Considered Strong/Weak? · Issue #574 · py-why/dowhy · GitHub.I am able to run and got some results. But the results from all 3 are very different. Results from arrow_strength gives positive values (18-19), while intrinsic_causal_influence and parent_relevance provides negative values. Results from intrinsic_causal_influence gives values of -9 as the lowest, but for all nodes values are negative. Similarly, for parent_relevance values are negative and similar to intrinsic_causal_influence. The lowest value is about -8.4 for the noise and for the edges about -5. I assume specifying KL divergence should result in positive values. Does it make sense to use divergence in the way I specified in all 3 cases or only make sense in arrow_strength? I saw in the source code, the default scoring function (when nothing is specified) use functions from gcm.uncertainty or gcm.divergence or gcm.util.general.

What would be the interpretation of arrow_strength results which provides positive values (18-19) as output. The documentation says removing the edge say from X1 > X2 with value of 19, increases variance of Y by 19 units, how to interpret this in the context of categorical target node? Is there any way to get total KL divergence of target node? If yes, and assume the number is 100, can we say 18/100 = 18% of KL divergence in target is explained by edge X1>X2 or it does not make sense?

I also tried running intrinsic_causal_influence and parent_relevance with their default values, without specifying any value for attribution_func and subset_scoring_func. And this time the output of all is between 0 and 1. For parent_relevance, the noise value came out to be negative (-0.01). What does these values mean? How to interpret this in the context of explaining causal relation of X features with target node (Y). Is there any way to quantify, X1 contributes 10% to Y , X2 5% etc. and may be some % is unexplained by any of the X features or noise?

Above is for overall dataset. Similar thing I want to achieve for individual samples also. I am trying to understand the contribution of factors which can explain any causal relation with target node Y. I came across these 2 functions which can help achieve that, intrinsic_causal_influence_sample and feature_relevance_sample. I came across attribute_anomalies also, but I assume that sets some threshold for outlier which I don’t know of. Since I have categorical target node, so intrinsic_causal_influence_sample is not applicable, and the only choice left is feature_relevance_sample. This function requires specifying prediction_method. But that requires training a separate ML model which can give better results. My objective is not the prediction (which is my interpretation of training an ML), but the contribution of factors into the observed value of Y for individual samples. Do you still think it makes sense to use this function as shap considers the deviation of average actual from prediction? And if yes, what would be the interpretation of output from running this function as all these are deviation from prediction of individual sample with the average of overall dataset?

Also, is there any way to get the contribution of target itself to the target for individual samples, like the way we get from running parent_relevance (noise relevance) and intrinsic_causal_influence? If not, say from overall (aggregate) analysis we get that 80% of variation is caused by noise/target, what can it infer something for the individual samples from that aggregate data? Say for an individual sample we say there are 4 features, and all contribute about 25% each. Can we say this 25% contribution of each of these X factors into variation of Y explains only about 20% variation in Y, since 80% variation in Y (on average) is unexplained by any of these X features?

In nutshell, I am trying to quantify the contribution of X features to Y. And what % of Y variation is unexplained by any of the X nodes. I want to do this for overall (global) dataset and for local individual samples. Contribution meaning in terms of any causal strength of X to Y. Or if we say X1 causes Y, what is the confidence score on this statement? Or if Y increases by 5% in a month, how much of that 5% is caused by X1, X2, .. etc.

Version information:

DoWhy version [0.11.1]

bloebp commented 7 months ago

Hey, thanks a lot for the detailed question! It is a lot, let me try to break it down further:

I have a dataset with some X and Y features. Y is target node and X represents nodes. I have got the assessment from evaluate_causal_model that the KL divergence is low and DAG is not rejected. I want to know the contribution of X nodes to Y. X represents couple of nodes for which I want to know the contribution to Y. I am using fit_and_compute along with confidence_intervals to get some range of values. I have a dataset which contains all categorical features (X) and categorical target (Y) (not binary). But I can think about creating the dataset with numeric features X and numeric target in case that helps in addressing what I am looking for below.

Generally, most methods work with categorical targets. The only exception is if you want to perform counterfactuals (instead of interventions) with non-root categorical nodes. There, we currently assume to estimate a point estimate for the counterfactual, but one would need a distribution. In your case, however, it seems that having a categorical target (and upstream nodes) is fine.

I am evaluating 3 possible functions which I think can answer. arrow_strength, intrinsic_causal_influence and parent_relevance. I understand the underlying logic is different what each can capture. I am trying to make them as comparable as possible using gcm.divergence.auto_estimate_kl_divergence inside difference_estimation_func (arrow_strength) or attribution_func (intrinsic_causal_influence ) or subset_scoring_func (parent_relevance). This is inspired by response on this issue https://github.com/py-why/dowhy/issues/574.I am able to run and got some results. But the results from all 3 are very different.

As you mentioned, the main issue is probably that they inherently capture different aspects and I see the difficulty in choosing what one actually wants. While you can definitely try to use the same measure to make the quantities more comparable, I am not sure if a 1:1 comparison makes much sense here. Without diving too much into the details of the difference, maybe on a high level, you can take a look at the qualitative comparison (e.g., compare the ranking of the features between the approaches and see which is closer to what you would intuitively expect).

Maybe quick overview:

arrow_strength: Measures the direct edge between two nodes (ignoring all indirect paths). Here, we only look at how much a node contributes to the child node directly, but do not distinguish between how much the node just propagates information from their parents. For instance, X -> Y -> Z, if Y := X (Y is just a copy of X), it will still have a large direct influence on Z.
intrinsic_causal_influence: Measures the contribution to the target node (e.g., entropy) by incorporating direct and indirect paths (e.g., if the influence is over a different node as well, not just directly). Here, however, it only measures the new information that the node adds. Taking the same example, X -> Y -> Z with Y := X, then Y would get 0 influence on Z, because it is just propagating the influence of X. Here, the intrinsic part focuses on the new information that has been added. So, if we have e.g., Y := X + noise, then the noise here is the new information added by Y, and this is the contribution on Z that is estimated. Here, we can also incorporate the noise from Z itself that has been added.
parent_relevance: This is the 'typical' Shapley feature relevance estimation that you also do in SHAP, just here we add the noise of the target node as an explicit feature, i.e., the relevance estimated here focuses on the actual discriminative information of the parents, while aspects that cannot be explained are attributed to the noise of the target.

For the attribution functions, you can use the KL divergence for arrow strength and the negative entropy of the probabilities in the case of the intrinsic contribution (e.g. -estimate_entropy_of_probabilities(x) from the uncertainty module). It's important here that if it is an uncertainty measure, it should be the negative of it (so, -estimate_entropy_of_probabilities instead of estimate_entropy_of_probabilities). This is because it compares larger with smaller subsets and in the case of uncertainty measures, the smaller subsets get larger uncertainty than larger subsets. It will ultimately just flip the sign of the contributions, but then you would get positive ones instead of negative ones. In the case of feature relevance, the meaning becomes quite different. Here, you are estimating how much (on average) a feature contributes to achieving the (conditional) mean of the measure of the target. Let's say for variance, you would estimate the contribution to Var[Y|x] (or for the particular function parents_relevance, it is to E_X[Var[Y|X]] = Var[Y]). Here, you can also use entropy, but you should be aware that this is different from the direct and intrinsic influence. The (probably not really accurate) interpretation of parent_relevance with entropy would be a version of direct arrow strength that also incorporates indirect paths via other features.

Results from arrow_strength gives positive values (18-19), while intrinsic_causal_influence and parent_relevance provides negative values. Results from intrinsic_causal_influence gives values of -9 as the lowest, but for all nodes values are negative. Similarly, for parent_relevance values are negative and similar to intrinsic_causal_influence. The lowest value is about -8.4 for the noise and for the edges about -5. I assume specifying KL divergence should result in positive values. Does it make sense to use divergence in the way I specified in all 3 cases or only make sense in arrow_strength? I saw in the source code, the default scoring function (when nothing is specified) use functions from gcm.uncertainty or gcm.divergence or gcm.util.general.

Do you have many deterministic relationships in there? You can definitely use these measures as mentioned above. Just the values have different meanings with respect to what kind of attribution they are capturing.

What would be the interpretation of arrow_strength results which provides positive values (18-19) as output. The documentation says removing the edge say from X1 > X2 with value of 19, increases variance of Y by 19 units, how to interpret this in the context of categorical target node?

In the case of categorical data, it uses KL divergence, i.e., it measures how much information X1 contributes to X2. So, the arrow strength indicates how much the distributions would change when removing the direct influence, which should give insights into the relevance of it. While KL divergence has many well-defined interpretations, it is always a bit abstract and I unfortunately don't have an easier answer there. That is also why we use variance in the numerical case since this is much easier to explain/interpret.

Is there any way to get total KL divergence of target node? If yes, and assume the number is 100, can we say 18/100 = 18% of KL divergence in target is explained by edge X1>X2 or it does not make sense?

KL divergence is the difference between two distributions. However, I think you want to just have a percentage of how much an edge contributes. In the case of direct arrow strength, this only makes sense if you assume the relationship is deterministic (i.e., no noise). Then what you can do is: (arrow strength of X1) / (sum over all arrow strengths)

I also tried running intrinsic_causal_influence and parent_relevance with their default values, without specifying any value for attribution_func and subset_scoring_func. And this time the output of all is between 0 and 1. For parent_relevance, the noise value came out to be negative (-0.01). What does these values mean? How to interpret this in the context of explaining causal relation of X features with target node (Y). Is there any way to quantify, X1 contributes 10% to Y , X2 5% etc. and may be some % is unexplained by any of the X features or noise?

If the noise value has a small value (-0.01), then it is a good indicator that it is deterministic (i.e., the parents seem to 'perfectly' explain your target). The noise generally indicates the amount that is not explained by the other features (parents). Since we use variance/entropy by default here, you can get the percentages as mentioned before: take the attribution of node Xi and divide it by the sum over all. This approach just doesn't work if the attributions are positive and negative.

Above is for overall dataset. Similar thing I want to achieve for individual samples also. I am trying to understand the contribution of factors which can explain any causal relation with target node Y. I came across these 2 functions which can help achieve that, intrinsic_causal_influence_sample and feature_relevance_sample. I came across attribute_anomalies also, but I assume that sets some threshold for outlier which I don’t know of.

attribute_anomalies is for attributing anomalous values to upstream nodes. You don't need to know a threshold, since it uses a (continuous) scale and an anomaly scorer invariant approach for this. That being said, it only makes sense for explaining anomalous observations, which is probably not your use case.

Since I have categorical target node, so intrinsic_causal_influence_sample is not applicable, and the only choice left is feature_relevance_sample. This function requires specifying prediction_method. But that requires training a separate ML model which can give better results. My objective is not the prediction (which is my interpretation of training an ML), but the contribution of factors into the observed value of Y for individual samples. Do you still think it makes sense to use this function as shap considers the deviation of average actual from prediction? And if yes, what would be the interpretation of output from running this function as all these are deviation from prediction of individual sample with the average of overall dataset?

For single samples when it is categorical, then only feature_relevance_sample would be left. Note that parent_relevance is also just using feature_relevance_sample but with a prediction_methodthat takes the noise as an explicit feature. And then averages these over multiple samples. That being said, it makes sense to use the 'typical' SHAP approach, but by default, the difference from the expectation cannot just be averaged over the datasets (since the attributions would average to 0). Therefore, we are using the variance (or you can also use the squared differences). So, I think it can make sense to take a look at these relevances as long as the prediction model is a causal one (i.e., only causal upstream nodes of the target are used). That being said: feature_relevance_sample is basically the same as SHAP, but with more flexibility in configuration (and potentially much faster for non-tree based models).

Also, is there any way to get the contribution of target itself to the target for individual samples, like the way we get from running parent_relevance (noise relevance) and intrinsic_causal_influence? If not, say from overall (aggregate) analysis we get that 80% of variation is caused by noise/target, what can it infer something for the individual samples from that aggregate data? Say for an individual sample we say there are 4 features, and all contribute about 25% each. Can we say this 25% contribution of each of these X factors into variation of Y explains only about 20% variation in Y, since 80% variation in Y (on average) is unexplained by any of these X features?

You might need a modified version of parent_relevance for this, where you take the causal mechanism (that takes the input features and the noise) and use this as the prediction model for feature_relevance_sample. Then you get the relevance for features and the noise for a specific sample. However, note that you might not be able to have it in percentage (depending on the attribution method).

In nutshell, I am trying to quantify the contribution of X features to Y. And what % of Y variation is unexplained by any of the X nodes. I want to do this for overall (global) dataset and for local individual samples. Contribution meaning in terms of any causal strength of X to Y. Or if we say X1 causes Y, what is the confidence score on this statement? Or if Y increases by 5% in a month, how much of that 5% is caused by X1, X2, .. etc.

So, I think it boils down to what kind of contribution you are interested in. Just taking the example: X -> Y -> Z with Y := X being a copy of X, then you only need Y to predict Z and can ignore X (i.e., Y has super high relevance), but if you are interested in how much new information Y contributes or which feature you would need to intervene on to change Z, then X is probably the most relevant node.

That being said, your use case also sounds related to: https://github.com/py-why/dowhy/blob/2cd8e8147ba9809171e4299c603ca741ff380bfb/dowhy/gcm/unit_change.py, maybe you can take a look at this module.

Hope this gives some more insights/understanding.

ankur-tutlani commented 7 months ago

Apologies for a long question and thanks much for your detailed explanation. Some follow-up questions if you can address, please.

• In the case of intrinsic_causal_influence, if we get all values as positive and I assume it uses entropy for categorical target as the default (estimate_entropy_of_probabilities) for attribution_func. To get the individual X values contribution in % values, we need to add the corresponding influence from target as well while calculating the total which can’t be explained by any of the features?

• For feature_relevance_sample function, what should be the subset_scoring_func in case of categorical target node assuming non-binary? means_difference don’t make sense here. Although I tried with means_difference, and able to match the sum of individual shap values with the difference between prediction of sample and average of overall dataset (target is label encoded). Any other function which can help achieve something similar for categorical target where we can match sum of contribution of individual shap values with delta between individual prediction and overall average? I tried with gcm.divergence.estimate_kl_divergence_categorical, and got some results, e.g. array([[ 0.32080822, -0.10049755, 1.1483414 , 0.1367922 ]]) but how to tie those individual values to difference between prediction and total average like possible in case of means_difference? I somehow not able to run gcm.divergence.estimate_kl_divergence_of_probabilities, when specified with model.predict_proba in the prediction_method. Any recommendation you can suggest here?

• What if the target is ordinal (integers with an ordering)? I assume its treated as Discrete in the model. Will it be treated as continuous for the purpose of feature_relevance_sample or intrinsic_causal_influence_sample? If yes, what is the recommendation on using attribute_func and subset_scoring_fun?

• Thanks for sharing the unit_change link. I glanced through it, but it seems like it works when there is deterministic relation between Y and X. Although there is a section in the appendix in the paper which talks about stochastic relation, but it does not seem to be integrated yet in unit_change? Is this understanding correct?

bloebp commented 7 months ago

In the case of intrinsic_causal_influence, if we get all values as positive and I assume it uses entropy for categorical target as the default (estimate_entropy_of_probabilities) for attribution_func. To get the individual X values contribution in % values, we need to add the corresponding influence from target as well while calculating the total which can’t be explained by any of the features?

Yes, in the case of intrinsic_causal_influence, the influence of the target itself represents the unexplained part. So, if everything is positive, it is fair to get the percentage of a feature by taking its contribution divided by the sum over all (including the target). In this regard, if, e.g., nothing can be explained by the features, then the target should get the sole contribution (and the features have all 0%, while the target itself has 100%).

For feature_relevance_sample function, what should be the subset_scoring_func in case of categorical target node assuming non-binary? means_difference don’t make sense here. Although I tried with means_difference, and able to match the sum of individual shap values with the difference between prediction of sample and average of overall dataset (target is label encoded). Any other function which can help achieve something similar for categorical target where we can match sum of contribution of individual shap values with delta between individual prediction and overall average? I tried with gcm.divergence.estimate_kl_divergence_categorical, and got some results, e.g. array([[ 0.32080822, -0.10049755, 1.1483414 , 0.1367922 ]]) but how to tie those individual values to difference between prediction and total average like possible in case of means_difference? I somehow not able to run gcm.divergence.estimate_kl_divergence_of_probabilities, when specified with model.predict_proba in the prediction_method. Any recommendation you can suggest here?

In the case of categorical targets, you can use variance_of_matching_values (from dowhy.gcm.util.general import variance_of_matching_values) as the attribution function here. The idea is that if a feature is important, perturbing its values (i.e., randomizing them) is likely to result in predictions that differ significantly from the baseline values (i.e., predicted labels based on non-perturbed inputs). In this case, the variance of matching values will be high, indicating a strong influence of the feature on the predictions. Conversely, if a feature is less important, perturbing its values may not have a substantial impact on the predictions, resulting in a lower variance of matching values. This can be seen as the contribution to the (conditional) variance.

With regards to using entropy, maybe this unit test here can give some inspiration: https://github.com/py-why/dowhy/blob/2cd8e8147ba9809171e4299c603ca741ff380bfb/tests/gcm/test_feature_relevance.py#L160 Note, however, that this only makes sense at the population level, not for a single sample.

What if the target is ordinal (integers with an ordering)? I assume its treated as Discrete in the model. Will it be treated as continuous for the purpose of feature_relevance_sample or intrinsic_causal_influence_sample? If yes, what is the recommendation on using attribute_func and subset_scoring_fun?

Yes, the underlying model would be discrete, but for the attribution functions, it is treated as continuous. While a function that directly exploits the discrete values is probably more suitable, treating it simply as continuous here should be fine qualitatively. That being said, you also pass in an entropy version explicitly that expects the values to be discrete: https://github.com/py-why/dowhy/blob/2cd8e8147ba9809171e4299c603ca741ff380bfb/dowhy/gcm/uncertainty.py#L83

Thanks for sharing the unit_change link. I glanced through it, but it seems like it works when there is deterministic relation between Y and X. Although there is a section in the appendix in the paper which talks about stochastic relation, but it does not seem to be integrated yet in unit_change? Is this understanding correct?

I am not super familiar with the unit_change module, but I assume it focuses on (deterministic) prediction models. However, what you could always do is to treat the noise of the model as an explicit input (similar to parent_relevance) to treat the non-deterministic causal mechanisms as deterministic. This is, fit a causal mechanism and then make a "custom model" that expects, e.g., the noise as a feature as well (see e.g. https://github.com/py-why/dowhy/blob/2cd8e8147ba9809171e4299c603ca741ff380bfb/dowhy/gcm/feature_relevance.py#L95). Otherwise, I think the unit_change is rather focusing on determining whether input features or the underlying mechanisms have changed. Now thinking about it, I am not sure if this is really relevant for your use case.

ankur-tutlani commented 7 months ago

Thanks, Patrick, for your response. A few more clarifications, if you can answer please. Apologies for frequent questions.

• When using feature_relevance_sample function along with subset_scoring_func as variance_of_matching_values, I am getting some values with negative sign with categorical target. What do these negative values indicate since the output of variance_of_matching_values is -np.var((randomized_predictions == baseline_values).squeeze()) so I assume it should result positive values only? Although I am getting it for small % of data (~ 10%) and the average magnitude is -0.01 to -0.02. Does it mean presence of the feature decreases the likelihood of observed target value? When trying to answer, say top n features for a sample, the feature with negative values should be ranked as least important and the ones with the higher positive value should be marked as most important. Is this correct way to interpret?

• Is the output from feature_relevance_sample function comparable amongst themselves and across different samples assuming subset_scoring_func as variance_of_matching_values and categorical target? Say for sample1, I get these values {X1: 0.05, X2:0.02}. Sample2, {X1:0.02, X2:0.07}. Can we say in sample1, X1 contribution is 71% (0.05/0.07) and X2 contribution is 29% (0.02/0.07)? or X1 is 2.5 times (0.05/0.02) more effective than X2 in sample 1? What would happen in case of negative values?

• Continuing the same example as above, can this be comparable across different samples? Can we say average causal contribution of X1 is 0.035 {average of (X1 contribution in sample 1) 0.05, (X1 contribution in sample 2) 0.02} and average causal contribution of X2 is 0.045 {average of 0.02 (sample1),0.07 (sample2)}? Or does it make sense to use some other aggregation function like sum? Assume that sample1 and sample 2 are in baseline_samples and baseline_samples are a subset of data provided in feature_samples. I agree for aggregated data, we have parent_relevance and other functions but say if I am only looking at these 2 samples and may not have leverage to run parent_relevance for 2 samples. You mentioned that these values are a contribution towards conditional variance, do you mean conditional variance of target provided in feature_samples? Is the output from feature_relevance_sample with baseline_samples (assume more than 1 sample) represent a contribution towards same conditional variance of target in feature_samples?

bloebp commented 7 months ago

When using feature_relevance_sample function along with subset_scoring_func as variance_of_matching_values, I am getting some values with negative sign with categorical target. What do these negative values indicate since the output of variance_of_matching_values is -np.var((randomized_predictions == baseline_values).squeeze()) so I assume it should result positive values only? Although I am getting it for small % of data (~ 10%) and the average magnitude is -0.01 to -0.02. Does it mean presence of the feature decreases the likelihood of observed target value? When trying to answer, say top n features for a sample, the feature with negative values should be ranked as least important and the ones with the higher positive value should be marked as most important. Is this correct way to interpret?

These small negative values are most likely due to the stochastic component of the estimation here. Generally, even with totally irrelevant features, the difference $v(S \cup {i}) - v(S)$ of the set functions (here, $v$ would be the negative variance, $S$ a subset of features and $i$ is e.g., an irrelevant feature) should be 0. However, computing $v(S \cup {i})$ and $v(S)$ involves randomized samples, which could lead to $v(S) > v(S \cup {i})$, i.e., the contribution becomes negative.

That being said, for this particular set function, I would probably treat these small negative values as 0 for the sake of ranking/interpretation.

Is the output from feature_relevance_sample function comparable amongst themselves and across different samples assuming subset_scoring_func as variance_of_matching_values and categorical target? Say for sample1, I get these values {X1: 0.05, X2:0.02}. Sample2, {X1:0.02, X2:0.07}. Can we say in sample1, X1 contribution is 71% (0.05/0.07) and X2 contribution is 29% (0.02/0.07)? or X1 is 2.5 times (0.05/0.02) more effective than X2 in sample 1? What would happen in case of negative values?

The variance should generally be comparable. Not sure if I would use the term "effective", but maybe relevant? Basically, it indicates that the particular value for X1 in sample 1 has (much?) more influence in changing the output if perturbed, i.e., it should be more relevant for it. While this flips in sample 2.

The percentages are OK like this, and, as discussed above, you could set the negative values to 0 for this particular set function. The scale here is quite small, so it might be important to double-check how significant these values are (e.g., using the bootstrap API to get a median over multiple runs and a 95% confidence interval).

Continuing the same example as above, can this be comparable across different samples? Can we say average causal contribution of X1 is 0.035 {average of (X1 contribution in sample 1) 0.05, (X1 contribution in sample 2) 0.02} and average causal contribution of X2 is 0.045 {average of 0.02 (sample1),0.07 (sample2)}? Or does it make sense to use some other aggregation function like sum? Assume that sample1 and sample 2 are in baseline_samples and baseline_samples are a subset of data provided in feature_samples. I agree for aggregated data, we have parent_relevance and other functions but say if I am only looking at these 2 samples and may not have leverage to run parent_relevance for 2 samples. You mentioned that these values are a contribution towards conditional variance, do you mean conditional variance of target provided in feature_samples? Is the output from feature_relevance_sample with baseline_samples (assume more than 1 sample) represent a contribution towards same conditional variance of target in feature_samples?

If you take the average here, it is of course only with respect to these two samples. The important point here is that the baseline samples should represent the population (i.e., the baseline samples should not contain a selection bias). Now looking at this particular set function again, it is strictly speaking not the conditional variance, but rather the variance of the delta function (that has the baseline and perturbed predictions as input). Alternatives to this are something like:

Instead of the variance, take the mean of the delta function: -np.mean((randomized_predictions == baseline_values).squeeze())
You can make use of the computed probabilities for the categories more directly by taking the L1 difference between the probabilities. Here, the predictions should be predict_probabilities (a vector of probabilities for each category) and not just the category itself: -np.mean(np.abs(randomized_predictions - baseline_values)) (not sure if some reshaping is required here, didn't test the code directly). Note that this set function only makes sense for single samples, not when averaging.

These might give you some scales that are not that close to 0. As you see, each of these set functions quantifies the relevance with respect to different things. Qualitatively, they might be quite similar in the rankings of the features; however, quantitatively (i.e., x1 is % relevant), it might make some important differences. Sorry for not giving a more definite answer here. I would probably go with the entropy set function using the predicted category probabilities (as referenced in the previous answer).

Keep in mind, the contribution is always with respect to $v(\text{full subset}) - v(\emptyset)$. For example, in case of the L1 difference and a single sample x, this means:

$v(\text{full subset}) = ||P(Y \mid x) - P(Y)||_\text{L1}$
$v(\emptyset) = ||P(Y \mid \emptyset) - P(Y)||_\text{L1} = 0$
$v(\text{full subset}) - v(\emptyset) = ||P(Y \mid x) - P(Y)||_\text{L1}$

The strict interpretation here is the contribution to the quantity $||P(Y \mid x) - P(Y)||_\text{L1}$. The quantities $P(Y)$ and $P(Y \mid x)$ are computed based on the baseline samples. Here, you also see why not all set functions are applicable for averaging over multiple samples. If you would average it over the whole population, then the average L1 differences add up to 0 (due to $\mathbb{E}_X[P(Y \mid X) - P(Y)] = P(Y) - P(Y) = 0$).

ankur-tutlani commented 7 months ago

Thanks again for the response and detailed explanation. Can you please validate if below functions are correct to use in subset_scoring_func? I am trying to get some ranking of features for a sample. Preferably quantitative, else qualitative if not possible. Using feature_relevance_sample function with a categorical target.

def variance_of_means_test2(randomized_predictions: np.ndarray, baseline_values: np.ndarray) -> np.ndarray:
    return -np.mean(np.abs(randomized_predictions - baseline_values))

Output values are positive when used with model.predict_proba. Assuming we can take the individual X1 value /sum(all X) as the % contribution? Is this correct?

Using entropy with probability vectors, you mentioned above that makes sense for population, not for individual sample. However, I tried with below using model.predict_proba:

def estimate_entropy_of_probabilities2(X: np.ndarray) -> float:
    return float((entropy(X, axis=1)))

def my_entropy_feature_attribution_function4(
    randomized_predictions: np.ndarray, baseline_predictions: np.ndarray
) -> float:

    h_p_Y = estimate_entropy_of_probabilities2(baseline_predictions.reshape(1, -1))
    h_p_Y_do_xs = estimate_entropy_of_probabilities2(randomized_predictions.reshape(1, -1))

return np.abs(h_p_Y_do_xs - h_p_Y)

Are the two functions above, correct? In the output, there are negative values for some features, and these are large (-0.2, -0.3). How to get ranking (qualitative/ quantitative) in this context? I am assuming averaging across samples does not make sense for this set function also. Is this correct?

bloebp commented 7 months ago

Output values are positive when used with model.predict_proba. Assuming we can take the individual X1 value /sum(all X) as the % contribution? Is this correct?

Wondering if this is always positive for all samples? On a population level, the values should actually average to 0 (it should only make sense for single samples), but I might have missed something. Otherwise, yes, the percentage via the sum should be fine.

Are the two functions above, correct?

Double-check whether baseline_predictions and randomized_predictions are truly only one sample. If it is really one sample, it should be fine. If there are multiple samples, then the reshape would lead to a very long vector of probabilities since all samples are concatenated there, which does not make sense.

That being said, we are actually looking at the entropy based on the probability vector of a given sample, i.e., it should also be a valid set function for a single sample.

For the ranking with negative values, I would generally consider the absolute value of negative contributions here since a perturbation of that particular feature still leads to a change in the statistic you are interested in. Therefore, it is relevant for predicting that outcome. That being said, I think the easiest and most robust set function should be the L1 difference (absolute difference between the probability vectors).

ankur-tutlani commented 6 months ago

Thanks for your response.

Wondering if this is always positive for all samples?

Yes, these are positive for majority of baseline_samples (~ 550 in count). There are 4 features, and out of 4 only for 1 feature there are ~ 10 records with negative value. Also, the average value (output from running feature_relevance_sample using variance_of_means_test2 as subset_scoring_func as per defined above) for each of the features in entire baseline_samples is ~ 0.007 and sum is ~ 4 - 4.5. Is this expected or low?

Double-check whether baseline_predictions and randomized_predictions are truly only one sample.

Do you mean running feature_relevance_sample with any 1 baseline_samples at a time or multiple samples together inside baseline_samples? I am getting exact same results in both the scenarios (running sample1 individually versus sample1 as one of the samples passed together in baseline_samples). This is using feature_relevance_sample using my_entropy_feature_attribution_function4 as subset_scoring_func per defined above. Is this expected?

bloebp commented 6 months ago

Yes, these are positive for majority of baseline_samples (~ 550 in count). There are 4 features, and out of 4 only for 1 feature there are ~ 10 records with negative value. Also, the average value (output from running feature_relevance_sample using variance_of_means_test2 as subset_scoring_func as per defined above) for each of the features in entire baseline_samples is ~ 0.007 and sum is ~ 4 - 4.5. Is this expected or low?

These are all positive for the variance set function (not the absolute difference)? In case of the variance, that is expected. A value of 0.007, however, seems to be quite low, even for the variance of a binary variable. How are the results when you take the average absolute difference of the probabilities?

Do you mean running feature_relevance_sample with any 1 baseline_samples at a time or multiple samples together inside baseline_samples? I am getting exact same results in both the scenarios (running sample1 individually versus sample1 as one of the samples passed together in baseline_samples). This is using feature_relevance_sample using my_entropy_feature_attribution_function4 as subset_scoring_func per defined above. Is this expected?

What I meant was to simply add a print statement (print(baseline_samples.shape, randomized_predictions.shape)) or so into the set function to see if the shape is 1xc or nxc where c is the number of categories. It should be 1xc for that paricular set function.

Just to clarify, you mean passing sample1 alone vs having it as part of a batch of feature_samples? The baseline_samples should always be your training/all observed data (or a (uniformly smapled) subset to reduce computation time).

ankur-tutlani commented 6 months ago

These are all positive for the variance set function (not the absolute difference)? In case of the variance, that is expected. A value of 0.007, however, seems to be quite low, even for the variance of a binary variable. How are the results when you take the average absolute difference of the probabilities?

These values are from using set function which takes the mean of absolute difference between randomized predictions and baseline predictions. These are from using categorical target node. It is using below function.

def variance_of_means_test2(randomized_predictions: np.ndarray, baseline_values: np.ndarray) -> np.ndarray:
    return -np.mean(np.abs(randomized_predictions - baseline_values))

Just to clarify, you mean passing sample1 alone vs having it as part of a batch of feature_samples? The baseline_samples should always be your training/all observed data (or a (uniformly smapled) subset to reduce computation time).

I meant passing single sample or in batches inside baseline_samples? Based on the code and following link, I assumed baseline_samples are for which we want feature_relevance. I am using training data inside feature_samples and a subset of it inside baseline_samples. Is this correct?

https://www.pywhy.org/dowhy/v0.11.1/user_guide/causal_tasks/root_causing_and_explaining/feature_relevance.html#how-to-use-it

https://github.com/py-why/dowhy/blob/333c9a8afaa76dbfeabf1f01e326143aa3f6ce4d/dowhy/gcm/feature_relevance.py#L221

And I checked the following line inside feature_relevance_sample function, so seems it is passing individual samples to subset_scoring_func.

https://github.com/py-why/dowhy/blob/333c9a8afaa76dbfeabf1f01e326143aa3f6ce4d/dowhy/gcm/feature_relevance.py#L268

bloebp commented 6 months ago

These values are from using set function which takes the mean of absolute difference between randomized predictions and baseline predictions. These are from using categorical target node. It is using below function.

But the predictions here are the probabilities (predict_proba), right? If they are the categorical values, then this would not make sense, since we would then assume they are ordinal.

I meant passing single sample or in batches inside baseline_samples? Based on the code and following link, I assumed baseline_samples are for which we want feature_relevance. I am using training data inside feature_samples and a subset of it inside baseline_samples. Is this correct?

Ah yes, sorry, you are right! feature_samples are population samples, and baseline_samples are the samples to estimate the relevance for.

ankur-tutlani commented 6 months ago

But the predictions here are the probabilities (predict_proba), right? If they are the categorical values, then this would not make sense, since we would then assume they are ordinal.

Correct, these are probability values (output from predict_proba). But the relevance values are very low and does not seem to be adding up to ~ 0 for population using this function. Anything which could explain this?

def variance_of_means_test2(randomized_predictions: np.ndarray, baseline_values: np.ndarray) -> np.ndarray:
    return -np.mean(np.abs(randomized_predictions - baseline_values))

bloebp commented 6 months ago

I think the set function is missing a sum. So, we want to get the (average) L1 distance (between vectors), but currently, it is only the average of the absolute differences of each dimension for each row. Therefore, it should be: np.mean(np.sum(np.abs(randomized_predictions - baseline_values), axis=1))

That being said, since these are single samples, we don't even need the mean. Just in case, you can double check if np.sum(np.abs(randomized_predictions - baseline_values), axis=1) is a scalar.

ankur-tutlani commented 6 months ago

Thanks, that helps. However, there are few questions from intrinsic_causal_influence (population relevance) if you can answer, please.

I did not observe much difference in results from intrinsic_causal_influence when running with causal_model type as StructuralCausalModel or InvertibleStructuralCausalModel. Is this expected? Running for population only and not individual sample.
Ran two iterations using intrinsic_causal_influence with the default attribution_func (assuming its -estimate_entropy_of_probabilities(x) ) with categorical target node.

Iteration 1. Few nodes (say 5), I get some values say X1=0.5,X2=0.5,X3=0.4,X4=0.2,X5=0.1, Target (Unexplained) = 0.9
Iteration 2. With larger nodes (14) and edges and this includes the earlier nodes also (X1,X2, X3, X4 and X5). In iteration 2 I am getting similar value of Target ~0.85 – 0.9. But the values of other nodes (X1, X2…) are decreased significantly. This may be expected due to higher node count in Iteration 2.
Sum of values from X1 to X5 nodes (all nodes excluding Target ) in iteration 1 = 0.5+0.5+0.4+0.2+0.1 = 1.7. However, the sum of nodes excluding target node in Iteration 2 is significantly less around 0.3 (0.3 is the sum of all nodes excluding target in iteration 2). My understanding was with the larger count of nodes, the 1.7 contribution should spread to other nodes which are large in numbers in iteration 2, but the overall contribution (sum from other nodes) excluding target still should remain similar as in iteration1 since the value for Target node (unexplained) remained similar in both the iterations. What can explain this divergence in contribution from rest of the nodes except target? Please note there are some edges which are different in both these iterations for the nodes which are common (X1, X2, X3, X4, X5). But in both the iterations, all X features are directed towards target node. There are some other edges too which shows connection among X features.
1. If we specify the graph where all the feature nodes (X1, X2..) are connected only to target node. Say 5 features, then 5 edges in the graph from each of X feature nodes towards the target node, in that case majority of times I get the message that DAG is rejected (from gcm.evaluate_causal_model). What can explain this? Continuing from previous example, despite getting the message of DAG is rejected, I still ran intrinsic_causal_influence. In that case the contribution of target node (unexplained) increased substantially to ~ 2.4. And for rest of the nodes, the contribution became very small ~ 0.0001. Any explanation for this?

bloebp commented 6 months ago

I did not observe much difference in results from intrinsic_causal_influence when running with causal_model type as StructuralCausalModel or InvertibleStructuralCausalModel. Is this expected? Running for population only and not individual sample.

Generally, a StructuralCausalModel implies that the causal mechanism of a node has the form $Y := f(PA_Y, N_Y)$ ($PA_Y$ are the parents of $Y$), without further restrictions on the function $f$. An InvertibleStructuralCausalModel, on the other hand, assumes that the function $f$ is invertible with respect to the noise $N_Y$, i.e., the relationship between $(PA_Y, Y)$ and $N_Y$ is (strictly) monotonic. Seeing this, the StructuralCausalModel is more general, but does not ensure that you can estimate (point-wise) counterfactuals since you need to reconstruct the noise $N_Y$. Therefore, for all functions utilizing the noise (like ICC), you need an InvertibleStructuralCausalModel. That being said, since it is Python, you can also pass in a StructuralCausalModel object as long as the underlying causal mechanisms are invertible FCMs. I assume you are using the auto assignment function? In that case, we would use an additive noise model ($Y := f(PA_Y, N_Y) = g(PA_Y) + N_Y$) by default, which is invertible.

Long story short, InvertibleStructuralCausalModel is a special case of a StructuralCausalModel, and if you use the default (auto) assignments, both model classes will still use an additive noise model (i.e., it is invertible). And since Python does not enforce strict type matching, you can also pass a StructuralCausalModel model (which in this case when using the auto assignment is equivalent to an InvertibleStructuralCausalModel).

Sum of values from X1 to X5 nodes (all nodes excluding Target ) in iteration 1 = 0.5+0.5+0.4+0.2+0.1 = 1.7. However, the sum of nodes excluding target node in Iteration 2 is significantly less around 0.3 (0.3 is the sum of all nodes excluding target in iteration 2). My understanding was with the larger count of nodes, the 1.7 contribution should spread to other nodes which are large in numbers in iteration 2, but the overall contribution (sum from other nodes) excluding target still should remain similar as in iteration1 since the value for Target node (unexplained) remained similar in both the iterations. What can explain this divergence in contribution from rest of the nodes except target? Please note there are some edges which are different in both these iterations for the nodes which are common (X1, X2, X3, X4, X5). But in both the iterations, all X features are directed towards target node. There are some other edges too which shows connection among X features.

You are right that the whole sum (including the target) should remain the same with more variables. Just in case, double-check if the sum over all nodes with including the target also drastically changes. That being said, the ICC method estimates the contribution based on the provided GCM. Therefore, if you add/remove nodes/edges, the underlying causal mechanism will look different, and the larger the graph becomes, some model inaccuracies can happen when modeling each node. However, the sum over all contributions should still be comparable since, in the worst case where no upstream node has any influence on the target, it should all be attributed to the target's noise.

Two things here:

Can you re-run these setups using the confidence intervals with refitting the GCM? Here is an example: https://www.pywhy.org/dowhy/v0.11.1/user_guide/modeling_gcm/estimating_confidence_intervals.html#conveniently-bootstrapping-graph-training-on-random-subsets-of-training-data (needs to be adjusted to the gcm.intrinsic_causal_influence accordingly). This should reduce the model fitting randomness.
You can modify the parameters of the ICC method. It is currently tuned to be 'moderately fast' (still slow for larger graphs, of course). But that requires sampling a lot and using a heuristic for the Shapley value computation. Here, you can also try changing these parameters (e.g., set prediction_model to 'exact', increase num_samples_randomization and num_samples_baseline, etc.). This should give more accurate results.

If we specify the graph where all the feature nodes (X1, X2..) are connected only to target node. Say 5 features, then 5 edges in the graph from each of X feature nodes towards the target node, in that case majority of times I get the message that DAG is rejected (from gcm.evaluate_causal_model). What can explain this?

For the rejection, I guess it depends on how the ground truth graph looks like. Can you paste the whole message you get there? Or check the example here to investigate further which tests led to the rejection: https://www.pywhy.org/dowhy/v0.11.1/example_notebooks/gcm_falsify_dag.html

In that case the contribution of target node (unexplained) increased substantially to ~ 2.4. And for rest of the nodes, the contribution became very small ~ 0.0001. Any explanation for this?

Is it exactly the same data as in the previous example? Or a different one? So, even if you omit all links between the features and point them directly to the target, it should not become 0 everywhere when they had some (significant) non-zero contribution before. These contributions basically imply that they have no influence on the target and seem rather independent.

ankur-tutlani commented 6 months ago

Thanks much. I tried falsify_graph function and got below results. These results are from running below command assuming a categorical target node and feature nodes.

falsify_graph(causal_graph, data, plot_histogram=True,suggestions=True)

I am getting two different results. In one case (Iter 1) with p-value of 0.25, it says based on significance level (0.05), we reject the DAG. In Iter 1, there are edges connecting feature nodes to target and also some edges connecting features among themselves.

In another case (Iter 2), with p-value of 1, it says we do not reject the DAG. But in Iter 2, only direct edges (from feature node to target) are present. There is no edge connecting features among themselves.

Also, I have seen in the case of Iter 2, there is sometimes conflicting results from Falsification summary. Sometimes it says DAG is not informative, and sometimes it says DAG is informative. Something like below. Above it says DAG is not informative, below it says its informative.

Can we still proceed with Iter 2 DAG which only has direct edges from feature nodes to target if the objective is to get valid conclusions (ranking of features) from intrinsic_causal_influence (for population) and feature_relevance_sample (for individual samples)?

When we get the message of Do not reject DAG, do we still need to remove edges which are shown in suggestions? Will that have any impact on quality of results (either from iter 1 or iter 2) apart from may be higher run time with more edges?

When we get the message of reject DAG, can we still proceed with intrinsic_causal_influence and feature_relevance_sample provided overall KL divergence is low? Will there be any impact on the quality of results, meaning ranking of features explaining what is more / or less relevant? I checked two iterations where I only keep direct edges (from X nodes to target only) and when I keep both direct edges in addition to edges connecting features among themselves. I get the similar results in terms of sum of values from X nodes and target in both iterations. Target node contribution is similar, other nodes' contribution varies depending on the edges, which is expected, but different from 0 in both cases. You mentioned above that running feature_relevance_sample makes sense only when prediction model is causal (include nodes which are causal in nature) so does it mean when DAG is rejected, results from feature_relevance_sample won’t be as reliable?

Can we run intrinsic_causal_influence with all nodes (only direct edges from X to target) , and see the results (confidence intervals), and then run prediction model with only those features which showed significance from intrinsic_causal_influence results? And then pass that model object inside prediction_method in feature_relevance_sample? Can we rely on those results from intrinsic_causal_influence and feature_relevance_sample executing this way when the objective is to get some ranking both at population level and individual sample level? Assuming we would use L1 difference as the subset_scoring_func in feature_relevance_sample. For intrinsic_causal_influence, can we rely on the contribution from noise and non-noise (together) since the individual node’s contribution may vary depending on the graph and edges?

I have tried following functions inside subset_scoring_func of feature_relevance_sample. Both these functions provide scaler value for a single sample. Predictions are from predict_proba.

def l1difference(randomized_predictions: np.ndarray, baseline_values: np.ndarray) -> np.ndarray:
    return -np.mean(np.sum(np.abs(randomized_predictions - baseline_values), axis=1))

def l1difference_mod(randomized_predictions: np.ndarray, baseline_values: np.ndarray) -> np.ndarray:
    return -(np.sum(np.abs(randomized_predictions - baseline_values)))

I find in case of first function (with mean), it provides output in the range primarily from 0 to 1. While the second function provides output in larger numbers in the range from 500-1500 for few samples I checked. But the ranking is similar from both, meaning take the value of X1 and divide by sum (all X). In % terms it gives similar values. But I have few samples (~10% of dataset) where I get negative values. You mentioned above that in case of entropy we can take the absolute value and take % contribution. Does it hold for this function as well? Can we take absolute value of rows showing negative values and calculate the % contribution?

Is it exactly the same data as in the previous example? Or a different one?

You are correct, it was different one. My mistake. Sorry.

bloebp commented 6 months ago

Can we still proceed with Iter 2 DAG which only has direct edges from feature nodes to target if the objective is to get valid conclusions (ranking of features) from intrinsic_causal_influence (for population) and feature_relevance_sample (for individual samples)?

In case of only having directed edges, you might not even care about the correctness of the DAG (it is most likely wrong anyway due to missing links between the features). However, you do not necessarily need a "correct" DAG for feature_relevance_sample as long as the causal order is correct here (which only boils down to the target needing to be an effect of the features and not a cause). In case of intrinsic_causal_influence, it is more important.

Also, I have seen in the case of Iter 2, there is sometimes conflicting results from Falsification summary. Sometimes it says DAG is not informative, and sometimes it says DAG is informative.

Generally, these are based on random sampling and the minimum required number of permutations. You could increase the number of permutations n_permutations to, e.g., 100. This might give more consistent results. But as mentioned before, you might not need the falsification in the first place.

When we get the message of Do not reject DAG, do we still need to remove edges which are shown in suggestions? Will that have any impact on quality of results (either from iter 1 or iter 2) apart from may be higher run time with more edges?

Without knowing more about the specific DAG and data, this is hard to say. Generally, any modification can influence the results. If the edge is truly irrelevant, then it would improve the statistical problem.

When we get the message of reject DAG, can we still proceed with intrinsic_causal_influence and feature_relevance_sample provided overall KL divergence is low? Will there be any impact on the quality of results, meaning ranking of features explaining what is more / or less relevant? I checked two iterations where I only keep direct edges (from X nodes to target only) and when I keep both direct edges in addition to edges connecting features among themselves. I get the similar results in terms of sum of values from X nodes and target in both iterations. Target node contribution is similar, other nodes' contribution varies depending on the edges, which is expected, but different from 0 in both cases. You mentioned above that running feature_relevance_sample makes sense only when prediction model is causal (include nodes which are causal in nature) so does it mean when DAG is rejected, results from feature_relevance_sample won’t be as reliable?

For feature_relevance_sample, as mentioned above, it is more important that the target is not a cause of the features to have a valid causal interpretation. In the case of intrinsic_causal_influence, a wrong DAG could give certain nodes higher/lower contributions than they actually have. That being said, this is all from a theoretical perspective. If these links between features are rather weak, the difference between a correct and the simplified DAG is probably rather small. As a sanity check, you can randomly add/remove edges and see how the results (i.e. contributions) change. This gives you some insights into the sensitivity.

Can we run intrinsic_causal_influence with all nodes (only direct edges from X to target) , and see the results (confidence intervals), and then run prediction model with only those features which showed significance from intrinsic_causal_influence results? And then pass that model object inside prediction_method in feature_relevance_sample? Can we rely on those results from intrinsic_causal_influence and feature_relevance_sample executing this way when the objective is to get some ranking both at population level and individual sample level? Assuming we would use L1 difference as the subset_scoring_func in feature_relevance_sample. For intrinsic_causal_influence, can we rely on the contribution from noise and non-noise (together) since the individual node’s contribution may vary depending on the graph and edges?

This, again, boils down to what kind of quantity you are interested in. If you are purely interested in "which features are the most important ones for the prediction," then I would not use the intrinsic_causal_influence rankings at all and only look at the feature_relevance rankings. If your objective is to see which variables add new information to the target, then rather look at intrinsic_causal_influence. That being said, in the case of "All X point to Y directly and there are no other edges," there is probably not much difference between the rankings here. The difference is more significant in DAGs where the features are connected with each other.

I find in case of first function (with mean), it provides output in the range primarily from 0 to 1. While the second function provides output in larger numbers in the range from 500-1500 for few samples I checked. But the ranking is similar from both, meaning take the value of X1 and divide by sum (all X). In % terms it gives similar values. But I have few samples (~10% of dataset) where I get negative values. You mentioned above that in case of entropy we can take the absolute value and take % contribution. Does it hold for this function as well? Can we take absolute value of rows showing negative values and calculate the % contribution?

The first one with the mean should be the correct one. The second one is taking the sum over the absolute differences, but it is not estimating the L1 distance of the vectors. However, it could be that it doesn't make much difference ranking-wise.

When you run this on the whole population (100% of the dataset), are there still very few negative values?

ankur-tutlani commented 6 months ago

Thanks. I have tried few results with intrinsic_causal_influence to get population relevance and have got few questions.

I have observed results are very sensitive towards num_bootstrap_resamples and num_permutations (inside ShapleyConfig). I am using fit_and_compute with intrinsic_causal_influence. For num_bootstrap_resamples as 5, the median value shows 30% as contribution of target node (unexplained). But with same data, running with num_bootstrap_resamples as 20, the median value shows 56% as contribution of target node (unexplained). For both these results, num_permutations value is 10. Similar, is the case with num_permutations, results vary drastically. The default values for these parameters are set high, but that taking long time to run, hence running with lower values. I have categorical target and features with 10 categories in each and 8 nodes. Any recommendations you can give on the optimum parameters? I tried with bootstrap_sampling, but that results not matching with fit_and_compute in the sense the % unexplained contribution of target node is very different.

Another iteration I tried with say 10 features all directly connected to node Y (target). The unexplained contribution (median value from fit_and_compute) % is 67%. Tried another iteration with 7 features all directly connected to node Y (target), but now the unexplained contribution (median value from fit_and_compute) % is reduced to 30%. This is with num_permutations = 10 and num_bootstrap_resamples = 5 for both the iterations. What can explain this? The sum of raw values output from intrinsic_causal_influence is similar in both the cases, but the % distribution changes. I was expecting with reduced features, the unexplained % contribution should increase or stays constant? But it went down. Tried the same iteration with reduced features, but with higher num_bootstrap_resamples = 20, now the median % unexplained contribution is 56%. I am using prediction_model as “exact” for all iterations. What can explain this difference in results? What should be the approach to get some stable output?

bloebp commented 6 months ago

have observed results are very sensitive towards num_bootstrap_resamples and num_permutations (inside ShapleyConfig). I am using fit_and_compute with intrinsic_causal_influence. For num_bootstrap_resamples as 5, the median value shows 30% as contribution of target node (unexplained). But with same data, running with num_bootstrap_resamples as 20, the median value shows 56% as contribution of target node (unexplained). For both these results, num_permutations value is 10. Similar, is the case with num_permutations, results vary drastically. The default values for these parameters are set high, but that taking long time to run, hence running with lower values. I have categorical target and features with 10 categories in each and 8 nodes. Any recommendations you can give on the optimum parameters? I tried with bootstrap_sampling, but that results not matching with fit_and_compute in the sense the % unexplained contribution of target node is very different.

The issue here is that there are potentially 9^2 possible subsets (8 features + target itself), while num_permutations indicates that you only look at 10 of them, which can still yield very good insights, but is of course a very small sample size. So, I think to have less variance in the result, you should consider increasing num_permutations (or use the early stopping method). It is unfortunately slow, but aside from playing around with these parameters (e.g., you can also try to reduce num_noise_feature_samples and increase max_batch_size while increasing num_permutations), I don't really have a good suggestion.

That being said, if you increase the num_permutations to a high number (or even use the EXACT Shapley approximation technique), you can also avoid using the bootstrapping, which might also help with the runtime (in case the results have less variance).

Another iteration I tried with say 10 features all directly connected to node Y (target). The unexplained contribution (median value from fit_and_compute) % is 67%. Tried another iteration with 7 features all directly connected to node Y (target), but now the unexplained contribution (median value from fit_and_compute) % is reduced to 30%. This is with num_permutations = 10 and num_bootstrap_resamples = 5 for both the iterations. What can explain this? The sum of raw values output from intrinsic_causal_influence is similar in both the cases, but the % distribution changes. I was expecting with reduced features, the unexplained % contribution should increase or stays constant? But it went down. Tried the same iteration with reduced features, but with higher num_bootstrap_resamples = 20, now the median % unexplained contribution is 56%. I am using prediction_model as “exact” for all iterations. What can explain this difference in results? What should be the approach to get some stable output?

Your understanding is correct, the unexplained part (i.e., the contribution of the target) should increase (or stay the same) with fewer features. Seeing the small number of permutations, I am wondering if the variance of the computed values is simply too high here. If you have the time, maybe once try to run all permutations (e.g., choose EXACT for the ShapleyApproximationMethod in the ShapleyConfig) and see if these differences are still the same. It will probably take some time, but it might give an insight if these differences are simply due to the high variances of the estimated values.

ankur-tutlani commented 6 months ago

Your understanding is correct, the unexplained part (i.e., the contribution of the target) should increase (or stay the same) with fewer features. Seeing the small number of permutations, I am wondering if the variance of the computed values is simply too high here. If you have the time, maybe once try to run all permutations (e.g., choose EXACT for the ShapleyApproximationMethod in the ShapleyConfig) and see if these differences are still the same. It will probably take some time, but it might give an insight if these differences are simply due to the high variances of the estimated values.

I have tried two iterations with no bootstrapping (num_bootstrap_resamples = 1). This is using intrinsic_causal_influence function with “EXACT” shapley approximation method. prediction_model also is “exact” for both cases. Both iterations have all nodes as categorical (target as well as feature nodes). At the most, there are 10 categories in each of these categorical features. In both the iterations, only feature nodes are connected with target node. There is no node connected among features themselves. Input data is same for both the iterations.

In iteration1, there are 8 nodes (features) directly connected to target node. The target contribution towards target is 56%. In another iteration 2, there are 13 nodes (features) directly connected to target node. In this case, target contribution towards target is 77%. For both the iterations, the sum of raw values output across target node and feature nodes is similar. Is there anything which can explain the higher unexplained contribution in iteration 2 with higher number of nodes? Also, the run time for iteration 1 was ~30 mins, for iteration 2 was ~ 26.5 hours. Is it expected the run time to increase this much with the addition of 5 more features?

bloebp commented 6 months ago

In iteration1, there are 8 nodes (features) directly connected to target node. The target contribution towards target is 56%. In another iteration 2, there are 13 nodes (features) directly connected to target node. In this case, target contribution towards target is 77%. For both the iterations, the sum of raw values output across target node and feature nodes is similar. Is there anything which can explain the higher unexplained contribution in iteration 2 with higher number of nodes? Also, the run time for iteration 1 was ~30 mins, for iteration 2 was ~ 26.5 hours. Is it expected the run time to increase this much with the addition of 5 more features?

Wow, 26 hours is a lot. Thank you for taking the time to look into that. It seems the difference is still similar to before with the approximations being much faster (so we can go back to that).

Regarding the differences, are the 8 features in iteration 1 also 1:1 part of the 13 features for iteration 2? If they are exactly the same features and the target did not change, it is really surprising that the target contribution itself increases. Two points I can think of:

The learning problem became harder with the addition of the other 5 features, and there might not be enough training samples. How many training samples do you have?
Another point could be that since each node is categorical, we need many more 'baseline' (num_samples_baseline) and 'randomization' (num_samples_randomization) samples. This is because two "similar" input vectors (e.g., a single categorical value changes) can lead to drastically different outputs compared to numerical features where we typically assume that similar vectors lead to similar outputs. Can you try reducing the categorical values to around 2 per feature just to check? The performance will definitely decrease, but I am wondering if the difference between 8 vs. 13 features would be (much) smaller. Alternatively, you can incrase num_samples_baseline and num_samples_randomization, but I don't have a concrete number and it will also make it slower.

ankur-tutlani commented 5 months ago

Thanks. I have tried few more results.

Dataset size is about 580. Firstly, I started with at most 10 categories, and the results look different when tried with fit_and_compute with resamples as 5 versus when running intrinsic_causal_influence directly. Unexplained contribution is around 70% when running directly, but with fit_and_compute (5 resamples) the unexplained contribution decreases. Sometimes the results from intrinsic_causal_influence running again also varies meaning lower unexplained contribution by manually fitting on the entire dataset and storing the results. I am using EXACT shapley value method with default values of num_samples_baseline, max_batch_size and num_samples_randomization. I have another dataset with bigger size (~ 8000), there I observed opposite trend. Running directly with intrinsic_causal_influence shows lower unexplained contribution while with fit_and_compute shows higher unexplained contribution. In fit_and_compute I am using the same parameters as intrinsic_causal_influence (all default with shapley as EXACT). Although, this 580 dataset is not subset from 8000. I only have direct edges connected from individual feature node to target node. There are total 5 feature nodes and 1 target node all categorical.

Next, I reduced the categories to at most 4 categories for all nodes and re-ran. This time manually rerunning intrinsic_causal_influence on entire dataset gave consistent results majority of times. Similarly, when run inside fit_and_compute multiple times, it gave consistent results. But results vary drastically from intrinsic_causal_influence inside fit_and_compute versus running intrinsic_causal_influence without it. The fit_and_compute shows lower unexplained contribution as compared to when intrinsic_causal_influence is called directly. This is on the 580 dataset and using same parameters as above. The unexplained contribution is much higher with 4 categories (in % terms) when compared with 10 categories results. Although in raw values, the sum of all values (including target node) is lower with 4 categories (1.38) as compared to 10 categories (2.29). How to go about in this case? Which results should be trusted? Although results from intrinsic_causal_influence look more in alignment with domain/business understanding which show higher unexplained contribution.

Above results are with 5 features. Now I reduced features to 3 removing 2 features. I kept the max categories to 4 in all nodes including the target node. With 4 categories, the unexplained contribution looks similar to when there were 5 features with 4 categories in them. Although its not matching when there were 5 features with at most 10 categories in them. Here too, results look similar when manually rerun with entire dataset majorly. But in some cases results do differ when re-run meaning showing lower unexplained contribution. When these differ, the execution time is also lower. Anything which can explain this behaviour? Here also using same default parameters as above using EXACT shap value.

I replicated the same thing with fit_and_compute and 5 resamples. Here also pattern is similar. Unexplained contribution is lower as compared to running without fit_and_compute. But here results vary. The unexplained median contribution differs significantly from the unexplained median contribution when there were 5 features. Unexplained median contribution is higher now as compared to when there were 5 features. Both having 4 categories at the max.

What should be your recommendation to go about assuming looking at unexplained and factors contribution in % is the goal?

bloebp commented 5 months ago

Dataset size is about 580.

That seems fairly low with this amount of many categorical values. It might also explain the high variance between two runs (reflected in the resample). Generally, bootstrapping re-fits the models and re-runs the algorithm, which should give better insights into how the "true" values look.

Next, I reduced the categories to at most 4 categories for all nodes and re-ran. This time manually rerunning intrinsic_causal_influence on entire dataset gave consistent results majority of times. Similarly, when run inside fit_and_compute multiple times, it gave consistent results. But results vary drastically from intrinsic_causal_influence inside fit_and_compute versus running intrinsic_causal_influence without it.

The variance is expected as fit_and_compute aims at reducing the variance between calls. Running intrinsic_causal_influence can vary significantly around the true values, especially in this underspecified (i.e., bad model fit) situation. The fit_and_compute might be the better route here.

The unexplained contribution is much higher with 4 categories (in % terms) when compared with 10 categories results.

That is expected as it becomes much more difficult to explain the target with only 4 categories compared to 10 categories. The question was rather, does having 13 variables with 4 categories still have more unexplained than with 7 variables and 4 categories? Here, we would expect the opposite.

Although in raw values, the sum of all values (including target node) is lower with 4 categories (1.38) as compared to 10 categories (2.29). How to go about in this case? Which results should be trusted? Although results from intrinsic_causal_influence look more in alignment with domain/business understanding which show higher unexplained contribution.

I would probably not take the raw values directly here but rather look at the "rankings." For instance, you can have a really bad model that just always returns a constant, then your variance is even 0, but the model itself is bad. So, reducing the categories would definitely lead to worse predictions, but it was more about seeing if we get more consistent results with respect to the variance between runs. If they are more consistent, then it is a good indicator that there are too few training samples (since having 4 categories needs less data than having 10). But this does not imply that the model with 4 categories is better than the one with 10.

Above results are with 5 features. Now I reduced features to 3 removing 2 features. I kept the max categories to 4 in all nodes including the target node. With 4 categories, the unexplained contribution looks similar to when there were 5 features with 4 categories in them. Although its not matching when there were 5 features with at most 10 categories in them. Here too, results look similar when manually rerun with entire dataset majorly. But in some cases results do differ when re-run meaning showing lower unexplained contribution. When these differ, the execution time is also lower. Anything which can explain this behaviour? Here also using same default parameters as above using EXACT shap value.

It could be that in the cases where it is slower, the assigned models are different (e.g., a non-linear model instead of a linear one). You can inspect the prediction model of the causal mechanisms of the nodes to see what changes there. I highly suspect that it uses linear models in the "faster" cases and non-linear models in the slow cases. You can also assign linear models manually to enforce more consistency: https://www.pywhy.org/dowhy/v0.11.1/user_guide/modeling_gcm/customizing_model_assignment.html

E.g., causal_model.set_causal_mechanism('Y', ClassifierFCM(classifier_model=create_logistic_regression_classifier())) in case of a categorical node. Note that you only need to do this for nodes that have parents (i.e., non-root nodes).

So, my general recommendation is: Maybe try a linear model (in your case where everything is categorical, with a logistic regression model) and try to get as much data as possible.

ankur-tutlani commented 5 months ago

I have tried with logistic function with following two functions. First function is for smaller dataset, and second function for larger dataset. I tried two iterations, with resamples count as 5 or 10. I tried with when there are 5 factors versus when there are 2 factors. Also, when they have 10 categories at the most versus 4 categories at the most. It seems results more consistent with larger dataset and a smaller number of categories. But what if these conditions do not satisfy? Any recommendations assuming the end goal is to get some estimation of unexplained contribution and factors ranking with reasonable accuracy?

scm = gcm.InvertibleStructuralCausalModel(causal_graph)
auto_assignment_summary = gcm.auto.assign_causal_mechanisms(scm, data1_imputed, override_models=True, quality=gcm.auto.AssignmentQuality.BETTER)

from dowhy.gcm import ClassifierFCM
from dowhy.gcm.ml import create_logistic_regression_classifier
scm.set_causal_mechanism('Y', ClassifierFCM(classifier_model=create_logistic_regression_classifier(max_iter=1000)))

For 588 dataset:

strength_median, strength_intervals = gcm.confidence_intervals(
    gcm.fit_and_compute(gcm.intrinsic_causal_influence,
                        scm,
                        target_node='Y',
                        bootstrap_training_data=data1_imputed,
                        shapley_config=gcm.shapley.ShapleyConfig(num_permutations=10,n_jobs=-1),
                        auto_assign_quality=gcm.auto.AssignmentQuality.BETTER,
                        num_samples_randomization=50,
                        num_samples_baseline=100,
                        max_batch_size=50,
                        prediction_model="exact",
                        bootstrap_data_subset_size_fraction=0.75),
    num_bootstrap_resamples=5,
    n_jobs=1
    )

For 8418 dataset:

strength_median, strength_intervals = gcm.confidence_intervals(
    gcm.fit_and_compute(gcm.intrinsic_causal_influence,
                        scm,
                        target_node='Y',
                        bootstrap_training_data=data1_imputed,
                        shapley_config=gcm.shapley.ShapleyConfig(num_permutations=10,n_jobs=-1),
                        auto_assign_quality=gcm.auto.AssignmentQuality.BETTER,
                        num_samples_randomization=500,
                        num_samples_baseline=1000,
                        max_batch_size=250,
                        prediction_model="exact",
                        bootstrap_data_subset_size_fraction=0.75),
    num_bootstrap_resamples=5,
    n_jobs=1
    )

bloebp commented 5 months ago

I unfortunately don't have any further recommendations aside from keeping a few things in mind:

Many categorical values make the ML problem harder, i.e., the models become worse. You could also try much more powerful models (currently, we rather use standard scikit-learn models), but this would also make the computations much slower. For instance, you can try using https://auto.gluon.ai/ as the prediction models.
On the other hand, optimally, you shouldn't remove any categorical values as you will just lose information in theory, but it can make the ML problem easier. If you want to drop categories, make sure to drop the "less important ones" or even introduce a dummy category where you map multiple categories to (not sure if this makes sense in your case, however).
To speed up the bootstraps, you should also add approximation_method = ShapleyApproximationMethods.PERMUTATION as a parameter to the ShapleyConfig.
Generally, more bootstrap runs should make it more consistent and should give more "trustworthy" results. But it will unfortunately make it slower.
Categorical data makes the ICC computation very slow (I am currently looking into improving this, hopefully for the next release). I don't think that continuous data is possible in your case, but if you have a continuous representation that makes sense, then it would be faster and more consistent. Just don't simply convert the categorical values to numerical ones, that would not make much sense (except if there is a specific order).

github-actions[bot] commented 4 months ago

This issue is stale because it has been open for 14 days with no activity.