Add new GCM model evaluation module

This module adds a new method for evaluating a fitted GCM. It evaluates the performance of the causal mechanisms, the underlying modeling assumptions (where possible), the goodness of the generated joint distribution, and the graph structure. It reuses some of the existing methods and also introduces new ones.

This also adds new user guide and notebook entries demonstrating the usage.

Some of these changes required modifications in other modules as well, mostly improvements and fixes.

Example output of the evaluation method:

Evaluated the performance of the causal mechanisms and the invertibility assumption of the causal mechanisms and the overall KL divergence between generated and observed distribution and graph structure. The results are as follows:

==== Evaluation of Causal Mechanisms ====
Root nodes are evaluated based on the KL divergence between the generated and the observed distribution.
Non-root nodes are evaluated based on the Continuous Ranked Probability Score (CRPS), which is a generalizes the Mean Absolute Percentage Error to probabilistic predictions. Since the causal mechanisms produce conditional distributions, this should give some insights into their performance and calibration. However, note that many algorithms are still relatively robust against poor model performances.

--- Node X: The KL divergence between generated and observed distribution is 0.023392158937329276.
The estimated KL divergence indicates an overall very good representation of the data distribution.

--- Node Y: The CRPS of this node is 0.5756805641901143.
The estimated CRPS indicates a fair model performance. Note, however, that a high CRPS could also result from a small signal to noise ratio.
The mechanism is better or equally good than all 6 baseline mechanisms.

--- Node Z: The CRPS of this node is 0.5564727188864227.
The estimated CRPS indicates a fair model performance. Note, however, that a high CRPS could also result from a small signal to noise ratio.
The mechanism is better or equally good than all 6 baseline mechanisms.

==== Evaluation of Invertible Functional Causal Model Assumption ====

--- The model assumption for node Y is not rejected with a p-value of 1.0 (after potential adjustment) and a significance level of 0.05.
This implies that the model assumption might be valid.

--- The model assumption for node Z is not rejected with a p-value of 1.0 (after potential adjustment) and a significance level of 0.05.
This implies that the model assumption might be valid.

Note that these results are based on statistical independence tests, and the fact that the assumption was not rejected does not necessarily imply that it is correct. There is just no evidence against it.

==== Evaluation of Generated Distribution ====
The overall KL divergence between the generated and observed distribution is 0
The estimated KL divergence indicates an overall very good representation of the data distribution.

==== Evaluation of the Causal Graph Structure ====
+-------------------------------------------------------------------------------------------------------+
|                                         Falsificaton Summary                                          |
+-------------------------------------------------------------------------------------------------------+
| The given DAG is not informative because 2 / 6 of the permutations lie in the Markov                  |
| equivalence class of the given DAG (p-value: 0.33).                                                   |
| The given DAG violates 0/1 LMCs and is better than 66.7% of the permuted DAGs (p-value: 0.33).        |
| Based on the provided significance level (0.05) and because the DAG is not informative,               |
| we do not reject the DAG.                                                                             |
+-------------------------------------------------------------------------------------------------------+

==== NOTE ====
Always double check the made model assumptions with respect to the graph structure and choice of causal mechanisms.
All these evaluations give some insight into the goodness of the causal model, but should not be overinterpreted, since some causal relationships can be intrinsically hard to model. Furthermore, many algorithms are fairly robust against misspecifications or poor performances of causal mechanisms.
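For readers unfamiliar with the CRPS values reported for the non-root nodes above, here is a minimal sketch of the standard sample-based CRPS estimate, CRPS(F, y) = E|X - y| - 0.5 * E|X - X'|, where X and X' are independent draws from the predicted distribution F and y is the observed value. It is not the implementation added in this PR; the function name `crps_from_samples` is hypothetical, and any normalization or averaging the module applies is not reproduced here.

```python
def crps_from_samples(samples, observation):
    """Estimate the CRPS of a predictive distribution from its samples.

    Hypothetical helper for illustration only (not part of dowhy).
    Uses the identity CRPS = E|X - y| - 0.5 * E|X - X'|.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("at least one sample is required")

    # Average distance between the predictive samples and the observation.
    distance_to_observation = sum(abs(x - observation) for x in samples) / n

    # Average pairwise distance between predictive samples (spread term).
    spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)

    return distance_to_observation - 0.5 * spread


# A point prediction has no spread, so the CRPS is the absolute error.
point_score = crps_from_samples([2.0], 5.0)

# Two equally likely predictions around the observed value.
spread_score = crps_from_samples([0.0, 2.0], 1.0)
```

Lower values are better. For a single-sample (deterministic) prediction the spread term vanishes and the score equals the absolute error, which is why the CRPS can be read as an error measure for probabilistic predictions.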

py-why / dowhy

Add new GCM model evaluation module #1051