pm4py / pm4py-core

Public repository for the PM4Py (Process Mining for Python) project.
https://pm4py.fit.fraunhofer.de
GNU General Public License v3.0

Support on Root Cause analysis #450

Closed NKMatha closed 11 months ago

NKMatha commented 12 months ago

Hi team,

I am not clear on the decision trees generated for root cause analysis. Can someone please explain what they represent for my data? (screenshots attached)

Thanks.

fit-alessandro-berti commented 12 months ago

For the "decision tree" visualization of scikit-learn, you should generally interpret it as follows:

  1. Nodes: These are the "decision" points where the data is split.
  2. Root Node: This is the topmost node of the tree where the first split is made.
  3. Internal Node: These are the nodes where subsequent splits happen.
  4. Leaf Node/Terminal Node: Nodes at the bottom of the tree where a prediction is made, with no further splitting.
  5. Edges or Branches: These are the lines that connect nodes, representing the flow from one question to the next.
  6. Splitting Criteria: This is the condition based on which the split has been made at a particular node. It usually involves a feature and a threshold value.
  7. Gini Impurity: A measure of impurity at the node. Gini impurity is 0 when all samples at the node belong to the same class. Its maximum depends on the number of classes: 0.5 for binary classification, and approaching 1 − 1/k for k classes when samples are spread evenly across them.
  8. Samples: The number of observations in the dataset that reach that node.
  9. Value: The class distribution of the samples at that node. For example, if you are predicting three different classes, the value might look like [34, 10, 5], which means there are 34 samples of class 1, 10 samples of class 2, and 5 samples of class 3.
  10. Class: The class that is the most prevalent in that node. In a leaf node, this represents the predicted class for samples that end up in that leaf.
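The points above can be seen on a minimal example. This is a sketch with made-up data (the feature name `case_duration` and the 0/1 deviance labels are purely illustrative, not from your log): a small scikit-learn decision tree is trained and printed in text form, so you can match the splitting criterion, samples, and class to the items in the list.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: one feature (case duration in days), binary target
# (0 = normal case, 1 = deviant case). Purely illustrative values.
X = [[1], [2], [3], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Text rendering of the tree: each line shows a splitting criterion
# (feature + threshold, item 6) or a leaf with its predicted class (item 10).
print(export_text(clf, feature_names=["case_duration"]))

# The same quantities drawn in the graphical visualization are accessible
# on the fitted tree: Gini impurity (item 7), samples (item 8), value (item 9).
print("root gini:", clf.tree_.impurity[0])        # 0.5 for a balanced binary root
print("root samples:", clf.tree_.n_node_samples[0])
print("root value:", clf.tree_.value[0])
```

In the root cause analysis setting, the features are the case-level attributes of the log and the target is typically whether the case is "deviant" according to the chosen criterion; a leaf with a dominant deviant class then points at the feature conditions associated with that outcome.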
fit-alessandro-berti commented 12 months ago

The case-level features of pm4py are generally computed by one-hot encoding categorical attributes and reporting the last recorded value of numeric attributes.
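To illustrate what that encoding does (a sketch using plain pandas on a hypothetical event log fragment, not the pm4py internals; the case IDs, activities, and amounts are made up):

```python
import pandas as pd

# Hypothetical event log: one row per event, standard XES-style columns.
df = pd.DataFrame({
    "case:concept:name": ["c1", "c1", "c2", "c2"],
    "concept:name": ["register", "pay", "register", "reject"],
    "amount": [100, 120, 50, 70],
})

# One-hot encode the categorical attribute: per case, a 0/1 column for
# each activity that occurred at least once.
onehot = (pd.get_dummies(df["concept:name"])
            .groupby(df["case:concept:name"]).max())

# Numeric attribute: keep the last value observed in each case.
last_amount = df.groupby("case:concept:name")["amount"].last()

features = onehot.join(last_amount)
print(features)
```

The resulting table (one row per case) is the kind of matrix the decision tree is trained on, which is why the splitting criteria in the visualization refer to activity-presence columns and last numeric values.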

NKMatha commented 11 months ago

Thanks @fit-alessandro-berti