pm4py / pm4py-core

Public repository for the PM4Py (Process Mining for Python) project.
https://pm4py.fit.fraunhofer.de
GNU General Public License v3.0

Footprints issue #337

Closed by rommeldias 2 years ago

rommeldias commented 2 years ago

Hi,

I have a question about footprints. I have an event log whose Petri net is discovered by the inductive miner with noise_threshold set to 0.0.

I use the following code to discover the precedence relations between activities in both the event log and the Petri net:

    fp_net = footprints_discovery.apply(net, im, fm)
    fp_log = footprints_discovery.apply(event_log, variant=footprints_discovery.Variants.ENTIRE_EVENT_LOG)
    precedence_net = fp_net['sequence']
    precedence_log = fp_log['sequence']

Shouldn't (precedence_log - precedence_net) be an empty set? Since the fitness of the inductive miner with no noise threshold is always 100%, I would expect every precedence relation in the event log to also be observed in the Petri net.

In my code, however, (precedence_log - precedence_net) returns a non-empty set.

Thank you very much. Rommel.

fit-alessandro-berti commented 2 years ago

Good morning. Thanks for the question.

I would invite you to check whether you are already using the 2.2.23 release. In this release, we fixed a minor problem with footprints discovery on process trees and added support for a new footprints visualizer (comparison).

If a sequence A->B is allowed by the log, then it should also be allowed by the model discovered by the inductive miner. However, the model may allow more behavior; e.g., A and B could be in a flower loop, so that both A->B and B->A are possible. In this example, A->B would belong to "sequence" in the log but to "parallel" in the model. So, when comparing the two footprints, the best choice is to take the union of "sequence" and "parallel" for the log and for the model, and then compute the difference between the two unions.
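The effect described above can be reproduced without pm4py. The following is a minimal sketch using hand-written toy footprints; the two dictionaries are hypothetical data that only mimic the "sequence" and "parallel" keys, and the helper name `allowed` is an assumption made for illustration.

```python
# Toy footprints (hypothetical, hand-written data; not produced by pm4py).
# Log: A is always followed by B, and B by C.
fp_log = {
    "sequence": {("A", "B"), ("B", "C")},
    "parallel": set(),
}
# Model: A and B ended up in a loop, so A->B and B->A are both allowed.
fp_model = {
    "sequence": {("A", "C"), ("B", "C")},
    "parallel": {("A", "B"), ("B", "A"), ("A", "A"), ("B", "B")},
}


def allowed(fp):
    """All directly-follows pairs a footprint allows, ignoring their classification."""
    return fp["sequence"] | fp["parallel"]


# Comparing only "sequence" reports A->B as missing, although the model allows it.
naive_diff = fp_log["sequence"] - fp_model["sequence"]
# Comparing the unions shows that nothing observed in the log is forbidden by the model.
union_diff = allowed(fp_log) - allowed(fp_model)

print(naive_diff)  # {('A', 'B')}
print(union_diff)  # set()
```

The non-empty `naive_diff` comes purely from A->B being classified differently on the two sides, not from the model rejecting it.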

Alternatively, you could use the new visualizer, as in the example below, to inspect the differences between the two footprints in detail.

It might also be that the inductive miner has a bug. In that case, I would suggest computing alignment-based fitness, which is the standard (but slow) method to verify whether the model allows all the behavior contained in the log.
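A possible way to run the alignment-based check is sketched below, assuming the simplified interface functions `pm4py.discover_petri_net_inductive` and `pm4py.fitness_alignments` are available in the installed release. The wrapper name `alignment_fitness` is hypothetical, and the keys of the returned dictionary differ between releases, so the sketch returns it unchanged.

```python
def alignment_fitness(xes_path):
    """Discover a Petri net with the inductive miner and return pm4py's
    alignment-based fitness summary (a dictionary) for the same log."""
    # Imported lazily so that this sketch can be loaded without pm4py installed.
    import pm4py

    log = pm4py.read_xes(xes_path)
    # Default settings of the inductive miner (no noise filtering is requested here).
    net, im, fm = pm4py.discover_petri_net_inductive(log)
    # Alignments are exact but can be slow on large logs or large models.
    return pm4py.fitness_alignments(log, net, im, fm)


# Example call (path as used elsewhere in this thread):
# print(alignment_fitness("C:/running-example.xes"))
```

If the reported fitness is below 100% for a model discovered without noise filtering, that would point to a problem in discovery rather than in the footprints comparison.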

Example code:

    import pm4py
    from pm4py.algo.discovery.footprints import algorithm as footprints_discovery
    from pm4py.visualization.footprints import visualizer

    # Discover a process tree from the log with the inductive miner
    log = pm4py.read_xes("C:/running-example.xes")
    tree = pm4py.discover_process_tree_inductive(log)

    # Footprints of the log and of the discovered tree
    fp_log = footprints_discovery.apply(log, variant=footprints_discovery.Variants.ENTIRE_EVENT_LOG)
    fp_tree = footprints_discovery.apply(tree, variant=footprints_discovery.Variants.PROCESS_TREE)

    # Relations allowed by the log but not by the model ("sequence" and "parallel" merged)
    diff = (fp_log["sequence"].union(fp_log["parallel"])).difference(fp_tree["sequence"].union(fp_tree["parallel"]))

    # Visual comparison of the two footprints
    gviz = visualizer.apply(fp_log, fp_tree, variant=visualizer.Variants.COMPARISON_SYMMETRIC, parameters={"format": "svg"})
    visualizer.view(gviz)

rommeldias commented 2 years ago

Hi, Alessandro. First of all, thank you very much for the quick answer! I have just upgraded pm4py to the 2.2.23 release.

> If a sequence A->B is allowed by the log, then it should also be allowed by the model discovered by inductive miner. However, the model discovered by inductive miner may allow for more behavior, e.g., A and B could be in a flower-loop, hence both A->B and B->A are possible. In this example, A->B would belong to "sequence" in the log but to "parallel" in the model. So when computing the difference between the two footprints, the best choice is to perform the union between "sequence" and "parallel" of the log, and of the model, and compute the difference between them.

Thank you for this insight! Just one question: if the log contains only the sequence A->B, then the model would never allow B->A, right? I assumed the model would allow more behavior only if the log does too.

> In alternative, you could exploit the new visualizer as in the example below, to observe in detail the differences between the two footprints.
>
> It might also be that the inductive miner has some bug. In this case, I would invite you to perform alignment-based fitness which is the standard (but slow) method to verify if the model allows for all the behavior contained in the log.

Thanks! I will check it out.

fit-alessandro-berti commented 2 years ago

Dear rommeldias, unfortunately process discovery algorithms are imprecise. Models discovered by the inductive miner generally have low precision, which means that they allow much more behavior than what is observed in the log. To give an idea of the imprecision, for models automatically discovered by IM on some logs, the model allows 10 times more behavior than the log (e.g., 1000 allowed footprints against the 100 in the log). So it is possible that only A->B is observed in the log and yet, because of the other directly-follows relations, the inductive miner places A and B in a flower loop, which also allows B->A.
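To make the size gap tangible, here is a small pure-Python sketch. The `footprints` helper is a simplified, hypothetical stand-in for footprints discovery on a log (it is not pm4py's implementation), and the flower model is used as a worst case; it is not claimed to be what the inductive miner would return for this particular toy log.

```python
from itertools import product


def footprints(traces):
    """Simplified footprints: directly-follows pairs split into sequence and parallel."""
    dfg = {(a, b) for trace in traces for a, b in zip(trace, trace[1:])}
    # A pair is "parallel" when it is observed in both directions.
    parallel = {(a, b) for (a, b) in dfg if (b, a) in dfg}
    return {"sequence": dfg - parallel, "parallel": parallel}


# Toy log with two variants (hypothetical data).
log = [("A", "B", "C", "D"), ("A", "C", "B", "D")]
fp_log = footprints(log)
log_pairs = fp_log["sequence"] | fp_log["parallel"]

# A flower model over the same activities lets any activity follow any other.
activities = sorted({a for trace in log for a in trace})
flower_pairs = set(product(activities, repeat=2))

print(len(log_pairs), len(flower_pairs))  # 6 16
print(("B", "A") in log_pairs, ("B", "A") in flower_pairs)  # False True
```

Every pair seen in the log is allowed by the flower model, but the model also allows pairs such as B->A that never occur in the log, which is exactly the low-precision situation described above.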

rommeldias commented 2 years ago

Thank you, Alessandro!