n14b commented 1 year ago

We have a large dataset divided over multiple files. How can we generate nets for data distributed over multiple files?

fit-alessandro-berti commented 1 year ago

Please find an example below. You will need to replace the list of paths with the paths to your own Parquet files.

import pm4py
import pandas as pd
from collections import Counter
from pm4py.objects.log.util import dataframe_utils
from pm4py.objects.dfg.obj import DFG

paths_to_parquets = ["C:/roadtraffic.parquet", "C:/roadtraffic.parquet"]
overall_paths = Counter()
overall_start_activities = Counter()
overall_end_activities = Counter()
for file_path in paths_to_parquets:
    dataframe = pd.read_parquet(file_path)

    dataframe = dataframe_utils.legacy_parquet_support(dataframe)

    # use the following if the case ID, activity or timestamp columns in the Parquet
    # are not standard
    dataframe = pm4py.format_dataframe(dataframe)
    paths, start_act, end_act = pm4py.discover_dfg(dataframe)
    for pa in paths:
        overall_paths[pa] += paths[pa]
    for sa in start_act:
        overall_start_activities[sa] += start_act[sa]
    for ea in end_act:
        overall_end_activities[ea] += end_act[ea]

dfg_object = DFG(overall_paths, overall_start_activities, overall_end_activities)
print(overall_paths, overall_start_activities, overall_end_activities)
process_tree = pm4py.discover_process_tree_inductive(dfg_object)
print(process_tree)
petri_net, initial_marking, final_marking = pm4py.convert_to_petri_net(process_tree)
pm4py.view_petri_net(petri_net, initial_marking, final_marking, format="svg")

First, the DFGs of the individual Parquet files are summed. Then, a process tree is discovered from the resulting DFG object. Finally, the process tree is converted to a Petri net, which is then visualized.
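
The summing step can also be looked at in isolation. The following is only a minimal sketch on toy data: `merge_dfgs` is a hypothetical helper name (it is not part of pm4py), and the two triples merely imitate the shape of what `pm4py.discover_dfg` returns. Note that adding the counts like this only gives the DFG of the full log if every case is completely contained in a single file; if the events of a case are spread over several files, the directly-follows relations across the file boundary are lost and the start/end activities are counted too often.

```python
from collections import Counter


def merge_dfgs(partial_dfgs):
    """Sum per-file DFG triples (paths, start activities, end activities).

    Hypothetical helper for illustration; not a pm4py function.
    """
    overall_paths = Counter()
    overall_start = Counter()
    overall_end = Counter()
    for paths, start_act, end_act in partial_dfgs:
        # Counter.update adds the counts instead of overwriting them
        overall_paths.update(paths)
        overall_start.update(start_act)
        overall_end.update(end_act)
    return overall_paths, overall_start, overall_end


# Toy per-file results (made-up values, same shape as discover_dfg output)
file_1 = ({("A", "B"): 2, ("B", "C"): 2}, {"A": 2}, {"C": 2})
file_2 = ({("A", "B"): 1, ("B", "D"): 1}, {"A": 1}, {"D": 1})

paths, start_act, end_act = merge_dfgs([file_1, file_2])
print(paths[("A", "B")], start_act["A"], dict(end_act))
```

The three merged counters can then be used in place of `overall_paths`, `overall_start_activities` and `overall_end_activities` in the example above.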

n14b commented 1 year ago

Can we use the heuristics miner instead of the inductive miner?

pm4py / pm4py-core

How to use pm4py for large datasets divided over multiple files? #440
