Closed aryadegari closed 1 month ago
I realized that after calling the pm4py.format_dataframe function, the resulting dataframe should remain intact. The suggested solution remains valid, though, because grouped_df[activity_key].first() only gives valid results if grouped_df[activity_key] is sorted by timestamp.
Another solution, however, is to call pm4py.format_dataframe on the given dataframe at the beginning of the functions.
Problem:
The functions that get start or end activities return wrong results if the Pandas dataframe passed to them is not sorted by timestamp. The API links to the implementations are these:
Here I focus on getting start activities as an example.
The problem is with this line, which assumes that grouped_df[activity_key] is sorted, which is not necessarily the case:
startact_dict = dict(Counter(grouped_df[activity_key].first().to_numpy().tolist()))
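To illustrate the behavior described above, here is a minimal sketch in plain pandas rather than pm4py. The two-case log, the column names and the helper start_activities are assumptions made for this example; only the dict(Counter(...)) line is taken from the issue.

```python
from collections import Counter

import pandas as pd

# Hypothetical two-case log, deliberately NOT sorted by timestamp:
# in case c1 the later event "B" is stored before the earlier event "A".
df = pd.DataFrame({
    "case:concept:name": ["c1", "c1", "c2", "c2"],
    "concept:name": ["B", "A", "A", "C"],
    "time:timestamp": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-01 09:00",
        "2024-01-02 09:00", "2024-01-02 10:00",
    ]),
})


def start_activities(frame, case_id_key="case:concept:name",
                     activity_key="concept:name"):
    # Illustrative helper (not the pm4py function): group by case and
    # apply the line quoted above, which takes the first stored row per case.
    grouped_df = frame.groupby(case_id_key)
    return dict(Counter(grouped_df[activity_key].first().to_numpy().tolist()))


# first() follows row order, so the unsorted frame reports "B" for case c1.
unsorted_result = start_activities(df)

# Sorting by timestamp first yields the true start activity of each case.
sorted_result = start_activities(df.sort_values("time:timestamp"))

print(unsorted_result)
print(sorted_result)
```

On the unsorted frame, case c1 is counted as starting with B even though A has the earlier timestamp; after sorting, both cases are counted as starting with A.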
Replication Code:
Suggested Solution(s):
I suggest sorting the dataframe by timestamp before getting the first item of the grouped rows. I also suggest that the returned activities dictionary be sorted by frequency in descending order, by adding this line of code at the end of the functions:
sorted_startact_dict = dict(sorted(startact_dict.items(), key=lambda x: x[1], reverse=True))
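Both suggestions can be combined as in the following sketch, which uses plain pandas and is not the pm4py implementation. The function name start_activities_by_frequency, the default column names and the sample log are assumptions chosen for illustration.

```python
from collections import Counter

import pandas as pd


def start_activities_by_frequency(df,
                                  case_id_key="case:concept:name",
                                  activity_key="concept:name",
                                  timestamp_key="time:timestamp"):
    # Hypothetical helper: sort by timestamp first so that first() really
    # returns the earliest event of each case. mergesort is stable, so
    # events with equal timestamps keep their original relative order.
    ordered = df.sort_values(timestamp_key, kind="mergesort")
    grouped_df = ordered.groupby(case_id_key, sort=False)
    startact_dict = dict(Counter(grouped_df[activity_key].first().to_numpy().tolist()))
    # Return the activities ordered by descending frequency.
    return dict(sorted(startact_dict.items(), key=lambda x: x[1], reverse=True))


# Hypothetical log with three cases, stored out of timestamp order.
log = pd.DataFrame({
    "case:concept:name": ["c1", "c1", "c2", "c2", "c3"],
    "concept:name": ["D", "C", "B", "A", "A"],
    "time:timestamp": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-01 09:00",
        "2024-01-01 10:00", "2024-01-01 09:00",
        "2024-01-01 09:00",
    ]),
})

result = start_activities_by_frequency(log)
print(result)
```

Sorting a copy inside the function leaves the caller's dataframe untouched, at the cost of one extra sort per call.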
The new implementation of get_start_activities will look like this: