pm4py / pm4py-core

Public repository for the PM4Py (Process Mining for Python) project.
GNU General Public License v3.0

Get start and end activities buggy with Pandas dataframe #500

Open aryadegari opened 2 weeks ago

aryadegari commented 2 weeks ago


The functions that get start or end activities return wrong results when the Pandas dataframe passed to them is not sorted by timestamp. The API links to the implementations are these:

Here I focus on getting start activities as an example.

The problem is with the following line, which assumes that the rows in each group of grouped_df[activity_key] are already sorted by timestamp. That is not always the case:
startact_dict = dict(Counter(grouped_df[activity_key].first().to_numpy().tolist()))
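To make the ordering assumption concrete, here is a minimal pandas-only sketch (no pm4py involved). The two-row dataframe is illustrative data, not taken from a real log. It shows that first() follows row order rather than timestamp order.

```python
import pandas as pd

# Two events of the same case, stored out of chronological order.
df = pd.DataFrame({
    "case:concept:name": ["1", "1"],
    "concept:name": ["Start", "Activity A-Prestart"],
    "time:timestamp": pd.to_datetime(["2023-08-30 08:10:00", "2023-08-30 08:00:00"]),
})

# first() takes the first row of each group in dataframe order,
# so it picks "Start" even though that event happened later.
first_unsorted = df.groupby("case:concept:name")["concept:name"].first().tolist()

# After sorting by timestamp, first() picks the chronologically earliest event.
sorted_df = df.sort_values("time:timestamp")
first_sorted = sorted_df.groupby("case:concept:name", sort=False)["concept:name"].first().tolist()

print(first_unsorted)  # ['Start']
print(first_sorted)    # ['Activity A-Prestart']
```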

Replication Code:

# Step 1: Install required libraries [pandas=2.2.2, pm4py=2.7.11]

import pandas as pd
import pm4py
from pm4py.objects.log.util import dataframe_utils
from pm4py.objects.conversion.log import converter as log_converter

# Step 2: Create a sample DataFrame representing an event log
data = {
    'case:concept:name': ['1', '1', '1', '2', '2', '3', '3', '3'],  # Case ID
    'concept:name': ['Start', 'Activity A-Prestart', 'End', 'Start', 'End', 'Start', 'Activity B', 'End'],  # Activity
    'time:timestamp': pd.to_datetime([
        '2023-08-30 08:10:00',
        '2023-08-30 08:00:00',
        '2023-08-30 08:30:00',
        '2023-08-30 09:00:00',
        '2023-08-30 09:05:00',
        '2023-08-30 10:00:00',
        '2023-08-30 10:05:00',
        '2023-08-30 10:10:00'
    ])  # Timestamp
}

df = pd.DataFrame(data)

# Step 3: Get the start activities using pm4py
start_activities = pm4py.get_start_activities(df)  # the unsorted dataframe is passed directly

# Print the start activities
print("Start Activities:", start_activities)
# Output: Start Activities: {'Start': 3}

# Step 4: Testing - Check if the start activities are as expected
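# Case 1 really begins with 'Activity A-Prestart' (08:00), so only two cases begin with 'Start'
expected_start_activities = {'Start': 2, 'Activity A-Prestart': 1}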
assert start_activities == expected_start_activities, "Test failed: Start activities do not match expected output."
# Output: AssertionError: Test failed: Start activities do not match expected output.
print("Test passed!")

Suggested Solution(s):

I suggest sorting the dataframe by timestamp before taking the first activity of each group of rows that belong to the same case. I also suggest returning the activities dictionary sorted by frequency in descending order, by adding this line at the end of the functions: sorted_startact_dict = dict(sorted(startact_dict.items(), key=lambda x: x[1], reverse=True))
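As a side note on the frequency ordering, the standard library's Counter.most_common() should give the same descending order as the explicit sorted(...) call. The sketch below compares the two on an illustrative list of first activities; the list values are made up for the example.

```python
from collections import Counter

# Illustrative first activities of three cases (not real log data).
first_activities = ["Activity A-Prestart", "Start", "Start"]

counts = Counter(first_activities)

# Ordering as proposed in this issue: explicit sort by frequency, descending.
by_sorted = dict(sorted(counts.items(), key=lambda x: x[1], reverse=True))

# Equivalent ordering using the Counter API.
by_most_common = dict(counts.most_common())

print(list(by_most_common.items()))  # [('Start', 2), ('Activity A-Prestart', 1)]
```

Both variants rely on dictionaries preserving insertion order, which is guaranteed from Python 3.7 onward.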

The new implementation of get_start_activities would look like this:

from pm4py.util.xes_constants import DEFAULT_TIMESTAMP_KEY

def get_start_activities(df: pd.DataFrame, parameters: Optional[Dict[Union[str, Parameters], Any]] = None) -> Dict[str, int]:

    if parameters is None:
        parameters = {}

    case_id_glue = exec_utils.get_param_value(Parameters.CASE_ID_KEY, parameters, CASE_CONCEPT_NAME)
    activity_key = exec_utils.get_param_value(Parameters.ACTIVITY_KEY, parameters, DEFAULT_NAME_KEY)
    timestamp_key = exec_utils.get_param_value(Parameters.TIMESTAMP_KEY, parameters, DEFAULT_TIMESTAMP_KEY)
    grouped_df = parameters[GROUPED_DATAFRAME] if GROUPED_DATAFRAME in parameters else None

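    if grouped_df is None:
        # Sort by timestamp so that first() returns the earliest event of each case.
        # Note: a grouped dataframe passed in through the parameters is used as-is.
        sorted_df = df.sort_values(by=[timestamp_key])
        grouped_df = sorted_df.groupby(case_id_glue, sort=False)

    # Count how often each activity is the first event of a case
    startact_dict = dict(Counter(grouped_df[activity_key].first().to_numpy().tolist()))
    # Order the result by frequency, most frequent first
    sorted_startact_dict = dict(sorted(startact_dict.items(), key=lambda x: x[1], reverse=True))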

    return sorted_startact_dict

aryadegari commented 2 weeks ago

I realized that a dataframe returned by the pm4py.format_dataframe function is not affected by this problem. The solution mentioned above is still valid, though, because grouped_df[activity_key].first() only gives correct results if the rows in each group are sorted by timestamp.

Another solution, however, is to call pm4py.format_dataframe on the given dataframe at the beginning of the functions.
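Both options rely on the rows being sorted before first() is called. A third possibility, sketched here only as an idea and not as how pm4py works, avoids sorting altogether by selecting the row with the smallest timestamp in each case via idxmin(). The helper name start_activities_by_min_timestamp is hypothetical, and the sketch assumes the dataframe has a unique index.

```python
from collections import Counter

import pandas as pd


def start_activities_by_min_timestamp(
    df,
    case_key="case:concept:name",
    activity_key="concept:name",
    timestamp_key="time:timestamp",
):
    """Hypothetical helper (not part of pm4py): count the earliest activity per case."""
    # Index label of the row with the minimal timestamp in each case.
    # Assumes the dataframe index is unique.
    earliest_rows = df.groupby(case_key)[timestamp_key].idxmin()
    return dict(Counter(df.loc[earliest_rows, activity_key].tolist()))


# Same event log as in the replication code; case 1 is stored out of order.
df = pd.DataFrame({
    "case:concept:name": ["1", "1", "1", "2", "2", "3", "3", "3"],
    "concept:name": ["Start", "Activity A-Prestart", "End", "Start", "End",
                     "Start", "Activity B", "End"],
    "time:timestamp": pd.to_datetime([
        "2023-08-30 08:10:00", "2023-08-30 08:00:00", "2023-08-30 08:30:00",
        "2023-08-30 09:00:00", "2023-08-30 09:05:00",
        "2023-08-30 10:00:00", "2023-08-30 10:05:00", "2023-08-30 10:10:00",
    ]),
})

result = start_activities_by_min_timestamp(df)
print(result)
```

When several events of a case share the same minimal timestamp, idxmin() returns the first one in row order, so ties are still resolved by the original row order.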