pm4py / pm4py-core

Public repository for the PM4Py (Process Mining for Python) project.
https://pm4py.fit.fraunhofer.de
GNU General Public License v3.0

Get start and end activities buggy with Pandas dataframe #500

Open aryadegari opened 2 weeks ago

aryadegari commented 2 weeks ago

Problem:

The functions that get start or end activities return wrong results when the Pandas dataframe passed to them is not sorted by timestamp. The API links to the implementations are these:

Here I focus on getting start activities as an example.

The problem is with this line, which assumes that the rows behind grouped_df[activity_key] are already sorted by timestamp, and that is not always the case: startact_dict = dict(Counter(grouped_df[activity_key].first().to_numpy().tolist()))
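
To make the ordering dependence concrete, here is a minimal sketch using only pandas, with no pm4py involved. The two-row frame is made up for illustration. It shows that first() on a grouped column follows the row order of the frame, and that sorting by timestamp beforehand changes which activity is picked.

```python
from collections import Counter

import pandas as pd

# Two events of one case, stored out of chronological order (illustrative data).
df = pd.DataFrame({
    "case:concept:name": ["1", "1"],
    "concept:name": ["Start", "Activity A-Prestart"],
    "time:timestamp": pd.to_datetime(["2023-08-30 08:10:00", "2023-08-30 08:00:00"]),
})

# first() follows the row order of the frame, not the timestamps.
by_row_order = df.groupby("case:concept:name")["concept:name"].first()

# Sorting by timestamp first makes first() pick the earliest event of the case.
by_time = (
    df.sort_values("time:timestamp", kind="stable")
    .groupby("case:concept:name")["concept:name"]
    .first()
)

row_order_counts = dict(Counter(by_row_order.tolist()))
time_counts = dict(Counter(by_time.tolist()))
print(row_order_counts)  # {'Start': 1}
print(time_counts)       # {'Activity A-Prestart': 1}
```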

Replication Code:

# Step 1: Install required libraries [pandas=2.2.2, pm4py=2.7.11]

import pandas as pd
import pm4py
from pm4py.objects.log.util import dataframe_utils
from pm4py.objects.conversion.log import converter as log_converter

# Step 2: Create a sample DataFrame representing an event log
data = {
    'case:concept:name': ['1', '1', '1', '2', '2', '3', '3', '3'],  # Case ID
    'concept:name': ['Start', 'Activity A-Prestart', 'End', 'Start', 'End', 'Start', 'Activity B', 'End'],  # Activity
    'time:timestamp': pd.to_datetime([
        '2023-08-30 08:10:00',
        '2023-08-30 08:00:00',
        '2023-08-30 08:30:00',
        '2023-08-30 09:00:00',
        '2023-08-30 09:05:00',
        '2023-08-30 10:00:00',
        '2023-08-30 10:05:00',
        '2023-08-30 10:10:00'
    ])  # Timestamp
}

df = pd.DataFrame(data)

# Step 3: Get the start activities using pm4py
start_activities = pm4py.get_start_activities(df)

# Print the start activities
print("Start Activities:", start_activities)
# Output: Start Activities: {'Start': 3}

# Step 4: Testing - Check if the start activities are as expected
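# Case 1's earliest event (08:00) is 'Activity A-Prestart', so that is its start activity
expected_start_activities = {'Start': 2, 'Activity A-Prestart': 1}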
assert start_activities == expected_start_activities, "Test failed: Start activities do not match expected output."
# Output: AssertionError: Test failed: Start activities do not match expected output.
print("Test passed!")

Suggested Solution(s):

I suggest sorting the dataframe by timestamp before taking the first row of each case group. I also suggest returning the activities dictionary sorted by frequency in descending order, by adding this line at the end of the functions: sorted_startact_dict = dict(sorted(startact_dict.items(), key=lambda x: x[1], reverse=True))
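
As a small illustration of the frequency sort, here is a sketch on made-up counts (the numbers are hypothetical, not taken from a real log). It uses the sorted(...) expression from the suggestion and, for comparison, Counter.most_common() from the standard library, which yields the same ordering.

```python
from collections import Counter

# Hypothetical counts, not taken from a real log.
startact_dict = {"Activity B": 1, "Start": 5, "Activity A": 3}

# The expression from the suggestion: sort items by count, descending.
sorted_startact_dict = dict(sorted(startact_dict.items(), key=lambda x: x[1], reverse=True))

# Same ordering via the standard library's Counter.most_common().
via_most_common = dict(Counter(startact_dict).most_common())

print(list(sorted_startact_dict))  # ['Start', 'Activity A', 'Activity B']
print(list(via_most_common))       # ['Start', 'Activity A', 'Activity B']
```

Note that dict equality in Python ignores insertion order, so this sorting only affects how the result is iterated or printed, not comparisons against an expected dictionary.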

The new implementation of get_start_activities would look like this:

from pm4py.util.xes_constants import DEFAULT_TIMESTAMP_KEY

def get_start_activities(df: pd.DataFrame, parameters: Optional[Dict[Union[str, Parameters], Any]] = None) -> Dict[str, int]:

    if parameters is None:
        parameters = {}

    case_id_glue = exec_utils.get_param_value(Parameters.CASE_ID_KEY, parameters, CASE_CONCEPT_NAME)
    activity_key = exec_utils.get_param_value(Parameters.ACTIVITY_KEY, parameters, DEFAULT_NAME_KEY)
    timestamp_key = exec_utils.get_param_value(Parameters.TIMESTAMP_KEY, parameters, DEFAULT_TIMESTAMP_KEY)
    grouped_df = parameters[GROUPED_DATAFRAME] if GROUPED_DATAFRAME in parameters else None

    if grouped_df is None:
        sorted_df = df.sort_values(by=[timestamp_key])
        grouped_df = sorted_df.groupby(case_id_glue, sort=False)

    startact_dict = dict(Counter(grouped_df[activity_key].first().to_numpy().tolist()))
    sorted_startact_dict = dict(sorted(startact_dict.items(), key=lambda x: x[1], reverse=True))

    return sorted_startact_dict

aryadegari commented 2 weeks ago

I realized that, after calling the pm4py.format_dataframe function, the resulting dataframe should remain intact. The solution mentioned above remains valid though, because grouped_df[activity_key].first() only gives valid results if the rows of each group are sorted by timestamp.
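
One way to test that precondition on a given dataframe is sketched below. The helper name is_sorted_within_cases is hypothetical and not part of pm4py, and the default column names are assumed to be the XES-style ones used in the replication code.

```python
import pandas as pd


def is_sorted_within_cases(df, case_key="case:concept:name", timestamp_key="time:timestamp"):
    """Hypothetical helper: True if, within every case, timestamps never decrease in row order."""
    return bool(
        df.groupby(case_key, sort=False)[timestamp_key]
        .apply(lambda s: s.is_monotonic_increasing)
        .all()
    )


# Illustrative data: case 1 is stored out of chronological order.
unsorted_df = pd.DataFrame({
    "case:concept:name": ["1", "1", "2"],
    "time:timestamp": pd.to_datetime(
        ["2023-08-30 08:10:00", "2023-08-30 08:00:00", "2023-08-30 09:00:00"]
    ),
})
sorted_df = unsorted_df.sort_values("time:timestamp", kind="stable")

print(is_sorted_within_cases(unsorted_df))  # False
print(is_sorted_within_cases(sorted_df))    # True
```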

Another solution, however, is to call pm4py.format_dataframe on the given dataframe at the beginning of the functions.
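
A rough sketch of that "normalize on entry" idea, applied to the replication data, is below. It is written with plain pandas as a stand-in: it does not call pm4py.format_dataframe, and the helper name start_activities_normalized is hypothetical. The sort by case ID and timestamp is my assumption of what such a normalization step needs to guarantee for first() to be correct.

```python
from collections import Counter

import pandas as pd

CASE_KEY = "case:concept:name"
ACTIVITY_KEY = "concept:name"
TIMESTAMP_KEY = "time:timestamp"


def start_activities_normalized(df):
    """Hypothetical helper: order events per case by time, then count first activities."""
    # Stand-in for a normalization step at function entry (assumption, not pm4py code).
    ordered = df.sort_values([CASE_KEY, TIMESTAMP_KEY])
    first_activities = ordered.groupby(CASE_KEY, sort=False)[ACTIVITY_KEY].first()
    return dict(Counter(first_activities.tolist()))


# Same event log as in the replication code above.
df = pd.DataFrame({
    CASE_KEY: ["1", "1", "1", "2", "2", "3", "3", "3"],
    ACTIVITY_KEY: ["Start", "Activity A-Prestart", "End", "Start", "End",
                   "Start", "Activity B", "End"],
    TIMESTAMP_KEY: pd.to_datetime([
        "2023-08-30 08:10:00", "2023-08-30 08:00:00", "2023-08-30 08:30:00",
        "2023-08-30 09:00:00", "2023-08-30 09:05:00",
        "2023-08-30 10:00:00", "2023-08-30 10:05:00", "2023-08-30 10:10:00",
    ]),
})

result = start_activities_normalized(df)
print(result)  # {'Activity A-Prestart': 1, 'Start': 2}
```

The trade-off is that every call pays for a sort of the whole dataframe, even when the caller has already ordered it.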