pm4py / pm4py-core

Public repository for the PM4Py (Process Mining for Python) project.
https://pm4py.fit.fraunhofer.de
GNU General Public License v3.0

Event order is incorrect when adjacent events have only milliseconds difference #481

Closed lmpeiris closed 6 months ago

lmpeiris commented 6 months ago

Summary:

I'm importing a dataframe into pm4py event logs. In some cases, adjacent events in the same case differ by only milliseconds. When the start activities of the event log are checked, the event that comes second in the case is selected as the start activity. The same is seen when the events are plotted using the heuristic miner's visualization output.

```
event_df.dtypes

id                object
title             object
action            object
user              object
time      datetime64[ns]
case              object
dtype: object
```

```
event_log = pm4py.convert_to_event_log(event_df.rename(columns={'case': 'case:concept:name', 'time': 'time:timestamp', 'action': 'concept:name'}))
pm4py.get_start_activities(event_log)

{'gl_branch_created': 530, 'gl_issue_created': 37, 'gl_MR_created': 286, 'gl_PL_created': 907, 'gl_issue_assigned': 11}
```

```
problem_logs = pm4py.filter_start_activities(event_log, ['gl_issue_assigned'])
problem_df = pm4py.convert_to_dataframe(problem_logs)
problem_df
```

[Screenshot: pm4py_timestamp_issue]

It can be clearly seen that the actual start action for both of the cases shown in the screenshot is 'gl_issue_created'. However, pm4py considered it to be 'gl_issue_assigned'. The same happens for other actions as well.

Versions:

Reproduced on pm4py versions 2.7.11.6 and 2.5.0. pandas version is 1.5.3, running on Windows 11, Python 3.10.9.

lmpeiris commented 6 months ago

Tested using timestamps with tz information as well, same result. Also tested using pandas 2.2.2.

lmpeiris commented 6 months ago

On further checking I found that:

fit-alessandro-berti commented 6 months ago

Dear @lmpeiris, the format_dataframe method indeed ensures the correct sorting of events based on timestamp (first sorting criterion inside a case) and event index (second sorting criterion inside a case).
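For illustration, the two sorting criteria described above can be sketched in plain Python. This is not pm4py's implementation, only a minimal sketch of the ordering rule (timestamp first, original event index as tie-breaker inside a case). The sample events, timestamps, and the helper names `sort_events` and `start_activities` are hypothetical.

```python
from datetime import datetime
from itertools import groupby

# Hypothetical events as (case id, activity, timestamp), deliberately listed
# so that case-1 is out of order and case-2 has identical timestamps.
events = [
    ("case-1", "gl_issue_assigned", datetime(2024, 1, 1, 9, 0, 0, 250000)),
    ("case-1", "gl_issue_created", datetime(2024, 1, 1, 9, 0, 0, 120000)),
    ("case-2", "gl_issue_created", datetime(2024, 1, 1, 10, 0, 0)),
    ("case-2", "gl_issue_assigned", datetime(2024, 1, 1, 10, 0, 0)),
]


def sort_events(rows):
    """Order events by case, then timestamp, then original row index."""
    indexed = [(case, ts, idx, act) for idx, (case, act, ts) in enumerate(rows)]
    # The row index breaks ties between events with equal timestamps.
    indexed.sort(key=lambda r: (r[0], r[1], r[2]))
    return [(case, act, ts) for case, ts, idx, act in indexed]


def start_activities(rows):
    """Count the first activity of each case after sorting."""
    counts = {}
    for _, group in groupby(sort_events(rows), key=lambda r: r[0]):
        first_activity = next(group)[1]
        counts[first_activity] = counts.get(first_activity, 0) + 1
    return counts


print(start_activities(events))  # {'gl_issue_created': 2}
```

With this ordering, both hypothetical cases start with 'gl_issue_created': case-1 because its timestamp is 130 ms earlier, case-2 because it appears first in the input when the timestamps are equal.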