Event order is incorrect when adjacent events differ by only milliseconds

lmpeiris commented 6 months ago

Summary:

I'm importing a dataframe into a pm4py event log. In some cases, adjacent events in the same case differ by only milliseconds. When the start activities of the event log are checked, the event that comes second in the case is selected as the start activity. The same is seen when the events are plotted using the heuristics miner's visualization output.

    import pm4py

    event_df.dtypes
    # id                object
    # title             object
    # action            object
    # user              object
    # time      datetime64[ns]
    # case              object
    # dtype: object

    # Rename the columns to the standard pm4py attribute names, then convert
    event_log = pm4py.convert_to_event_log(event_df.rename(columns={
        'case': 'case:concept:name',
        'time': 'time:timestamp',
        'action': 'concept:name'}))
    pm4py.get_start_activities(event_log)
    # {'gl_branch_created': 530, 'gl_issue_created': 37, 'gl_MR_created': 286, 'gl_PL_created': 907, 'gl_issue_assigned': 11}

    # Keep only the cases that were reported as starting with 'gl_issue_assigned'
    problem_logs = pm4py.filter_start_activities(event_log, ['gl_issue_assigned'])
    problem_df = pm4py.convert_to_dataframe(problem_logs)
    problem_df

[Screenshot: pm4py_timestamp_issue]

It can be clearly seen that the actual start action for both of the cases shown in the screenshot is 'gl_issue_created'. However, pm4py considered it to be 'gl_issue_assigned'. The same happens for other actions as well.

Versions:

Reproduced on pm4py versions 2.7.11.6 and 2.5.0, with pandas 1.5.3, running on Windows 11 with Python 3.10.9.

lmpeiris commented 6 months ago

Tested using timestamps with tz information as well, with the same result. Also tested using pandas 2.2.2.

lmpeiris commented 6 months ago

On further checking, I found that:

- When the format_dataframe method is used, the issue does not reproduce. I think it is mandatory for proper formatting, after all. If the documentation does not need to be updated, this issue can be closed as invalid.
- When log_converter is used instead of convert_to_event_log, the issue is reproduced if format_dataframe is not used.

fit-alessandro-berti commented 6 months ago

Dear @lmpeiris, the format_dataframe method indeed ensures the correct sorting of events based on timestamp (first sorting criterion within a case) and event index (second sorting criterion within a case).
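The two sorting criteria can be illustrated without pm4py. The sketch below is plain Python and is not the library's implementation. The event data are hypothetical, and 'gl_issue_labelled' is an invented action name used only to show a timestamp tie.

```python
from datetime import datetime, timedelta

base = datetime(2024, 1, 1, 10, 0, 0)

# Hypothetical events for a single case, listed in arrival order.
events = [
    {"case": "issue_1", "action": "gl_issue_assigned",
     "time": base + timedelta(milliseconds=2)},
    {"case": "issue_1", "action": "gl_issue_created",
     "time": base + timedelta(milliseconds=1)},
    {"case": "issue_1", "action": "gl_issue_labelled",
     "time": base + timedelta(milliseconds=2)},
]

# Record each event's arrival position to use as the event index.
indexed = [dict(event, index=i) for i, event in enumerate(events)]

# Group by case, then sort by timestamp first and event index second.
ordered = sorted(indexed, key=lambda e: (e["case"], e["time"], e["index"]))
ordered_actions = [e["action"] for e in ordered]
print(ordered_actions)
```

The event with the earliest timestamp comes first, even though it is only one millisecond ahead. The two events that share a timestamp keep their arrival order because the index breaks the tie.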
