Closed: lmpeiris closed this issue 6 months ago
I also tested with timezone-aware timestamps and with pandas 2.2.2; the result is the same.
Dear @lmpeiris, the format_dataframe method does ensure that events are sorted correctly: by timestamp (the first sorting criterion within a case) and then by event index (the second sorting criterion within a case).
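To illustrate the ordering described in this reply, here is a minimal sketch in plain Python. It is not pm4py's actual implementation; the sample events and the helpers `sort_events` and `start_activities` are made up for illustration. It sorts events by case, then by timestamp, then by original row index as a tie-breaker, and counts the first activity of each case:

```python
from datetime import datetime

# Hypothetical events: (row index, case id, activity, timestamp).
# The two events of case "A" are 2 ms apart and arrive in reverse order.
events = [
    (0, "A", "gl_issue_assigned", datetime(2024, 1, 1, 10, 0, 0, 5000)),
    (1, "A", "gl_issue_created", datetime(2024, 1, 1, 10, 0, 0, 3000)),
    (2, "B", "gl_branch_created", datetime(2024, 1, 1, 11, 0, 0)),
]


def sort_events(rows):
    """Order rows by case, then timestamp, then original row index."""
    return sorted(rows, key=lambda r: (r[1], r[3], r[0]))


def start_activities(rows):
    """Count the first activity of every case after sorting."""
    counts = {}
    seen_cases = set()
    for _, case, activity, _ in sort_events(rows):
        if case not in seen_cases:
            seen_cases.add(case)
            counts[activity] = counts.get(activity, 0) + 1
    return counts


print(start_activities(events))
# {'gl_issue_created': 1, 'gl_branch_created': 1}
```

With this ordering, the earlier event of case "A" becomes its start activity even though it appears later in the input and the two timestamps differ only by milliseconds.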
Summary:
I'm importing a dataframe into a pm4py event log. In some cases of the event log, adjacent events in the same case are only milliseconds apart. When I check the start activities of the event log, the event that comes second in the case is selected as the start activity. The same is visible when the events are plotted using the heuristics miner's visualization output.
```python
event_df.dtypes
```

```
id               object
title            object
action           object
user             object
time     datetime64[ns]
case             object
dtype: object
```

```python
import pm4py

# Rename the columns to the standard pm4py/XES attribute names before converting.
event_log = pm4py.convert_to_event_log(event_df.rename(columns={
    'case': 'case:concept:name',
    'time': 'time:timestamp',
    'action': 'concept:name',
}))
pm4py.get_start_activities(event_log)
```

```
{'gl_branch_created': 530, 'gl_issue_created': 37, 'gl_MR_created': 286, 'gl_PL_created': 907, 'gl_issue_assigned': 11}
```

```python
# Keep only the cases whose reported start activity is 'gl_issue_assigned'.
problem_logs = pm4py.filter_start_activities(event_log, ['gl_issue_assigned'])
problem_df = pm4py.convert_to_dataframe(problem_logs)
problem_df
```

It can be clearly seen that the actual start action for both of the cases shown in the screenshot is 'gl_issue_created'. However, pm4py considered it to be 'gl_issue_assigned'. The same happens for other actions as well.
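One way to cross-check the reported start activities independently of pm4py is to compute the first event per case directly in pandas. This is only a sketch under assumptions: the toy `event_df` below is invented, reuses just the column names from the report (`case`, `time`, `action`), and the helper column `row` is a hypothetical tie-breaker for identical timestamps.

```python
import pandas as pd

# Invented sample data using the column names from the report.
# In both cases the "created" event is earlier by a few hundred milliseconds,
# but it appears second in the dataframe.
event_df = pd.DataFrame({
    "case": ["c1", "c1", "c2", "c2"],
    "action": ["gl_issue_assigned", "gl_issue_created",
               "gl_issue_assigned", "gl_issue_created"],
    "time": pd.to_datetime([
        "2024-01-01 10:00:00.250", "2024-01-01 10:00:00.100",
        "2024-01-02 09:30:00.900", "2024-01-02 09:30:00.400",
    ]),
})

# Sort by case, then timestamp, then original row position as a tie-breaker.
ordered = (
    event_df.assign(row=range(len(event_df)))
    .sort_values(["case", "time", "row"])
)

# First action of each case, then how often each action starts a case.
first_actions = ordered.groupby("case")["action"].first()
start_counts = first_actions.value_counts().to_dict()
print(start_counts)
```

If the counts obtained this way differ from what `pm4py.get_start_activities` reports for the same data, the events were probably not in timestamp order when the log was converted.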
Versions:
Reproduced on pm4py versions 2.7.11.6 and 2.5.0, with pandas 1.5.3, running on Windows 11 with Python 3.10.9.