filter_variants_top_k implicitly relies on dataframe order

pm4py / pm4py-core

Public repository for the PM4Py (Process Mining for Python) project.

https://pm4py.fit.fraunhofer.de

GNU General Public License v3.0


filter_variants_top_k implicitly relies on dataframe order #419

Closed awth13 closed 1 year ago

awth13 commented 1 year ago

filter_variants_top_k relies on the event_log dataframe being sorted by case and timestamp (see pm4py.objects.log.util.pandas_numpy_variants.apply(), which is called to retrieve the variants to filter), but this is not documented anywhere.

I am not sure if this affects other filters. I am also not sure if this behaviour is intended (should be documented) or not (should be fixed).

fit-alessandro-berti commented 1 year ago

Dear awth13, thanks for the question.

Yes. To compute variants more efficiently, we need consecutive events of the same case to also be consecutive rows in the dataframe, so the behavior is intended.
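
For readers who hit the same problem, here is a minimal sketch of why row order matters and how sorting beforehand avoids it. It is not pm4py's actual implementation: `variants_by_row_order` and `sort_for_variants` are hypothetical helpers written for illustration, and the column names are assumed to follow the usual pm4py defaults (`case:concept:name`, `concept:name`, `time:timestamp`).

```python
from collections import Counter
from itertools import groupby

import pandas as pd

# Assumed column names (the usual pm4py defaults); adjust to your data.
CASE, ACT, TS = "case:concept:name", "concept:name", "time:timestamp"


def sort_for_variants(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy sorted by case, then timestamp."""
    return df.sort_values([CASE, TS]).reset_index(drop=True)


def variants_by_row_order(df: pd.DataFrame) -> Counter:
    """Hypothetical helper: count variants while trusting row order.

    Each run of consecutive rows sharing a case id is treated as one trace,
    which only gives correct results if the frame is sorted by case.
    """
    rows = zip(df[CASE], df[ACT])
    traces = [
        tuple(activity for _, activity in run)
        for _, run in groupby(rows, key=lambda row: row[0])
    ]
    return Counter(traces)


# Two cases whose events are interleaved, as in a log ordered by time only.
df = pd.DataFrame(
    {
        CASE: ["c1", "c2", "c1", "c2"],
        ACT: ["A", "A", "B", "B"],
        TS: pd.to_datetime(
            [
                "2024-01-01 09:00",
                "2024-01-01 09:05",
                "2024-01-01 10:00",
                "2024-01-01 10:05",
            ]
        ),
    }
)

unsorted_variants = variants_by_row_order(df)
sorted_variants = variants_by_row_order(sort_for_variants(df))
```

On the interleaved frame, every single event is taken as its own trace, so the variants come out as `("A",)` and `("B",)`. After sorting, both cases yield the single variant `("A", "B")`. Sorting the dataframe by case and timestamp in the same way before passing it to filter_variants_top_k should satisfy the ordering assumption described above.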