pm4py / pm4py-core

Public repository for the PM4Py (Process Mining for Python) project.
https://pm4py.fit.fraunhofer.de
GNU General Public License v3.0

Issue with edge frequency count and ordering #166

Closed serkserk closed 4 years ago

serkserk commented 4 years ago

I have an issue where, depending on the ordering of my data, I get different frequency values in the DFG.

You can see it in this notebook: https://gist.github.com/serkserk/9b8e7539e72576ff49d740abf41040b7

I first use my data without any ordering; for example, the count for the edge "2- En vivier, 4- Contractualisé" is 6, which is correct. Then I try with the pm4py util but get 14, which is wrong (and I get the same with my custom ordering).

Javert899 commented 4 years ago

Dear Serkserk,

The problem with the sorting on timestamps occurs if you have several events having the same timestamp.

You need to define an index column in the dataframe and use it as a secondary key in sort_values. Example:

df["@@index"] = df.index
df = df.sort_values(["time:timestamp", "@@index"])
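To make the effect of the secondary key concrete, here is a minimal sketch on a hypothetical three-event log (the case id, activity names, and timestamps are invented for illustration, and the row reversal only stands in for whatever upstream step reorders the data). It assumes plain pandas and does not call pm4py.

```python
import pandas as pd

# Hypothetical toy log: events "A" and "B" share the same timestamp.
df = pd.DataFrame(
    {
        "case:concept:name": ["c1", "c1", "c1"],
        "concept:name": ["A", "B", "C"],
        "time:timestamp": pd.to_datetime(
            ["2020-07-10 10:00", "2020-07-10 10:00", "2020-07-10 11:00"]
        ),
    }
)

# Remember the recorded row order before anything reorders the rows.
df["@@index"] = df.index

# Stand-in for an upstream reordering step (assumption: a simple reversal).
shuffled = df.iloc[::-1]

# Timestamp-only sort: the tie between "A" and "B" is not decided by any
# column, so the rows keep whatever relative order they arrived in.
by_time = shuffled.sort_values("time:timestamp", kind="mergesort")

# Timestamp plus index: ties fall back to the recorded order.
by_time_and_index = shuffled.sort_values(["time:timestamp", "@@index"])

print(by_time["concept:name"].tolist())
print(by_time_and_index["concept:name"].tolist())
```

With the timestamp alone, the two tied events come out in the reordered sequence; with the index as a second key, the recorded sequence is restored.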

serkserk commented 4 years ago

A case (with duplicate timestamps/rows) that does not contain the event "2- En vivier" followed by "4- Contractualisé" still affects this edge's frequency. Is this normal?

oscar-ramsing commented 4 years ago

Nice guys

On 10 Jul 2020, at 16.49, Serkan notifications@github.com wrote:

Case that does not have the events "2- En vivier" followed by "4- Contractualisé" affect this edge frequency, is this normal ?

fit-alessandro-berti commented 4 years ago

Yes, events with duplicate timestamps can end up in the wrong order after the sort operation, which changes which events directly follow each other within a case. Hence the need to sort with a double key.
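As a rough illustration of how a case without the pair can still inflate the edge count, the sketch below counts directly-follows pairs by hand on a hypothetical two-case log. The helper `dfg_counts` is written for this example only and is not pm4py's implementation; the data and the row reversal are assumptions.

```python
from collections import Counter

import pandas as pd

A, B = "2- En vivier", "4- Contractualisé"


def dfg_counts(df):
    """Count directly-follows pairs per case, using the row order as given."""
    counts = Counter()
    for _, events in df.groupby("case:concept:name", sort=False):
        acts = events["concept:name"].tolist()
        counts.update(zip(acts, acts[1:]))
    return counts


# Hypothetical log: c1 really has A followed by B; c2 was recorded as
# B then A, with both events sharing one timestamp.
log = pd.DataFrame(
    {
        "case:concept:name": ["c1", "c1", "c2", "c2"],
        "concept:name": [A, B, B, A],
        "time:timestamp": pd.to_datetime(
            ["2020-07-10 10:00", "2020-07-10 11:00",
             "2020-07-10 12:00", "2020-07-10 12:00"]
        ),
    }
)
log["@@index"] = log.index

# Stand-in for rows arriving in a different order (assumption: reversal).
reordered = log.iloc[::-1]

# Single key: the tie in c2 is left in the reordered sequence (A before B).
single_key = reordered.sort_values("time:timestamp", kind="mergesort")
# Double key: the tie in c2 is resolved by the recorded order (B before A).
double_key = reordered.sort_values(["time:timestamp", "@@index"])

print(dfg_counts(single_key)[(A, B)])
print(dfg_counts(double_key)[(A, B)])
```

In this toy log, the single-key sort makes c2 look as if it contained the pair, so the edge is counted once too often; the double-key sort counts it only for c1.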