Closed: goto-loop closed this issue 1 year ago
Dear @goto-loop
Upgrading pm4py to version 2.7.5 and Pandas to 2.0.3 will resolve the problem, as the calculation will then use nanosecond granularity:
pip install -U pm4py==2.7.5 pandas==2.0.3
We'll work towards integrating the nanosecond visualization, where needed, into the Graphviz visualization as well.
Cheers, Alessandro
Hi Alessandro,
thanks for your reply! Just being able to get the transition statistics in ms/us/ns would be great, we can probably find a workaround for the Graphviz visualization.
I've just checked and I'm using python==3.11.4, pandas==2.0.3 and pm4py==2.7.5, but if I run this:
performance_dfg, start_activities, end_activities = pm4py.discover_performance_dfg(dataframe, case_id_key='case:concept:name', activity_key='concept:name', timestamp_key='time:timestamp')
print(performance_dfg)
The statistics are definitely in seconds, not nanoseconds. If I'm not mistaken, those numbers get calculated in df_statistics.py, where .dt.total_seconds() is called on the delta?
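For reference, here is a minimal sketch (my own simplification, not pm4py's actual df_statistics.py code) of where the seconds come from: the per-case timestamp diff is a timedelta64[ns] Series, and .dt.total_seconds() converts it to floats measured in seconds, so sub-second gaps survive as fractional values rather than being truncated to zero.

```python
import pandas as pd

# Simplified sketch of the delta computation (not pm4py's actual code):
# sort by case and timestamp, diff per case, convert to seconds.
df = pd.DataFrame({
    "case:concept:name": ["c1", "c1", "c1"],
    "concept:name": ["A", "B", "C"],
    "time:timestamp": pd.to_datetime([
        "2023-01-01 00:00:00.000000",
        "2023-01-01 00:00:00.005000",
        "2023-01-01 00:00:00.012000",
    ]),
})
df = df.sort_values(["case:concept:name", "time:timestamp"])
delta = df.groupby("case:concept:name")["time:timestamp"].diff().dt.total_seconds()
print(delta.tolist())  # first entry NaN, then ~0.005 s and ~0.007 s
```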
Dear @goto-loop
It is probable that the timestamp column in your dataframe is not formatted correctly. Executing the following example in the root directory of the pm4py repository:
import pandas as pd
import pm4py

dataframe = pd.read_csv("tests/input_data/receipt.csv")
dataframe = pm4py.format_dataframe(dataframe)
print(dataframe[["case:concept:name", "concept:name", "time:timestamp"]])

perf_dfg, start_act, end_act = pm4py.discover_performance_dfg(dataframe)
print(perf_dfg)
As the first print, you would get the following dataframe (notice the timestamp column):
And the following as performance DFG:
{('Confirmation of receipt', 'T02 Check confirmation of receipt'): {'mean': 72163.8380231696, 'median': 35.189, 'max': 6223019.484, 'min': 12.51, 'sum': 77864781.227, 'stdev': 418456.31794732384}, ('Confirmation of receipt', 'T06 Determine necessity of stop advice'): {'mean': 93396.37272384937, 'median': 60.764, 'max': 13553150.937, 'min': 12.65, 'sum': 22321733.081, 'stdev': 900282.8256415251}, ('T02 Check confirmation of receipt', 'T03 Adjust confirmation of receipt'): {'mean': 428325.24804651167, 'median': 456.669, 'max': 3744625.904, 'min': 14.017, 'sum': 18417985.666, 'stdev': 985990.6902439436}, ('T02 Check confirmation of receipt', 'T04 Determine confirmation of receipt'): {'mean': 26183.46229579982, 'median': 27.648, 'max': 2950473.63, 'min': 9.01, 'sum': 29299294.309, 'stdev': 143891.93718245308}, ....
@fit-alessandro-berti, I stand corrected. I just had a look at your code snippet, went through my own code again, experimented a bit and you are absolutely right: the granularity is already nanoseconds.
I believe my confusion came from this little detail: if you call pd.Timedelta(1, "ns").total_seconds(), the result is 0.0. However, if you call .dt.total_seconds() on the difference of two datetime columns with nanosecond values, the result is what you would expect.
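To make the distinction concrete, here is a small sketch of the two code paths (the scalar Timedelta behavior may vary between pandas versions, so I only claim the vectorized result below):

```python
import pandas as pd

# Scalar path: on some pandas versions Timedelta.total_seconds()
# truncates below microsecond resolution, which can yield 0.0 for 1 ns.
scalar = pd.Timedelta(1, "ns").total_seconds()

# Vectorized path: .dt.total_seconds() on a timedelta64[ns] Series
# keeps nanosecond resolution as a fractional number of seconds.
a = pd.Series(pd.to_datetime(["2023-01-01 00:00:00.000000000"]))
b = pd.Series(pd.to_datetime(["2023-01-01 00:00:00.000000001"]))
vectorized = (b - a).dt.total_seconds()
print(vectorized.iloc[0])  # 1e-09
```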
Thanks a lot for taking the time to clarify this issue!
This is more of a feature request than an issue:
The performance DFG (pm4py.discover_performance_dfg() and pm4py.view_performance_dfg()) is a very useful tool for getting an overview of the transition times within a process, even if you don't plot it and just look at the output dictionary with metrics like median, max, etc. The only disadvantage is that the highest time resolution is seconds. We are currently applying PM4Py to manufacturing data where transitions often take only a few milliseconds, so when we plot the DFG or look at the statistics in the dictionary, we mostly see zeros. It would be great if it were possible to output arbitrary time resolutions like ms or µs.