pm4py / pm4py-core

Public repository for the PM4Py (Process Mining for Python) project.
https://pm4py.fit.fraunhofer.de
GNU General Public License v3.0
722 stars, 286 forks

Problem using duration_diagnostics.diagnose_from_notexisting_activities #444

Closed JazminADiaz closed 1 year ago

JazminADiaz commented 1 year ago

Hi there, I don't know if I'm doing something wrong. When I check for unwanted activities I get the dictionary, but when I print act_diagnostics, it is empty. I'm using the same log I used to create unwanted_activities. Please help.

parameters_tbr = {
    token_based_replay.Variants.TOKEN_REPLAY.value.Parameters.DISABLE_VARIANTS: True,
    token_based_replay.Variants.TOKEN_REPLAY.value.Parameters.ENABLE_PLTR_FITNESS: True,
}

replayed_traces, place_fitness, trans_fitness, unwanted_activities = token_based_replay.apply(
    log, net_comp, im_comp, fm_comp, parameters=parameters_tbr)

act_diagnostics = duration_diagnostics.diagnose_from_notexisting_activities(log, unwanted_activities)

print(act_diagnostics)
for act in act_diagnostics:
    print(act, act_diagnostics[act])

fit-alessandro-berti commented 1 year ago

Dear @JazminADiaz

unwanted_activities should contain the activities in the log that are not in the model. If all the activities of the log are in the model, the result is empty.

JazminADiaz commented 1 year ago

That is not the case. I'm obtaining the unwanted activities another way, since I couldn't make the code you provided work. Here is how I did it:

unpredicted_activity_details = []

for trace in log:
    attributes = trace.attributes
    for event in trace:
        activity_name = event['concept:name']
        if activity_name not in [transition.label for transition in net_comp.transitions]:
            case_id = attributes.get('concept:name', '') 
            timestamp = event.get('time:timestamp', '')
            resource = event.get('org:resource', '')
            unpredicted_activity_details.append({
                "ID case": case_id,
                "Activity": activity_name,
                "Time_Stamp": timestamp,
                "Resource": resource,
                "Event": event  
            })

There are plenty of unwanted activities. I would much rather use your code, so if you can explain what may be happening, I'm happy to provide any information you need.

fit-alessandro-berti commented 1 year ago

Dear @JazminADiaz

I have reproduced the problem. In pm4py 2.3.0 we changed the default log format to dataframe. For that method, you still need to make sure to use the EventLog class. You can take a look at the following example, where the read_xes method is used along with the option to get an EventLog back:

import pm4py
from pm4py.algo.conformance.tokenreplay.diagnostics import duration_diagnostics

log = pm4py.read_xes("tests/input_data/receipt.xes", return_legacy_log_object=True)
filtered_log = pm4py.filter_variants_top_k(log, 1)
net, im, fm = pm4py.discover_petri_net_inductive(filtered_log)
replayed_traces, place_fitness, trans_fitness, unwanted_activities = pm4py.conformance_diagnostics_token_based_replay(log, net, im, fm, opt_parameters={"enable_pltr_fitness": True})
print(unwanted_activities)
act_diagnostics = duration_diagnostics.diagnose_from_notexisting_activities(log, unwanted_activities)
print(act_diagnostics)

We will fix it properly in a future release.

JazminADiaz commented 1 year ago

I don't really know how to use that. I have a CSV file: I load it into a dataframe, convert it with the log_converter from pm4py, and use the resulting log. Where should I introduce return_legacy_log_object=True?

Thank you for your answer btw!

I don't know if it helps, but I have had issues with other parts of the code on your page because the activities have two elements: a really long id (I'm assuming it is an id) and a label, which is the actual name of the activity. I usually have to add a step that extracts the label to make the code work. I don't know whether that has anything to do with it, but I mention it anyway.

fit-alessandro-berti commented 1 year ago

Then that is already good. You do not need the part with return_legacy_log_object=True.

To make sure that the columns have the correct types (for example, the timestamp should be a datetime column), you can use pm4py.format_dataframe(...), which ensures that. Check https://pm4py.fit.fraunhofer.de/documentation for the syntax of the command.

JazminADiaz commented 1 year ago

I haven't really had any issues with the timestamps; your filters that rely on the timestamp work just fine. It is only the unwanted activities that I haven't been able to get working.

NKMatha commented 1 year ago

Hi @fit-alessandro-berti, I got the act_diagnostics dict, but the data is not clear to me.

{'Event_X': {'n_containing': 30, 'n_fit': 2163, 'fit_median_time': 3029460.0, 'containing_median_time': 2592000.0, 'relative_throughput': 0.8555980273712147},
 'Event_Y': {'n_containing': 11, 'n_fit': 2163, 'fit_median_time': 3029460.0, 'containing_median_time': 2592000.0, 'relative_throughput': 0.8555980273712147},
 'Event_Z': {'n_containing': 1, 'n_fit': 2163, 'fit_median_time': 3029460.0, 'containing_median_time': 2592000.0, 'relative_throughput': 0.8555980273712147}}

Can you please clarify what these values represent (n_containing, n_fit, fit_median_time, containing_median_time)?

And what are n_fit and n_underfed in trans_diagnostics (diagnose_from_trans_fitness)?

fit-alessandro-berti commented 1 year ago

containing traces (n_containing) => the number of cases that contain at least one event with the specified activity

fit traces (n_fit) => the number of cases that do NOT contain an event with the specified activity

fit_median_time => among all the cases that are "fit" according to the aforementioned criterion (so without the activity), compute the median time

containing_median_time => among all the cases containing at least one event with the given activity, compute the median time

relative_throughput = containing_median_time / fit_median_time

When the relative_throughput is greater than 1, then the activity leads to an increase of the throughput times. Otherwise, it does not lead to an increase of the throughput times.
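The definitions above can be illustrated with a small standalone sketch. It is not pm4py's implementation: the function name diagnose_activity and the toy cases (a list of activity names plus a duration in seconds for each case) are assumptions made only for illustration, and the arithmetic simply follows the description given in this comment.

```python
from statistics import median


def diagnose_activity(cases, activity):
    """Recompute the described metrics for one activity.

    `cases` is a hypothetical list of (activities, duration_seconds) pairs;
    it stands in for a real event log and is not a pm4py structure.
    """
    # Durations of cases that contain at least one event with the activity
    containing = [duration for acts, duration in cases if activity in acts]
    # Durations of "fit" cases, meaning cases without the activity
    fit = [duration for acts, duration in cases if activity not in acts]
    if not containing or not fit:
        return None  # a median is undefined for an empty group

    fit_median = median(fit)
    containing_median = median(containing)
    return {
        "n_containing": len(containing),
        "n_fit": len(fit),
        "fit_median_time": fit_median,
        "containing_median_time": containing_median,
        "relative_throughput": containing_median / fit_median,
    }


# Toy data: "X" plays the role of an activity that is not in the model
cases = [
    (["A", "B", "C"], 100.0),
    (["A", "C"], 80.0),
    (["A", "X", "C"], 150.0),
    (["A", "C"], 120.0),
    (["A", "X", "B", "C"], 250.0),
]

result = diagnose_activity(cases, "X")
print(result)
```

In this toy data the two cases containing X have a median of 200.0 seconds and the three cases without it have a median of 100.0 seconds, so relative_throughput is 2.0, which is greater than 1. In the output posted above, the ratio 2592000.0 / 3029460.0 is about 0.856, which is below 1.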

NKMatha commented 1 year ago

Thanks @fit-alessandro-berti

By the way, does the same apply to trans_diagnostics?

fit-alessandro-berti commented 1 year ago

Yes exactly :)