pm4py / pm4py-core

Public repository for the PM4Py (Process Mining for Python) project.
https://pm4py.fit.fraunhofer.de
GNU General Public License v3.0

attribute driven trace clustering algorithm does not work - empty distance matrix #494

Closed PhoenixRising93 closed 1 month ago

PhoenixRising93 commented 2 months ago

Hi, I have a problem with the trace clustering algorithm provided at https://github.com/caoyukun0430/pm4py-source/tree/yukun_paper. I copied the folder to \anaconda3\Lib\site-packages\pm4py\algo\trace_cluster.

Then I tried to use the apply function described here: https://pm4py.fit.fraunhofer.de/static/assets/api/2.7.11/pm4py.algo.clustering.trace_attribute_driven.html

In Spyder it showed the error:

  File ~\anaconda3\lib\site-packages\pm4py\algo\clustering\trace_attribute_driven\algorithm.py:114 in apply
    Z = linkage(y, method='average')

  File ~\anaconda3\lib\site-packages\scipy\cluster\hierarchy.py:1068 in linkage
    n = int(distance.num_obs_y(y))

  File ~\anaconda3\lib\site-packages\scipy\spatial\distance.py:2572 in num_obs_y
    raise ValueError("The number of observations cannot be determined on "

ValueError: The number of observations cannot be determined on an empty distance matrix.

At first I thought it was a problem with my project-specific EventLog because I selected a categorical feature as the attribute. I also sliced the dataframe because I thought the log had too many events (400,000), but the error remained.

So I tried the Receipt.xes file from the trace_cluster folder. This is the code:

import pm4py
from pm4py.algo.clustering.trace_attribute_driven import algorithm
from pm4py.algo.clustering.trace_attribute_driven.algorithm import Variants

log = pm4py.read_xes(r"...\anaconda3\Lib\site-packages\pm4py\algo\trace_cluster\example\real_log\Receipt.xes")

#variant = trace_clustering.Variants.DMM_LEVEN
variant = Variants.VARIANT_DMM_LEVEN

algorithm.apply(log, 'case:responsible', variant)

This error showed again:

ValueError: The number of observations cannot be determined on an empty distance matrix.

I also tried different attributes, but that did not work either. In my initial attempt with the project-specific data, I converted the dataframe with 'log = pm4py.convert_to_event_log(dataframe)'.

I looked into the source code of num_obs_y, and this error occurs when k == 0, that is, when the distance matrix passed in has no entries. It seems that no distance matrix is calculated at all. What is the issue here?
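For context, a condensed distance matrix over n observations holds n(n-1)/2 pairwise distances, so a length of 0 means there were fewer than two observations to compare. The check can be sketched in plain Python as follows. This is a simplified stand-in for what SciPy's num_obs_y does, not its actual source, and the function name observations_from_condensed is made up for illustration.

```python
import math


def observations_from_condensed(k: int) -> int:
    """Return n such that n * (n - 1) / 2 == k, where k is the length of a
    condensed distance matrix. Hypothetical helper, not part of SciPy."""
    if k == 0:
        # Mirrors the situation in the traceback: nothing to cluster.
        raise ValueError("empty distance matrix: fewer than two observations")
    # Solve n * (n - 1) / 2 == k for n: n = (1 + sqrt(1 + 8k)) / 2
    n = (1 + math.isqrt(1 + 8 * k)) // 2
    if n * (n - 1) // 2 != k:
        raise ValueError("k is not of the form n * (n - 1) / 2")
    return n
```

Under this reading, the error points at the step that builds the distance matrix (it produced no pairs), rather than at the linkage call itself.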

Could you please add example code to the documentation showing how the implemented trace clustering works?

Thank you.

fit-alessandro-berti commented 1 month ago

Dear @PhoenixRising93

The dataframe is internally converted to an EventLog, in which case-level attributes no longer carry the 'case:' prefix. Therefore, you should call the function with 'responsible' instead of 'case:responsible'.
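To illustrate why the prefixed name leaves nothing to cluster, here is a minimal sketch in plain Python. It assumes that the conversion to an EventLog strips the 'case:' prefix from case-level columns, and it models traces as plain dictionaries. The helpers to_trace_attributes and group_by_attribute and the sample values are hypothetical and are not pm4py functions or data from Receipt.xes.

```python
from collections import defaultdict

CASE_PREFIX = "case:"  # assumed prefix of case-level columns in the dataframe


def to_trace_attributes(row: dict) -> dict:
    """Keep only case-level columns and strip the prefix (hypothetical helper)."""
    return {
        key[len(CASE_PREFIX):]: value
        for key, value in row.items()
        if key.startswith(CASE_PREFIX)
    }


def group_by_attribute(traces: list, attribute: str) -> dict:
    """Group trace indices by the value of a trace attribute.
    Traces that lack the attribute are skipped."""
    groups = defaultdict(list)
    for index, attrs in enumerate(traces):
        if attribute in attrs:
            groups[attrs[attribute]].append(index)
    return dict(groups)


# Made-up rows standing in for case-level columns of a dataframe.
rows = [
    {"case:concept:name": "c1", "case:responsible": "Resource01"},
    {"case:concept:name": "c2", "case:responsible": "Resource02"},
    {"case:concept:name": "c3", "case:responsible": "Resource01"},
]
traces = [to_trace_attributes(row) for row in rows]

with_prefix = group_by_attribute(traces, "case:responsible")  # finds no groups
without_prefix = group_by_attribute(traces, "responsible")    # finds two groups
```

With the prefixed name no trace matches, so there are no groups and no pairwise distances to compute, which would be consistent with the empty distance matrix reported in the traceback.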