pm4py / pm4py-core

Public repository for the PM4Py (Process Mining for Python) project.
https://pm4py.fit.fraunhofer.de
GNU General Public License v3.0

prefix_matrix vs. prefix_variants_matrix #94

Closed Rizzey closed 5 years ago

Rizzey commented 5 years ago

Just looking for clarification: is this intentional, or have I found a bug? While trying to build my own predictions of the remaining time in a case, I noticed that the log utility functions in pm4py.objects.log.util offer two ways of getting a prefix matrix, and the two return different matrices. get_prefix_matrix returns what I expected, but I tried get_prefix_variants_matrix first and I'm not sure what its output represents.
If you could

Code:

import pandas as pd
import numpy as np
import pm4py  # binds the name used by help(pm4py) at the end
from pm4py.objects.log.adapters.pandas import csv_import_adapter
from pm4py.objects.conversion.log import factory as log_conv_factory
import pm4py.objects.log.util.prefix_matrix as prefix_stuff

dataframe = csv_import_adapter.import_dataframe_from_path("./pm4py-source/tests/input_data/running-example.csv", sep=",")

log = log_conv_factory.apply(dataframe)

# This produces what I expected from a prefix matrix.
variants_matrix, names = prefix_stuff.get_variants_matrix(log=log)
prefixes_matrix, names = prefix_stuff.get_prefix_matrix(log=log)

# This does not: the prefix matrix returned here differs from the one above.
prefixes_matrix_2, variants_matrix_2, names = prefix_stuff.get_prefix_variants_matrix(log)
#names = np.array(names)
#prefixes = np.char.multiply(names,prefix_matrix)
#prefixes_2 = np.char.multiply(names,prefix_matrix_2)

print("prefixes_matrix =\n ",prefixes_matrix,"\n prefixes_matrix_2 = \n",prefixes_matrix_2,"\n",prefixes_matrix==prefixes_matrix_2)

print("variants_matrix=\n",variants_matrix,"\n variants_matrix_2 = \n",variants_matrix_2,"\n",variants_matrix==variants_matrix_2)

help(pm4py)

Output

>>> prefixes_matrix =
  [[0 0 0 0 0 1 0 0]
 [0 0 1 0 0 1 0 0]
 [1 0 1 0 0 1 0 0]
 [1 1 1 0 0 1 0 0]
 [1 1 1 0 0 1 1 0]
 [1 1 1 1 0 1 1 0]
 [2 1 1 1 0 1 1 0]
 [2 2 1 1 0 1 1 0]
 [2 2 1 1 1 1 1 0]
 [0 0 0 0 0 1 0 0]
 [1 0 0 0 0 1 0 0]
 [1 0 1 0 0 1 0 0]
 [1 1 1 0 0 1 0 0]
 [1 1 1 0 1 1 0 0]
 [0 0 0 0 0 1 0 0]
 [0 0 0 1 0 1 0 0]
 [1 0 0 1 0 1 0 0]
 [1 1 0 1 0 1 0 0]
 [1 1 0 1 0 1 0 1]
 [0 0 0 0 0 1 0 0]
 [0 0 1 0 0 1 0 0]
 [1 0 1 0 0 1 0 0]
 [1 1 1 0 0 1 0 0]
 [1 1 1 0 1 1 0 0]
 [0 0 0 0 0 1 0 0]
 [0 0 1 0 0 1 0 0]
 [1 0 1 0 0 1 0 0]
 [1 1 1 0 0 1 0 0]
 [1 1 1 0 0 1 1 0]
 [2 1 1 0 0 1 1 0]
 [2 1 2 0 0 1 1 0]
 [2 2 2 0 0 1 1 0]
 [2 2 2 0 0 1 2 0]
 [2 2 3 0 0 1 2 0]
 [3 2 3 0 0 1 2 0]
 [3 3 3 0 0 1 2 0]
 [3 3 3 0 0 1 2 1]
 [0 0 0 0 0 1 0 0]
 [1 0 0 0 0 1 0 0]
 [1 0 0 1 0 1 0 0]
 [1 1 0 1 0 1 0 0]
 [1 1 0 1 0 1 0 1]] 
 prefixes_matrix_2 = 
 [[0 0 0 0 0 6 0 0]
 [0 0 0 1 0 1 0 0]
 [0 0 3 0 0 3 0 0]
 [1 1 1 1 0 1 1 0]
 [2 0 0 0 0 2 0 0]
 [2 0 0 2 0 2 0 0]
 [2 1 1 0 0 1 1 0]
 [2 1 1 1 0 1 1 0]
 [2 1 2 0 0 1 1 0]
 [2 2 0 2 0 2 0 0]
 [2 2 0 2 0 2 0 2]
 [2 2 1 1 0 1 1 0]
 [2 2 1 1 1 1 1 0]
 [2 2 2 0 0 1 1 0]
 [2 2 2 0 0 1 2 0]
 [2 2 2 0 0 2 2 0]
 [2 2 2 0 2 2 0 0]
 [2 2 3 0 0 1 2 0]
 [3 2 3 0 0 1 2 0]
 [3 3 3 0 0 1 2 0]
 [3 3 3 0 0 1 2 1]
 [4 0 4 0 0 4 0 0]
 [4 4 4 0 0 4 0 0]] 
 False
variants_matrix=
 [[1 1 0 1 0 1 0 1]
 [1 1 1 0 1 1 0 0]
 [2 2 1 1 1 1 1 0]
 [3 3 3 0 0 1 2 1]] 
 variants_matrix_2 = 
 [[1 1 0 1 0 1 0 1]
 [1 1 1 0 1 1 0 0]
 [2 2 1 1 1 1 1 0]
 [3 3 3 0 0 1 2 1]] 
 [[ True  True  True  True  True  True  True  True]
 [ True  True  True  True  True  True  True  True]
 [ True  True  True  True  True  True  True  True]
 [ True  True  True  True  True  True  True  True]]
Help on package pm4py:

NAME
    pm4py - Process Mining for Python

PACKAGE CONTENTS
    algo (package)
    evaluation (package)
    objects (package)
    statistics (package)
    util (package)
    visualization (package)

DATA
    __author_email__ = 'pm4py@pads.rwth-aachen.de'
    __maintainer__ = 'PADS'
    __maintainer_email__ = 'pm4py@pads.rwth-aachen.de'

VERSION
    1.1.15

AUTHOR
    PADS

FILE
    /home/river/anaconda3/lib/python3.7/site-packages/pm4py/__init__.py

>>> 
Javert899 commented 5 years ago

Clarification:

get_prefix_variants_matrix is intended for region-based algorithms (such as ILP and place discovery algorithms). Those algorithms need 1) unique rows in the matrices and 2) more weight given to entries that occur often.

I personally would not recommend it for remaining time prediction. There are already far more useful representations for that purpose, such as the pm4py.objects.log.util.get_default_representation one.
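
To make the difference concrete, here is a minimal sketch in plain NumPy. It is not the pm4py implementation, and the helper names (prefix_rows, per_trace_prefix_matrix, weighted_unique_prefix_matrix) are hypothetical. It assumes, judging only from the output posted above, that the first matrix has one activity-count row per prefix of every trace, while the second keeps each distinct prefix row once and scales it by how often it occurs.

```python
from collections import Counter

import numpy as np


def prefix_rows(trace, activities):
    """Return one activity-count vector (as a tuple) per non-empty prefix."""
    index = {act: i for i, act in enumerate(activities)}
    counts = np.zeros(len(activities), dtype=int)
    rows = []
    for act in trace:
        counts[index[act]] += 1
        rows.append(tuple(int(c) for c in counts))
    return rows


def per_trace_prefix_matrix(traces, activities):
    """One row per prefix of every trace; duplicate rows are kept."""
    return np.array([row for t in traces for row in prefix_rows(t, activities)])


def weighted_unique_prefix_matrix(traces, activities):
    """Each distinct prefix row once, multiplied by its frequency (assumed)."""
    freq = Counter(row for t in traces for row in prefix_rows(t, activities))
    return np.array([np.array(row) * n for row, n in sorted(freq.items())])


# Toy log (illustrative data, not the running example).
activities = ["a", "b", "c"]
traces = [["a", "b"], ["a", "c"], ["a", "b"]]

per_trace = per_trace_prefix_matrix(traces, activities)
weighted = weighted_unique_prefix_matrix(traces, activities)

print(per_trace)
print(weighted)
# The shapes differ, so compare with np.array_equal rather than ==.
print(np.array_equal(per_trace, weighted))
```

In this toy log the prefix "a" occurs three times, so it contributes three identical rows to the first matrix and a single row scaled by 3 to the second. That matches the pattern in the posted output, where the first row of prefixes_matrix_2 is the single-activity prefix scaled by 6.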

Rizzey commented 5 years ago

Thanks for the clarification. In the meantime, I've found a new issue that may be affecting your prediction branches. I'll close this one and open a new one.