pm4py / pm4py-core

Public repository for the PM4Py (Process Mining for Python) project.
https://pm4py.fit.fraunhofer.de
GNU General Public License v3.0

Pandas warnings when formatting dataframe as event log #351

Closed (fmannhardt closed this 1 year ago)

fmannhardt commented 2 years ago

I am trying to import a CSV file (sepsis.csv) using the following approach:

import pandas as pd
import pm4py

sepsis = pd.read_csv("sepsis.csv", sep=';')
sepsis_log = pm4py.format_dataframe(sepsis, case_id='case_id', activity_key='activity', timestamp_key='timestamp')
sepsis_log = pm4py.convert_to_event_log(sepsis_log)

The following warnings show up:

XXX\pm4py\utils.py:87: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[constants.CASE_CONCEPT_NAME] = df[constants.CASE_CONCEPT_NAME].astype("string")

XXXX\pm4py\utils.py:89: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[xes_constants.DEFAULT_NAME_KEY] = df[xes_constants.DEFAULT_NAME_KEY].astype("string")

The sepsis file is attached, but there is nothing special in it. I am using Python 3.8, and the installed package versions are:

Package            Version
------------------ ----------
asttokens          2.1.0     
backcall           0.2.0     
backports.zoneinfo 0.2.1     
certifi            2022.6.15 
colorama           0.4.6     
contourpy          1.0.6     
cvxopt             1.3.0     
cycler             0.11.0    
debugpy            1.6.3     
decorator          5.1.1     
deprecation        2.1.0     
entrypoints        0.4       
executing          1.2.0     
fonttools          4.38.0    
graphviz           0.20.1    
intervaltree       3.1.0     
ipykernel          6.17.0    
ipython            8.6.0     
ipywidgets         8.0.2     
jedi               0.18.1    
Jinja2             3.1.2
joblib             1.2.0
jsonpickle         2.2.0
jupyter_client     7.4.4
jupyter_core       4.11.2
jupyterlab-widgets 3.0.3
kiwisolver         1.4.4
lxml               4.9.1
MarkupSafe         2.1.1
matplotlib         3.6.1
matplotlib-inline  0.1.6
mizani             0.8.1
mpmath             1.2.1
nest-asyncio       1.5.6
networkx           2.8.7
numpy              1.23.4
packaging          21.3
palettable         3.3.0
pandas             1.5.1
parso              0.8.3
patsy              0.5.3
pickleshare        0.7.5
Pillow             9.3.0
pip                22.2.2
plotnine           0.10.1
pm4py              2.2.30
prompt-toolkit     3.0.31
psutil             5.9.3
pure-eval          0.2.2
pydotplus          2.0.2
Pygments           2.13.0
pyparsing          3.0.9
python-dateutil    2.8.2
pytz               2022.6
pyvis              0.3.0
pywin32            304
pyzmq              24.0.1
scikit-learn       1.1.3
scipy              1.9.3
setuptools         65.5.0
six                1.16.0
sklearn            0.0
sortedcontainers   2.4.0
stack-data         0.6.0
statsmodels        0.13.2
StringDist         1.0.9
sympy              1.11.1
threadpoolctl      3.1.0
torch              1.13.0+cpu
tornado            6.2
tqdm               4.64.1
traitlets          5.5.0
typing_extensions  4.4.0
tzdata             2022.6
wcwidth            0.2.5
wheel              0.37.1
widgetsnbextension 4.0.3
wincertstore       0.2

fit-alessandro-berti commented 2 years ago

Dear Felix,

Yes, I am aware of that warning. The only thing that line does is replace the content of the column with itself, cast to string. I am not aware of a better way to do this within Pandas.
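
One pattern that is often suggested for this kind of cast is to work on an explicit copy of the frame before assigning. The following is a minimal sketch, not pm4py's actual code: the helper name `cast_columns_to_string` and the sample data are hypothetical, and it assumes the warning arises because the frame being modified was derived from another frame (for example, by filtering out rows).

```python
import pandas as pd


def cast_columns_to_string(df, columns):
    """Return a new DataFrame with the given columns cast to the "string" dtype.

    Hypothetical helper for illustration; it is not part of pm4py.
    """
    # Work on an explicit copy, so the assignments below never target a
    # frame that pandas tracks as a slice of another frame.
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype("string")
    return out


# Made-up sample data; the None value becomes NaN in a float64 column.
raw = pd.DataFrame({
    "case_id": [1, 1, 2, None],
    "activity": ["A", "B", "A", "C"],
})

# A derived frame, similar in spirit to what a row filter produces.
filtered = raw.dropna(subset=["case_id"])

result = cast_columns_to_string(filtered, ["case_id", "activity"])
```

The trade-off is one extra copy of the data, in exchange for leaving the caller's frame untouched.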

In release 2.2.30 we deprecated the format_dataframe method, though, to give users more freedom; users now need to format the timestamp columns manually. Also, the methods now always accept the case, activity and timestamp columns as arguments (so there is no need to coerce them into case:concept:name, concept:name and time:timestamp).
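
As a rough illustration of what formatting the columns manually could look like, here is a minimal pandas-only sketch. The inline CSV text is a hypothetical stand-in for sepsis.csv (its real contents are not shown in this thread); only the column names `case_id`, `activity` and `timestamp` and the `;` separator are taken from the original report.

```python
import io

import pandas as pd

# Hypothetical stand-in for sepsis.csv, using the column names from the report.
csv_text = (
    "case_id;activity;timestamp\n"
    "A;ER Registration;2014-10-22 11:15:41\n"
    "A;ER Triage;2014-10-22 11:27:00\n"
    "B;ER Registration;2014-12-21 11:04:23\n"
)

sepsis = pd.read_csv(io.StringIO(csv_text), sep=";")

# Parse the timestamp column explicitly instead of relying on format_dataframe.
sepsis["timestamp"] = pd.to_datetime(sepsis["timestamp"], utc=True)

# Keep case identifiers as strings.
sepsis["case_id"] = sepsis["case_id"].astype("string")

# Order the events by case and then by time.
sepsis = sepsis.sort_values(["case_id", "timestamp"]).reset_index(drop=True)
```

A frame prepared this way could then be passed to pm4py functions together with the column names, presumably through keyword arguments such as `case_id_key`, `activity_key` and `timestamp_key`; those parameter names are an assumption here and should be checked against the installed pm4py version.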

fmannhardt commented 1 year ago

Do you have a link to the best practice for using Pandas data frames from version 2.2.30 onward?

fit-alessandro-berti commented 1 year ago

Sorry, my mistake. I meant the upcoming version 2.3.0 (of which we have released rc2).

First we will complete the release process of 2.3.0 and then update the documentation. For some time, the documentation on the website will overlap (so every method reported should work for both 2.2.x and 2.3.x). Later it will become more specialized. Initially, it will still be suggested to use the format_dataframe method even though it is deprecated.

fmannhardt commented 1 year ago

OK, then I will keep using the method, since the course starts in a week. I am not a Pandas expert, but I found this information: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas So maybe the method could avoid the warning by locally setting pd.options.mode.chained_assignment = None :-)
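
A minimal sketch of what "locally" could mean here, assuming a pandas version that still provides the `mode.chained_assignment` option (as pandas 1.5.x does): `pd.option_context` changes the option only inside the `with` block and restores it afterwards. The helper name `cast_to_string_quietly` and the sample data are hypothetical and not part of pm4py.

```python
import pandas as pd


def cast_to_string_quietly(df, column):
    """Cast one column to the "string" dtype with the chained-assignment
    check switched off for the duration of the assignment only.

    Hypothetical helper for illustration; it is not part of pm4py.
    """
    # The option is set to None only inside this block and is restored on exit.
    with pd.option_context("mode.chained_assignment", None):
        df[column] = df[column].astype("string")
    return df


# Made-up sample data.
events = pd.DataFrame({"case_id": [1, 2, 3], "activity": ["A", "B", "C"]})

# A derived frame, the kind of object that can trigger the warning.
subset = events[events["case_id"] > 1]
subset = cast_to_string_quietly(subset, "case_id")
```

Note that this only hides the warning rather than addressing its cause, so it is best limited to assignments that are known to be safe.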