pm4py / pm4py-core

Public repository for the PM4Py (Process Mining for Python) project.
https://pm4py.fit.fraunhofer.de
GNU General Public License v3.0

Cannot interpret 'StringDtype' as a data type #346

Closed: dpalmasan closed this issue 2 years ago

dpalmasan commented 2 years ago

Context

> python --version
Python 3.8.6

Hello, I have a dataset with three columns:

> df.head(3)
        case_id   activity    event_ts
114612950671529  activity1  1645703155
141116910633456  activity2  1601435806
141116910633456  activity2  1601436080
> df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000000 entries, 0 to 4999999
Data columns (total 3 columns):
 #   Column    Dtype
---  ------    -----
 0   case_id   Int64
 1   activity  object
 2   event_ts  Int32
dtypes: Int32(1), Int64(1), object(1)
memory usage: 104.9+ MB

But when I run:

import pm4py

event_log = pm4py.format_dataframe(
    df, case_id="case_id", activity_key="activity", timestamp_key="event_ts"
)
event_log.head()

I get:

Cannot interpret 'StringDtype' as a data type
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-43-f164ec2c97db> in <module>
      2 
      3 
----> 4 event_log = pm4py.format_dataframe(
      5     dfn, case_id="bm_id", activity_key="activity", timestamp_key="timestamp"
      6 )
/data/users/dpalmasan/fbsource/buck-out/v2/gen/fbcode/4b41d6efb851df45/scripts/dpalmasan/__bento_kernel_bi_process_mining__/bento_kernel_bi_process_mining#link-tree/pm4py/utils.py in format_dataframe(df, case_id, activity_key, timestamp_key, start_timestamp_key, timest_format)
     85                            xes_constants.DEFAULT_TIMESTAMP_KEY}, how="any")
     86     # make sure the case ID column is of string type
---> 87     df[constants.CASE_CONCEPT_NAME] = df[constants.CASE_CONCEPT_NAME].astype("string")
     88     # make sure the activity column is of string type
     89     df[xes_constants.DEFAULT_NAME_KEY] = df[xes_constants.DEFAULT_NAME_KEY].astype("string")
/data/users/dpalmasan/fbsource/buck-out/v2/gen/fbcode/4b41d6efb851df45/scripts/dpalmasan/__bento_kernel_bi_process_mining__/bento_kernel_bi_process_mining#link-tree/pandas/core/generic.py in astype(self, dtype, copy, errors)
   5696         else:
   5697             # else, only a single dtype is given
-> 5698             new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors)
   5699             return self._constructor(new_data).__finalize__(self)
   5700 
/data/users/dpalmasan/fbsource/buck-out/v2/gen/fbcode/4b41d6efb851df45/scripts/dpalmasan/__bento_kernel_bi_process_mining__/bento_kernel_bi_process_mining#link-tree/pandas/core/internals/managers.py in astype(self, dtype, copy, errors)
    580 
    581     def astype(self, dtype, copy: bool = False, errors: str = "raise"):
--> 582         return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
    583 
    584     def convert(self, **kwargs):
/data/users/dpalmasan/fbsource/buck-out/v2/gen/fbcode/4b41d6efb851df45/scripts/dpalmasan/__bento_kernel_bi_process_mining__/bento_kernel_bi_process_mining#link-tree/pandas/core/internals/managers.py in apply(self, f, filter, **kwargs)
    440                 applied = b.apply(f, **kwargs)
    441             else:
--> 442                 applied = getattr(b, f)(**kwargs)
    443             result_blocks = _extend_blocks(applied, result_blocks)
    444 
/data/users/dpalmasan/fbsource/buck-out/v2/gen/fbcode/4b41d6efb851df45/scripts/dpalmasan/__bento_kernel_bi_process_mining__/bento_kernel_bi_process_mining#link-tree/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors)
    605         if self.is_extension:
    606             # TODO: Should we try/except this astype?
--> 607             values = self.values.astype(dtype)
    608         else:
    609             if issubclass(dtype.type, str):
/data/users/dpalmasan/fbsource/buck-out/v2/gen/fbcode/4b41d6efb851df45/scripts/dpalmasan/__bento_kernel_bi_process_mining__/bento_kernel_bi_process_mining#link-tree/pandas/core/arrays/integer.py in astype(self, dtype, copy)
    467             kwargs = {}
    468 
--> 469         data = self.to_numpy(dtype=dtype, **kwargs)
    470         return astype_nansafe(data, dtype, copy=False)
    471 
/data/users/dpalmasan/fbsource/buck-out/v2/gen/fbcode/4b41d6efb851df45/scripts/dpalmasan/__bento_kernel_bi_process_mining__/bento_kernel_bi_process_mining#link-tree/pandas/core/arrays/masked.py in to_numpy(self, dtype, copy, na_value)
    133             data[self._mask] = na_value
    134         else:
--> 135             data = self._data.astype(dtype, copy=copy)
    136         return data
    137 
TypeError: Cannot interpret 'StringDtype' as a data type

What I have tried

Am I doing something wrong? I followed the tutorial and changed only the data source.

dpalmasan commented 2 years ago

I took a deeper look. The problem seems to be that my case_id column is Int64, and it cannot be converted to string. My pandas version may be too old; I have seen some related issues in the pandas repository.
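For anyone hitting the same error, here is a minimal sketch of one possible workaround: convert the nullable Int64 case ID column to plain Python strings before calling pm4py.format_dataframe, so the column is no longer an integer extension array when pm4py casts it. The helper name case_id_as_str is hypothetical, the sample values are copied from df.head(3) above, and this has not been tested on the old pandas version shown in the traceback.

```python
import pandas as pd

# Small stand-in for the dataframe in the report (values copied from df.head(3)).
df = pd.DataFrame(
    {
        "case_id": pd.array(
            [114612950671529, 141116910633456, 141116910633456], dtype="Int64"
        ),
        "activity": ["activity1", "activity2", "activity2"],
        "event_ts": pd.array([1645703155, 1601435806, 1601436080], dtype="Int32"),
    }
)


def case_id_as_str(frame, column="case_id"):
    """Hypothetical helper: return a copy whose `column` holds Python strings."""
    # Drop rows without a case ID first, so missing values are not stringified.
    out = frame.dropna(subset=[column]).copy()
    # Go through object dtype and str() instead of astype("string").
    out[column] = out[column].astype(object).map(str)
    return out


converted = case_id_as_str(df)
print(converted["case_id"].tolist())
```

The converted dataframe can then be passed to pm4py.format_dataframe with the same arguments as in the original call.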

fit-alessandro-berti commented 2 years ago

Dear @dpalmasan

Thank you for the question. I suggest updating pandas to the latest version (pip install -U pandas).

It may be that you obtained the pandas dataframe by converting a Spark dataframe. The resulting data types can differ from the standard pandas ones, which would explain the problem. Is this the case?

dpalmasan commented 2 years ago

@fit-alessandro-berti Sure, I think this issue can be closed. It was probably a known pandas issue that has already been fixed. Sadly, I cannot test that, because I cannot update pandas on the machine I am working on. However, I fixed it in my case by reading the id directly as a string from the data source.
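The fix described above might look roughly like the sketch below. The thread does not say what the data source was, so a CSV source is an assumption here, and the inline text stands in for the real file. The dtype argument of pandas.read_csv asks pandas to keep case_id as text rather than parsing it as an integer.

```python
import io

import pandas as pd

# Assumed CSV source; the inline text is a stand-in for the real data file.
csv_text = (
    "case_id,activity,event_ts\n"
    "114612950671529,activity1,1645703155\n"
    "141116910633456,activity2,1601435806\n"
    "141116910633456,activity2,1601436080\n"
)

# Read the case ID as a string from the start, so no integer-to-string cast is needed later.
log_df = pd.read_csv(io.StringIO(csv_text), dtype={"case_id": str})
print(log_df["case_id"].tolist())
```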