modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0

Modin can't deal with Lifetimes #4289

Open sergiocalde94 opened 2 years ago

sergiocalde94 commented 2 years ago


import modin.pandas as pd

from lifetimes.datasets import load_transaction_data
from lifetimes.utils import summary_data_from_transaction_data

transaction_data = pd.DataFrame(load_transaction_data())
print(transaction_data.head())

from lifetimes.utils import calibration_and_holdout_data

summary_cal_holdout = calibration_and_holdout_data(transaction_data, 'id', 'date',
                                        calibration_period_end='2014-09-01',
                                        observation_period_end='2014-12-31' )
print(summary_cal_holdout.head())

Modin can't deal with Lifetimes

Calling calibration_and_holdout_data from lifetimes on a Modin DataFrame crashes :( With pandas it works as expected.

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Thanks in advance for this awesome library! 🙌

sergiocalde94 commented 2 years ago

Note: the lifetimes version is 0.11.3

mvashishtha commented 2 years ago

@sergiocalde94 thank you for reporting the bug. I can reproduce it locally with the latest Modin source. Here's the full stack trace:

```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [1], in
      7 print(transaction_data.head())
      9 from lifetimes.utils import calibration_and_holdout_data
---> 11 summary_cal_holdout = calibration_and_holdout_data(
     12     transaction_data,
     13     "id",
     14     "date",
     15     calibration_period_end="2014-09-01",
     16     observation_period_end="2014-12-31",
     17 )
     18 print(summary_cal_holdout.head())

File /usr/local/lib/python3.9/site-packages/lifetimes/utils.py:97, in calibration_and_holdout_data(transactions, customer_id_col, datetime_col, calibration_period_end, observation_period_end, freq, freq_multiplier, datetime_format, monetary_value_col, include_first_transaction)
     94     transaction_cols.append(monetary_value_col)
     95 transactions = transactions[transaction_cols].copy()
---> 97 transactions[datetime_col] = pd.to_datetime(transactions[datetime_col], format=datetime_format)
     98 observation_period_end = pd.to_datetime(observation_period_end, format=datetime_format)
     99 calibration_period_end = pd.to_datetime(calibration_period_end, format=datetime_format)

File /usr/local/lib/python3.9/site-packages/pandas/core/tools/datetimes.py:1074, in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
   1072         cache_array = Series([], dtype=object)  # just an empty array
   1073 if not cache_array.empty:
-> 1074     result = _convert_and_box_cache(arg, cache_array)
   1075 else:
   1076     result = convert_listlike(arg, format)

File /usr/local/lib/python3.9/site-packages/pandas/core/tools/datetimes.py:256, in _convert_and_box_cache(arg, cache_array, name)
    239 """
    240 Convert array of dates with a cache and wrap the result in an Index.
    241
   (...)
    252 result : Index-like of converted dates
    253 """
    254 from pandas import Series
--> 256 result = Series(arg).map(cache_array)
    257 return _box_as_indexlike(result._values, utc=None, name=name)

File /usr/local/lib/python3.9/site-packages/pandas/core/series.py:417, in Series.__init__(self, data, index, dtype, name, copy, fastpath)
    415     data = data._mgr
    416 elif is_dict_like(data):
--> 417     data, index = self._init_dict(data, index, dtype)
    418     dtype = None
    419     copy = False

File /usr/local/lib/python3.9/site-packages/pandas/core/series.py:488, in Series._init_dict(self, data, index, dtype)
    484 keys: Index | tuple
    486 # Looking for NaN in dict doesn't work ({np.nan : 1}[float('nan')]
    487 # raises KeyError), so we iterate the entire dict, and align
--> 488 if data:
    489     # GH:34717, issue was using zip to extract key and values from data.
    490     # using generators in effects the performance.
    491     # Below is the new way of extracting the keys and values
    493     keys = tuple(data.keys())
    494     values = list(data.values())  # Generating list of values- faster way

File ~/modin/modin/pandas/base.py:3108, in BasePandasDataset.__nonzero__(self)
   3107 def __nonzero__(self):
-> 3108     raise ValueError(
   3109         f"The truth value of a {self.__class__.__name__} is ambiguous. "
   3110         + "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
   3111     )

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
```

We have had similar issues with other Python packages that use pandas, e.g. plotly in #3211. We'll see if there's something we can do to address this bug.
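The failure mode in the trace can be sketched without installing Modin or pandas. The example below is only an illustration under stated assumptions: AmbiguousSeries is a hypothetical stand-in for a Modin Series (dict-like, but it refuses truth-value checks), and init_from_dict_like is a hypothetical function that mimics the `if data:` check shown at series.py:488. Neither is actual pandas or Modin code.

```python
class AmbiguousSeries:
    """Hypothetical stand-in for a Modin Series: dict-like, but bool() raises."""

    def __init__(self, values):
        self._values = dict(enumerate(values))

    def keys(self):
        return self._values.keys()

    def __getitem__(self, key):
        return self._values[key]

    def __len__(self):
        return len(self._values)

    def __bool__(self):
        # Same message shape as the error reported in this issue.
        raise ValueError(
            f"The truth value of a {type(self).__name__} is ambiguous."
        )


def init_from_dict_like(data):
    """Hypothetical mimic of a constructor that treats `data` as a dict."""
    if data:  # implicitly calls data.__bool__()
        keys = list(data.keys())
        return keys, [data[k] for k in keys]
    return [], []


# A plain dict passes the truthiness check without trouble.
print(init_from_dict_like({"a": 1}))

# The stand-in object reaches the same check and raises instead.
try:
    init_from_dict_like(AmbiguousSeries(["2014-01-01"]))
except ValueError as exc:
    print(f"ValueError: {exc}")
```

The point of the sketch is that the object itself cannot tell whether the truthiness check comes from user code or from a library that mistook it for a dict, so it raises in both cases.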

tianlinzx commented 2 years ago

@mvashishtha Any progress on this?

mvashishtha commented 2 years ago

@tianlinzx unfortunately not. Whenever someone is ready to start work on this issue, someone from modin-project/modin-contributors or modin-project/modin-core should assign them to this issue.

tianlinzx commented 1 year ago

Would you please give this issue a higher priority?

vnlitvinov commented 1 year ago

Meanwhile you can do this:

from modin.utils import to_pandas

summary_cal_holdout = calibration_and_holdout_data(to_pandas(transaction_data), 'id', 'date',
                                        calibration_period_end='2014-09-01',
                                        observation_period_end='2014-12-31' )

mvashishtha commented 1 year ago

@tianlinzx I don't think there's any way to get compatibility by changing Modin. Modin objects can't figure out in what context they are being used. Workarounds like the one @vnlitvinov suggests will work for now, but they pay the performance penalty of converting Modin objects to pandas, and they require changes everywhere you call the problematic lifetimes methods.

For a more comprehensive fix, lifetimes will need to make sure that it never passes modin.pandas objects to pandas functions, and that it never mixes modin.pandas objects and pandas objects in a single function call. In the stack trace in my earlier comment, the offending line in lifetimes is transactions[datetime_col] = pd.to_datetime(transactions[datetime_col], format=datetime_format). That could be changed to something like:

import modin.pandas

col = transactions[datetime_col]
# Use the to_datetime implementation that matches the input's library.
if isinstance(col, modin.pandas.Series):
    lib = modin.pandas
else:
    lib = pd
transactions[datetime_col] = lib.to_datetime(col, format=datetime_format)
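A variation on the dispatch above avoids importing modin.pandas unless a Modin object actually shows up, so lifetimes would not need Modin as a hard dependency. This is a sketch only: namespace_for and to_datetime_like_input are hypothetical helper names, and choosing the library from the module name of the input's type is an assumption about how Modin objects are defined.

```python
import importlib


def namespace_for(obj, default="pandas"):
    """Return the import name of the dataframe library that owns `obj`.

    Assumption: Modin objects come from a module under the `modin` package.
    """
    root_package = type(obj).__module__.split(".")[0]
    return "modin.pandas" if root_package == "modin" else default


def to_datetime_like_input(col, **kwargs):
    """Call to_datetime from the same library that created `col`."""
    lib = importlib.import_module(namespace_for(col))
    return lib.to_datetime(col, **kwargs)
```

With these helpers the offending line would become `transactions[datetime_col] = to_datetime_like_input(transactions[datetime_col], format=datetime_format)`.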

I don't know where else the lifetimes library would need to be changed to support modin.pandas inputs everywhere.

@tianlinzx Would you be able to open a feature request for Modin support in the lifetimes GitHub repo?