pymc-labs / pymc-marketing

Bayesian marketing toolbox in PyMC. Media Mix (MMM), customer lifetime value (CLV), buy-till-you-die (BTYD) models and more.
https://www.pymc-marketing.io/
Apache License 2.0
683 stars 190 forks source link

bug in clv.utils.rfm_train_test_split() #688

Closed jdsemeyn closed 4 months ago

jdsemeyn commented 5 months ago

When attempting to use the clv.utils.rfm_train_test_split() function, I ran into the following error: KeyError: 'customer_id' After testing it again with publicly available data (the online retail data from uci), I got the same error. Looking at the code, I believe it is being introduced by lines 565-567: test_rfm_data = test_rfm_data.rename( columns={"id": "customer_id", "date": "test_frequency"} ) where "id" and "date" should instead be customer_id_col and datetime_col. As is, if your id col and date col are not named exactly "id" and "date", they will not be renamed properly, which causes the error when the code then attempts to merge with the monetary value (lines 579-584). If a monetary_value is not provided, then the error occurs later in lines 589-591 when it tries to merge back to the training_rfm_data.

How to reproduce error: use clv.utils.rfm_train_test_split() with dataset that does not have the customer_id column explicitly named "customer_id" and the date column not explicitly named "date".

wd60622 commented 5 months ago

Hi @jdsemeyn,

Thank you for raising this issue. Are you able to get a work around for this issue with the latest version of the package?

Would you be interested in making this fix in a PR? It may be good to raise an error earlier on if the required columns don't exist!

jdsemeyn commented 5 months ago

@wd60622 I can work around it by just setting the column names in my data to the ones used by the function.

I'll look into doing a PR, it should be an easy fix. Not sure that there needs to be separate checking to see if the required columns don't exist since the function requires specifying the intended columns, so would throw an error anyway if those columns don't exist.

wd60622 commented 5 months ago

I'll look into doing a PR, it should be an easy fix. Not sure that there needs to be separate checking to see if the required columns don't exist since the function requires specifying the intended columns, so would throw an error anyway if those columns don't exist.

Great! I just suggest that to fail earlier and to provide a less cryptic error message

I am rereading your original message and see that you are using a 3rd party data set. Is there a renaming requirement for the datasets that pymc-marketing provides? reference

jdsemeyn commented 5 months ago

The example dataset used for clv, cdnow_transactions, has the id column as "id" and the date column as "date", which is likely why the issue wasn't caught before; however, the function should be able to accept column names that don't exactly match "id" and "date" as it has you specify which column is the id column and which column is the date column and handles the renaming. A similar function existed and worked as expected with 3rd party data without renaming in both the lifetimes library and the btyd library. Additionally, the rfm_summary() function has similar inputs and works as expected with 3rd party data, again without having to rename the columns prior to using the function.

wd60622 commented 5 months ago

Are you able to use the customer_id_col and datetime_col arguments of the rfm_train_test_split function? Don't these parameters accommodate your needs?

jdsemeyn commented 5 months ago

Those parameters should accommodate my needs, and that's the crux of the issue. The function tries to rename a column explicitly named "id" and a column explicitly named "date" rather than the inputs customer_id_col and datetime_col, which then causes the KeyError. An error won't be raised when trying to rename the nonexistent "id" and "date" columns because the pandas .rename doesn't care if the things it is trying to rename exist or not. The error is caused by the following code, which is lines 565-567 in clv utils: test_rfm_data = test_rfm_data.rename(columns={"id": "customer_id", "date": "test_frequency"})

wd60622 commented 5 months ago

Understood! thanks for catching me up to speed

If you creating a PR, we will need a test case which seems to be missing at the moment