`clv_summary` Enhancements

ColtAllen commented 9 months ago

The clv_summary function is the primary data preprocessing step for BetaGeoModel, ParetoNBDModel, and GammaGammaModel. It has several shortcomings:

[x] All CLV models expect a customer_id column, but this function does not create one.
[x] This function calls pandas sort_values internally, which can exhaust memory on large datasets (say, >10M rows). I'm not aware of a viable workaround, but we can add a UserWarning and/or a sort_values parameter that skips this operation when sorting is already applied on the DB side.
[x] To perform RFM Analysis, an include_first_transactions parameter must be added. This can be adapted from the lifetimes library.
[ ] RFM analysis also has a different interpretation for recency than what is used for modeling. To reduce confusion, let's just add an additional column for this to the output DF.
[x] On that note, the function should probably just be renamed to rfm_summary.
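
For readers following the checklist, here is a minimal sketch of how the customer_id column, the optional sort, and the first-transaction flag could fit together. It is not the pymc-marketing implementation: the function name `rfm_summary_sketch`, its parameter names, the `LARGE_DATA_ROWS` threshold, and the choice to count time in days are all assumptions made for illustration.

```python
import warnings

import pandas as pd

# Assumed threshold, taken from the ">10M rows" remark above.
LARGE_DATA_ROWS = 10_000_000


def rfm_summary_sketch(
    transactions: pd.DataFrame,
    customer_id_col: str,
    datetime_col: str,
    monetary_value_col: str,
    observation_period_end: str,
    include_first_transaction: bool = False,
    sort_transactions: bool = True,
) -> pd.DataFrame:
    """Hypothetical per-customer summary; time is measured in days."""
    df = transactions[[customer_id_col, datetime_col, monetary_value_col]].copy()
    df[datetime_col] = pd.to_datetime(df[datetime_col]).dt.normalize()
    period_end = pd.Timestamp(observation_period_end)
    df = df[df[datetime_col] <= period_end]

    if sort_transactions:
        if len(df) > LARGE_DATA_ROWS:
            warnings.warn(
                "Sorting a large DataFrame may exhaust memory; consider sorting "
                "on the database side and passing sort_transactions=False.",
                UserWarning,
            )
        df = df.sort_values([customer_id_col, datetime_col])
    # From here on the rows are assumed to be ordered by customer and date.
    # With sort_transactions=False the caller is responsible for that order.

    # Collapse to one row per customer per day, preserving row order.
    daily = df.groupby(
        [customer_id_col, datetime_col], sort=False, as_index=False
    )[monetary_value_col].sum()
    # The first row seen for each customer is that customer's first purchase day.
    daily["_is_first"] = ~daily.duplicated(subset=customer_id_col)

    stats = daily.groupby(customer_id_col, sort=False)[datetime_col].agg(
        ["first", "last", "count"]
    )
    # Modeling counts repeat purchases only; RFM analysis counts all of them.
    frequency = stats["count"] if include_first_transaction else stats["count"] - 1
    recency = (stats["last"] - stats["first"]).dt.days
    T = (period_end - stats["first"]).dt.days

    spend = daily if include_first_transaction else daily[~daily["_is_first"]]
    monetary = spend.groupby(customer_id_col)[monetary_value_col].mean()
    # Customers with no qualifying purchases get a monetary value of 0.
    monetary = monetary.reindex(stats.index, fill_value=0.0)

    out = pd.DataFrame(
        {"frequency": frequency, "recency": recency, "T": T, "monetary_value": monetary}
    )
    # Always expose the identifier as a customer_id column.
    return out.rename_axis("customer_id").reset_index()
```

Because the sketch relies on row order (first and last row per customer) instead of min/max aggregations, passing `sort_transactions=False` is only safe when the input is already ordered by customer and date, for example by an `ORDER BY` in the query that produced it.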

ricardoV94 commented 8 months ago

@ColtAllen is this one done, or that extra column still needed?

ColtAllen commented 8 months ago

> @ColtAllen is this one done, or that extra column still needed?

I plan to open a PR for an rfm_segmentation utility in the near future. We can just apply the column transformation within that function, because I'd rather not add an unnecessary column to the rfm_summary output.
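
As a rough illustration of the column transformation mentioned above, the sketch below derives an RFM-style recency inside a separate helper, leaving the summary output untouched. It assumes the transformation is `T - recency` (time since the last purchase, as opposed to the modeling recency, which is the time between the first and last purchase). The function name `rfm_segmentation_sketch` and the column name `days_since_last_purchase` are hypothetical, and the planned utility may differ.

```python
import pandas as pd


def rfm_segmentation_sketch(summary: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: add an RFM-style recency without altering the input."""
    out = summary.copy()
    # Assumption: RFM recency is the time elapsed since the last purchase,
    # i.e. the customer's age T minus the modeling recency.
    out["days_since_last_purchase"] = out["T"] - out["recency"]
    return out
```

Keeping the derived column inside the segmentation step means the summary output carries only the columns the models consume.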

pymc-labs / pymc-marketing

`clv_summary` Enhancements #469