pymc-labs / pymc-marketing

Bayesian marketing toolbox in PyMC. Media Mix (MMM), customer lifetime value (CLV), buy-till-you-die (BTYD) models and more.
https://www.pymc-marketing.io/
Apache License 2.0
683 stars, 190 forks

`clv.utils.customer_lifetime_value` generates one row for every chain/draw/customer_id permutation instead of 1 row per customer_id #374

Closed: rtol5 closed this issue 10 months ago

rtol5 commented 1 year ago

As the title says, clv.utils.customer_lifetime_value returns an unmanageably large dataset for me. For a sample dataset with 25k users, the output has 100 million rows.

Below are the key steps I ran, where clv_df_all is my full dataset and clv_df_freq1 filters that dataset down to users with frequency>0:

import pandas as pd
import pymc_marketing.clv as clv

# Fit the transaction model on customers with frequency > 0
bgm = clv.BetaGeoModel(data=clv_df_freq1)
bgm.build_model()
bgm.fit()

# Compute CLV for every customer in the full dataset
user_ltvs = clv.utils.customer_lifetime_value(
    transaction_model=bgm,
    customer_id=clv_df_all['customer_id'],
    frequency=clv_df_all['frequency'],
    recency=clv_df_all['recency'],
    T=clv_df_all['T'],
    monetary_value=clv_df_all['monetary_value'],
    time=12,
    discount_rate=0.01,
    freq='D',
)

Is this expected behavior (i.e. am I missing a post-processing step)? Or is this unintended behavior?

ricardoV94 commented 1 year ago

Yes, we need to add a "thin" option to these methods
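For a sense of where a figure like 100 million rows can come from, here is a back-of-the-envelope sketch. It assumes the fit used 4 chains of 1,000 draws each; the issue does not state the sampler settings, so treat those numbers as an illustration. The helper name `long_format_rows` is hypothetical and not part of pymc-marketing.

```python
def long_format_rows(n_chains: int, n_draws: int, n_customers: int) -> int:
    """Rows in a long table with one row per (chain, draw, customer_id)."""
    return n_chains * n_draws * n_customers


# Assumed sampler settings (not stated in the issue): 4 chains x 1,000 draws.
rows = long_format_rows(n_chains=4, n_draws=1_000, n_customers=25_000)
print(f"{rows:,}")  # 100,000,000
```

Under these assumptions, the row count scales linearly with the number of posterior draws kept, which is why dropping draws shrinks the output.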

rtol5 commented 1 year ago

Thanks @ricardoV94. I'd love to help with a pull request if I knew how to, but unfortunately I don't.

Is there anything else I can do to help with this? This feature would be really helpful for us.

Alternatively, if there's another way to generate CLVs with pymc_marketing.clv, I'd love to try that. I'm running into this issue with the "main" method outlined in one of the tutorial notebooks.

rtol5 commented 1 year ago

Actually, reading through the tutorial notebooks again, I see that this section https://www.pymc-marketing.io/en/stable/notebooks/clv/clv_quickstart.html#ranking-customers-from-best-to-worst generates a "thin" output with num_purchases.mean(("chain", "draw")).values.

Just confirming: I should be able to do the same with clv.utils.customer_lifetime_value to get my intended output, right?
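To make the idea concrete, here is a minimal sketch of what averaging over the chain and draw dimensions does. It uses plain Python lists with made-up numbers instead of the library's xarray output, and the helper name `mean_over_chain_and_draw` is hypothetical. On an xarray object, the quickstart expresses the same reduction as `.mean(("chain", "draw"))`.

```python
from statistics import fmean

# Toy posterior samples indexed as samples[chain][draw][customer].
# The values are made up for illustration only.
samples = [
    [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]],  # chain 0: 3 draws x 2 customers
    [[3.0, 30.0], [4.0, 40.0], [5.0, 50.0]],  # chain 1: 3 draws x 2 customers
]
customer_ids = ["a", "b"]


def mean_over_chain_and_draw(samples):
    """Collapse the chain and draw axes, leaving one mean per customer."""
    n_customers = len(samples[0][0])
    return [
        fmean(draw[i] for chain in samples for draw in chain)
        for i in range(n_customers)
    ]


per_customer = dict(zip(customer_ids, mean_over_chain_and_draw(samples)))
print(per_customer)  # {'a': 3.0, 'b': 30.0}
```

The result has one entry per customer. The trade-off is that a point summary such as the mean discards the spread of the posterior, so uncertainty intervals need to be computed before collapsing the draws.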

ricardoV94 commented 1 year ago

Instead of adding a "thin" option to every method, I decided to add it to the model itself. A user then gets back a model with the thinned dataset and can call whatever methods they want on it, without destroying the full dataset of the original model.

tomthepeach commented 11 months ago

Hey ricardo, please could you demonstrate in a few lines how this functionality would be used?

ricardoV94 commented 11 months ago

@tomthepeach it will look something like:

# Thin each fitted model's posterior before computing CLV
fitted_gg_thinned = fitted_gg.thin_fit_result(keep_every=10)
fitted_bg_thinned = fitted_bg.thin_fit_result(keep_every=10)

# `t` is the per-customer summary data, indexed by customer id
ggf_clv_thinned = fitted_gg_thinned.expected_customer_lifetime_value(
    transaction_model=fitted_bg_thinned,
    customer_id=t.index,
    frequency=t["frequency"],
    recency=t["recency"],
    T=t["T"],
    mean_transaction_value=t["monetary_value"],
)

You could sample fewer draws from the get-go when calling model.fit, but usually you want enough to at least check convergence.
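As a rough illustration of the general idea behind thinning (not the library's implementation of `thin_fit_result`), the sketch below keeps every k-th draw from each chain using plain Python slicing. The helper name `thin_draws` and the toy data are hypothetical.

```python
def thin_draws(chains, keep_every):
    """Keep every `keep_every`-th draw from each chain (hypothetical helper)."""
    if keep_every < 1:
        raise ValueError("keep_every must be a positive integer")
    return [chain[::keep_every] for chain in chains]


# Toy example: 2 chains x 1,000 draws, where each "draw" is just its index.
chains = [list(range(1_000)) for _ in range(2)]
thinned = thin_draws(chains, keep_every=10)

print(len(thinned[0]))  # 100
print(thinned[0][:3])   # [0, 10, 20]
```

With `keep_every=10`, any output that has one row per chain, draw, and customer shrinks by a factor of ten, while the full set of draws stays available for convergence checks on the original object.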

tomthepeach commented 11 months ago

Looks good! Would be awesome to get this merged into main, perhaps this should be the default/recommended approach? I'm not sure what the utility is for the current implementation

ricardoV94 commented 11 months ago

> Looks good! Would be awesome to get this merged into main, perhaps this should be the default/recommended approach? I'm not sure what the utility is for the current implementation

Thinning loses information, so we shouldn't do it by default. It's up to the user to decide whether they will no longer need all the draws, and that depends very much on their workflow. Hopefully this makes that decision easier.

Thanks for flagging that this feature is relevant for you. We'll try to get it merged soon; I think there were still some tests failing. Follow the PR to stay up to date!