pymc-labs / pymc-marketing

Bayesian marketing toolbox in PyMC. Media Mix (MMM), customer lifetime value (CLV), buy-till-you-die (BTYD) models and more.
https://www.pymc-marketing.io/
Apache License 2.0

Plots for CLV model datasets #343

Closed — wd60622 closed this 11 months ago

wd60622 commented 1 year ago

In conjunction with the evaluation plots initiative here, would it make sense to support some plots of the input dataset? Just to be able to take a quick glance at the datasets being used in the model.

I currently have a "Customer Exposure" plot that might be a candidate here.

I like this plot because it quickly shows that many customers have likely stopped making purchases.

It also visually shows the definitions of "recency" and "T". I often find the "recency" definition confusing to those not used to these models, and this might provide some clarity on the required format.

Screenshot 2023-08-05 at 14 34 17

The code to create this plot might look like this:

import pandas as pd

from pymc_marketing.clv import plot_exposure

df_user_level: pd.DataFrame = ...

(
    df_user_level
    .sample(n=100)
    .sort_values(["recency", "T"])
    .pipe(plot_exposure)
)

To be clear, my anti-goal here would be to provide wrappers around histograms and other simple-to-make plots.
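As a sketch of what a plot_exposure helper might do (the function below is hypothetical, just illustrating the idea): draw one horizontal line per customer spanning their observation window T, with a point marking the last purchase at recency.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_exposure(df: pd.DataFrame, ax=None):
    """Hypothetical sketch: one line per customer from 0 to T, dot at recency."""
    ax = ax or plt.gca()
    for i, (recency, T) in enumerate(zip(df["recency"], df["T"])):
        ax.plot([0, T], [i, i], color="C0", lw=0.8)    # observation window
        ax.plot([recency], [i], "o", color="C1", ms=3)  # last purchase
    ax.set_xlabel("time since first purchase")
    ax.set_ylabel("customer")
    return ax


# Synthetic data standing in for a real user-level dataset
rng = np.random.default_rng(0)
T = rng.uniform(10, 40, size=100)
recency = rng.uniform(0, 1, size=100) * T  # recency is always <= T
df = pd.DataFrame({"recency": recency, "T": T}).sort_values(["recency", "T"])
ax = plot_exposure(df)
```

Sorting by ("recency", "T") before plotting is what produces the stacked, staircase-like appearance in the screenshot.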

ricardoV94 commented 1 year ago

It would make sense imo. This is about pre-packaged models for specific types of data so anything that helps with the workflow sounds useful.

I would only pay attention to keeping them dumb and try not to implement similar things multiple times.

wd60622 commented 1 year ago

Totally, @ricardoV94

I'll get started with this type of plot first then, and we can discuss in the PR if it gets repetitive.

ColtAllen commented 1 year ago

I like this plot idea a lot! Have you done it with large datasets? Many people doing BTYD modeling work with datasets involving millions of customers, so I'm curious how this would scale. Some of the other plotting functions take more time to compile than to fit a model (at least for MAP fits).

wd60622 commented 1 year ago

Sampling is obviously one approach to get a sense of the data. Maybe there is a spin on this plot that takes histogrammed data as input.

For instance,

data = [
    (0, 1, 10),
    (0, 2, 30),
    ...,
]
df_hist = pd.DataFrame(data, columns=["recency", "T", "count"])

Maybe a helper to make this binned set would be useful as well.

They might have to be rectangles instead of lines, and then no points. The x-axis would be the same, and the y-axis could be the cumulative percentage or the value of the count column.
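A helper to produce that binned set could just floor both columns to a bin width and group by the result (bin_exposure is a hypothetical name for illustration):

```python
import numpy as np
import pandas as pd


def bin_exposure(df: pd.DataFrame, bin_width: float = 3.0) -> pd.DataFrame:
    """Hypothetical helper: bin (recency, T) pairs into counts."""
    binned = df[["recency", "T"]].apply(
        lambda s: bin_width * np.floor(s / bin_width)
    )
    return (
        binned.groupby(["recency", "T"])
        .size()
        .rename("count")
        .reset_index()
    )


# Synthetic user-level data
rng = np.random.default_rng(1)
T = rng.uniform(0, 40, size=10_000)
df = pd.DataFrame({"recency": rng.uniform(0, 1, size=10_000) * T, "T": T})

df_hist = bin_exposure(df, bin_width=3.0)
```

The resulting frame has the ("recency", "T", "count") shape sketched above, and its row count depends only on the number of occupied bins, not the number of customers.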

wd60622 commented 1 year ago

Maybe there's a way to fit with that data too? Weighting the likelihood by counts. Think that'd be cool
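For what it's worth, weighting the log-likelihood by counts gives exactly the same total logp as repeating each unique row once per customer. A quick check using a Poisson as a stand-in likelihood (the real BTYD likelihood would be substituted in practice):

```python
import numpy as np
from scipy import stats

# Unique observed values and how many customers share each one
values = np.array([0, 1, 2, 5])
counts = np.array([395, 18, 25, 3])

# Customer-level data: each value repeated once per customer
expanded = np.repeat(values, counts)

lam = 1.7  # an arbitrary parameter value
logp_expanded = stats.poisson.logpmf(expanded, lam).sum()
logp_weighted = (counts * stats.poisson.logpmf(values, lam)).sum()

assert np.isclose(logp_expanded, logp_weighted)
```

Since the two logp expressions are identical for any parameter value, fitting on the binned data with count weights should recover the same optimum as fitting on the full customer-level data.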

wd60622 commented 1 year ago

Had something like this in mind. This is created from the whole dataset instead of just a sample of 100 customers.

Screenshot 2023-08-06 at 11 23 13

The data will be binned before plotting, so the bin size could be customizable.

Screenshot 2023-08-06 at 11 26 27

The y-axis could support percentages as well as the counts shown above.

wd60622 commented 1 year ago

The size of the input data can be drastically reduced, depending on the bin size:


    recency     T  count
0       0.0  27.0    395
1       0.0  30.0    411
2       0.0  33.0    404
3       0.0  36.0    319
4       3.0  27.0     18
5       3.0  30.0     22
6       3.0  33.0     21
7       3.0  36.0     22
8       6.0  27.0     25
..       ...   ...    ...

[45 rows x 3 columns]

ColtAllen commented 1 year ago

> Maybe there's a way to fit with that data too? Weighting the likelihood by counts. Think that'd be cool

lifetimes does this prior to model fitting. As a utility function this could improve the performance of MAP fits, but I don't know if it's viable for MCMC sampling.

wd60622 commented 1 year ago

> > Maybe there's a way to fit with that data too? Weighting the likelihood by counts. Think that'd be cool
>
> lifetimes does this prior to model fitting. As a utility function this could improve the performance of MAP fits, but I don't know if it's viable for MCMC sampling.

Are you talking about the use of the weights here? https://github.com/CamDavidsonPilon/lifetimes/blob/41e394923ad72b17b5da93e88cfabab43f51abe2/lifetimes/fitters/beta_geo_fitter.py#L189-L197

ColtAllen commented 1 year ago

> Are you talking about the use of the weights here? https://github.com/CamDavidsonPilon/lifetimes/blob/41e394923ad72b17b5da93e88cfabab43f51abe2/lifetimes/fitters/beta_geo_fitter.py#L189-L197

Yes; both lifetimes and pymc.find_MAP() call scipy.optimize.minimize under the hood, but lifetimes applies these weights to the logp expression before passing it to the optimizer.

@ricardoV94 correct me if I'm wrong, but it seems sample-weights functionality would require changes within pymc, and the fact that Customer ID is an xarray dimension will inhibit this regardless.

Even for datasets with millions of customers, find_MAP usually finishes within a minute unless there are data quality issues impeding convergence. With this in mind, we should probably keep this weights/counts function within the plotting module. It might also be useful for the matrix plots.

wd60622 commented 1 year ago

pymc-marketing doesn't seem to support weights in the likelihood, at least not after the switch to the dataframe as the init argument.

Regardless, it was just a passing comment. But I believe it should be possible even with MCMC, as it is just a modified likelihood in a Potential.

I can open another issue if it might be within the scope of future functionality