Closed wd60622 closed 11 months ago
It would make sense imo. This is about pre-packaged models for specific types of data so anything that helps with the workflow sounds useful.
I would just take care to keep them dumb and avoid implementing similar things multiple times.
Totally, @ricardoV94
I'll get started with this type of plot first, then, and we can discuss in the PR if it gets repetitive.
I like this plot idea a lot! Have you done it with large datasets? Many people doing BTYD modeling work with datasets involving millions of customers, so I'm curious how this would scale. Some of the other plotting functions take more time to compile than to fit a model (at least for MAP fits).
Obviously, sampling is one approach to get a sense of the data. Maybe there is a spin on this plot that takes histogrammed data as input.
For instance,
import pandas as pd

# Each row is (recency, T, number of customers with that combination).
data = [
    (0, 1, 10),
    (0, 2, 30),
    # ...
]
df_hist = pd.DataFrame(data, columns=["recency", "T", "count"])
Maybe add a helper to build this binned set.
It might have to use rectangles instead of lines, with no points. The X axis stays the same, and the Y axis is maybe the cumulative percentage or the count from the count column.
Maybe there's a way to fit with that data too? Weighting the likelihood by counts. Think that'd be cool
Had something like this in mind. This is created from the whole data set instead of just a sample of 100 customers.
Results will be binned before plotting, so this could be customizable
Y axis could support percentages as well as the counts shown above
The size of the input data could shrink drastically, depending on the bin size
recency T count
0 0.0 27.0 395
1 0.0 30.0 411
2 0.0 33.0 404
3 0.0 36.0 319
4 3.0 27.0 18
5 3.0 30.0 22
6 3.0 33.0 21
7 3.0 36.0 22
8 6.0 27.0 25
.. ... ... ...
[45 rows x 3 columns]
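A binned table like the one above could come from a small helper along these lines. This is only a sketch: the function name `bin_recency_T` and the `bin_width` parameter are hypothetical, and the width of 3 simply mirrors the spacing visible in the table. It assumes a DataFrame with one row per customer and `recency` and `T` columns.

```python
import numpy as np
import pandas as pd


def bin_recency_T(df: pd.DataFrame, bin_width: float = 3.0) -> pd.DataFrame:
    """Hypothetical helper: count customers per (recency, T) bin.

    Each value is floored to the lower edge of its bin.
    """
    binned = pd.DataFrame(
        {
            "recency": np.floor(df["recency"] / bin_width) * bin_width,
            "T": np.floor(df["T"] / bin_width) * bin_width,
        }
    )
    # One output row per occupied bin, with the number of customers in it.
    return binned.groupby(["recency", "T"]).size().reset_index(name="count")


# Toy data, made up purely for illustration.
customers = pd.DataFrame(
    {
        "recency": [0, 0, 1, 4, 5, 7],
        "T": [27, 28, 29, 27, 31, 36],
    }
)
df_hist = bin_recency_T(customers, bin_width=3.0)
```

A percentage for the Y axis would then just be `df_hist["count"] / df_hist["count"].sum()`.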
Maybe there's a way to fit with that data too? Weighting the likelihood by counts. Think that'd be cool
lifetimes does this prior to model fitting. As a utility function this could improve the performance of MAP fits, but I don't know if it's viable for MCMC sampling.
Are you talking about the use of the weights here? https://github.com/CamDavidsonPilon/lifetimes/blob/41e394923ad72b17b5da93e88cfabab43f51abe2/lifetimes/fitters/beta_geo_fitter.py#L189-L197
Yes; both lifetimes and pymc.find_MAP() call scipy.optimize.minimize under the hood, but lifetimes applies these weights to the logp expression before it is passed to the optimizer.
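To make the weighting idea concrete, here is a minimal sketch of scaling each row's log-density by its count before handing the objective to `scipy.optimize.minimize`. It deliberately swaps in a plain exponential model instead of the BG/NBD likelihood to stay short, and the data and the `weighted_nll` name are made up for illustration. It is not the lifetimes or pymc implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import expon

# Aggregated observations: each distinct value and how many times it occurred.
values = np.array([1.0, 2.0, 4.0])
counts = np.array([10, 30, 5])


def weighted_nll(params, x, weights):
    """Negative log-likelihood with each row's log-density scaled by its weight."""
    rate = np.exp(params[0])  # optimise on the log scale to keep rate > 0
    logp = expon.logpdf(x, scale=1.0 / rate)
    return -np.sum(weights * logp)


# Fit on the compact table, weighting by counts.
fit_weighted = minimize(weighted_nll, x0=[0.0], args=(values, counts))

# Fit on the fully expanded data with unit weights, for comparison.
expanded = np.repeat(values, counts)
fit_expanded = minimize(
    weighted_nll, x0=[0.0], args=(expanded, np.ones_like(expanded))
)

rate_weighted = np.exp(fit_weighted.x[0])
rate_expanded = np.exp(fit_expanded.x[0])
```

Both fits should land on essentially the same rate, since multiplying a row's log-density by its count is equivalent to repeating that row that many times.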
@ricardoV94 correct me if I'm wrong, but it seems sample weights functionality would require changes within pymc, and the fact that Customer ID is an Xarray dimension will inhibit this regardless.
Even for datasets with millions of customers, find_MAP usually finishes within a minute unless there are data quality issues impeding convergence. With this in mind, we should probably keep this weights/counts function within the plotting module. It might also be useful for the matrix plots.
Pymc-marketing doesn't seem to support weights in the likelihood, at least not with the dataframe as the init argument.
Regardless, it was just a passing comment. But I believe it should be possible even with MCMC, as it is just a modified likelihood in the potential.
I can open another issue if it might be within the scope of future functionality
In conjunction with the evaluation plots initiative here, would it make sense to support some plots of the input data set? Just to get a quick look at the data sets being used in the model.
I currently have a "Customer Exposure" plot that might be a candidate here.
I like this plot because it quickly shows that many customers have likely stopped making purchases.
It also visually shows the definitions of "recency" and "T". I often find the "recency" definition is confusing to those not used to the models, and this might provide some clarity about the required format.
The code to create this plot might look like this:
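As a rough sketch only (not the author's actual code), a plot of this kind could be drawn with matplotlib as below. The function name `sketch_customer_exposure`, the colors, and the labels are all hypothetical. It assumes the usual convention that "recency" is the time from a customer's first purchase to their last one, and "T" is the time from the first purchase to the end of the observation period.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd


def sketch_customer_exposure(df: pd.DataFrame, ax=None):
    """Hypothetical sketch: draw one horizontal line per customer."""
    if ax is None:
        _, ax = plt.subplots()
    df = df.sort_values(["T", "recency"]).reset_index(drop=True)
    y = df.index.to_numpy()
    # Active span: first purchase (time 0) up to the last purchase (recency).
    ax.hlines(y, 0, df["recency"], colors="C0", label="first to last purchase")
    # Quiet span: last purchase up to the end of observation (T).
    ax.hlines(y, df["recency"], df["T"], colors="C1", label="last purchase to T")
    # Points mark where each customer's recency and T fall.
    ax.scatter(df["recency"], y, color="C0", s=10)
    ax.scatter(df["T"], y, color="C1", s=10)
    ax.set_xlabel("Time since first purchase")
    ax.set_ylabel("Customer")
    ax.legend()
    return ax


# Toy data, made up purely for illustration.
customers = pd.DataFrame({"recency": [0, 5, 10], "T": [20, 20, 15]})
ax = sketch_customer_exposure(customers)
```

Customers whose second segment is long relative to the first are the ones who have likely stopped purchasing, which is what makes the picture easy to read at a glance.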
To be clear, my anti-goal here would be to provide wrappers around histograms and other simple-to-make plots.