pymc-labs / pymc-marketing

Bayesian marketing toolbox in PyMC. Media Mix (MMM), customer lifetime value (CLV), buy-till-you-die (BTYD) models and more.
https://www.pymc-marketing.io/
Apache License 2.0

Consider using "penalized" priors by default in CLV models #99

Closed · ricardoV94 closed 1 year ago

ricardoV94 commented 1 year ago

In conversation with @ColtAllen, I realized that the reason `find_MAP` gives worse results than `lifetimes` might be that `lifetimes` is often doing penalized MLE. We should investigate whether we can add better priors so that `find_MAP` is more competitive with the old `lifetimes` results in terms of stability/speed.

This is important as we will probably want to provide users access to MAP/MLE fits.
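
To make the idea concrete, here's a minimal sketch with a toy exponential model (not one of the CLV likelihoods, and the Weibull prior is just an illustrative choice): with a flat prior, `find_MAP` reduces to plain MLE, while a weakly informative prior acts like a penalty term and can stabilize the optimum on small data.

```python
import numpy as np
import pymc as pm

x = np.random.default_rng(1).exponential(scale=2.0, size=10)  # small toy dataset

with pm.Model():
    lam_flat = pm.HalfFlat("lam")  # flat prior: the MAP point is the MLE
    pm.Exponential("obs", lam=lam_flat, observed=x)
    mle_like_fit = pm.find_MAP()

with pm.Model():
    lam = pm.Weibull("lam", alpha=2.0, beta=1.0)  # weakly informative ("penalized") prior
    pm.Exponential("obs", lam=lam, observed=x)
    penalized_fit = pm.find_MAP()
```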

ricardoV94 commented 1 year ago

About the API: should we rename the current `model.fit()` to `model.sample()` and leave `fit()` for the MLE/MAP approach? Or give the latter a separate name, `fit_MAP()`?

twiecki commented 1 year ago

I would go with `fit_MAP()`, just because sampling will be our default 95% of the time, and I think `.fit()` is what people would expect as the main fitting method.

larryshamalama commented 1 year ago

Would `model.sample()` be confused with `pm.sample()`, even though we're calling the latter implicitly?

ricardoV94 commented 1 year ago

> Would `model.sample()` be confused with `pm.sample()`, even though we're calling the latter implicitly?

I don't see a problem with that

ColtAllen commented 1 year ago

> I would go with `fit_MAP()`, just because sampling will be our default 95% of the time, and I think `.fit()` is what people would expect as the main fitting method.

This gets my vote as well. When in doubt, go with the scikit-learn conventions.

I do wonder if MAP warrants its own separate fitting method, or if it could be specified as a parameter in `.fit()` instead.

twiecki commented 1 year ago

> I do wonder if MAP warrants its own separate fitting method, or if it could be specified as a parameter in `.fit()` instead.

The only thing I dislike about this approach is that the return type will be different depending on the input argument.
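
A hypothetical sketch of that parameter-based dispatch (all names invented here) makes the concern concrete: the two branches return different types.

```python
import pymc as pm

class CLVModel:
    """Hypothetical base class; only the fit dispatch is sketched."""

    model: pm.Model

    def fit(self, method: str = "mcmc", **kwargs):
        if method == "map":
            # returns a plain dict of point estimates
            return pm.find_MAP(model=self.model, **kwargs)
        # returns an arviz.InferenceData with posterior samples
        return pm.sample(model=self.model, **kwargs)
```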

ricardoV94 commented 1 year ago

We can convert the MAP result to an xarray object so it works seamlessly with other methods that expect a posterior array. It would just have a single chain and draw.
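
A minimal sketch of that wrapping (toy model; the dict handling below assumes scalar parameters):

```python
import arviz as az
import numpy as np
import pymc as pm
import xarray as xr

with pm.Model():
    theta = pm.Beta("theta", alpha=2, beta=2)
    pm.Binomial("obs", n=10, p=theta, observed=7)
    map_point = pm.find_MAP(include_transformed=False)

# Give each scalar point estimate singleton (chain, draw) dims so it
# behaves like a posterior for downstream arviz/xarray code.
posterior = xr.Dataset(
    {k: (("chain", "draw"), np.asarray(v)[None, None]) for k, v in map_point.items()}
)
idata = az.InferenceData(posterior=posterior)
```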

ColtAllen commented 1 year ago

> In conversation with @ColtAllen, I realized that the reason `find_MAP` gives worse results than `lifetimes` might be that `lifetimes` is often doing penalized MLE. We should investigate whether we can add better priors so that `find_MAP` is more competitive with the old `lifetimes` results in terms of stability/speed.
>
> This is important as we will probably want to provide users access to MAP/MLE fits.

When working with larger datasets (over a million customers), I've noticed that MAP and MLE give identical results. The smaller the dataset, however, the more informative the prior needs to be. I've had good luck with Weibull priors.

I've been reconsidering the Datasets PR I submitted: although the CDNOW_sample dataset is the benchmark used in research, it's only about 2.3k customers, and I can't get MAP to match MLE with a dataset that small. MLE seems to be the most reliable approach for small datasets.

I just tested the BG/NBD model on the full CDNOW_master dataset of 23,570 customers. Here's how the parameters compare for MAP vs. MLE (the latter via `btyd.BetaGeoFitter`, fitted with 23,570 subjects):

| Parameter | MAP | MLE |
| --- | --- | --- |
| `a` | 0.48014318 | 0.4835169148279603 |
| `alpha` | 41.74321353 | 41.905447622667 |
| `b` | 2.08961333 | 2.111608102582983 |
| `r` | 0.26666963 | 0.267127774557681 |

Here are the priors I used:

*(image: `betageo_graph`, the model graph)*

"alpha_prior_alpha": 1.0,
"alpha_prior_beta": 6.0,
"r_prior_alpha": 1.0,
"r_prior_beta": 1.0,
"phi_prior_lower": 0.0,
"phi_prior_upper": 1.0,
"kappa_prior_alpha": 1.0,
"kappa_prior_m": 1.5,
ColtAllen commented 1 year ago

The penalizer in Lifetimes applies an L2-esque parameter shrinkage. Here's a snippet of the code in question:

```python
penalizer_term = penalizer_coef * sum(np.asarray(params) ** 2)

regularized_log_like = log_likelihood.sum() / (1 + penalizer_term)
```

I've noticed this almost always improves performance when predicting expected future transactions, but it also almost always makes expected monetary value predictions worse for the Gamma-Gamma model.

ColtAllen commented 1 year ago

Thinking more on this, I've realized it's fallacious to assume that MLE represents the ground truth for small datasets. If MAP and MLE are yielding identical results with sufficiently large datasets, I'm not sure adding MLE to this library is a worthwhile effort.

Adding more CLV model distribution classes would enable further evaluation of prior recommendations for small datasets, and would suggest the volume of customers at which prior selection becomes trivial. These findings would be a valuable addition to the docs, in my mind.

As for parameter shrinkage, it would have to be applied during training iterations. Is this doable in PyMC?
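
For what it's worth, a Lifetimes-style penalty can be attached to a PyMC model via `pm.Potential`, which adds an arbitrary term to the model's log-probability and therefore applies during both `find_MAP` optimization and sampling. A minimal sketch with a toy likelihood (all names and values hypothetical):

```python
import numpy as np
import pymc as pm

penalizer_coef = 0.01                  # hypothetical L2 coefficient
frequency = np.array([0, 1, 3, 0, 2])  # toy repeat-purchase counts

with pm.Model():
    r = pm.HalfNormal("r", sigma=10)
    alpha = pm.HalfNormal("alpha", sigma=10)
    pm.NegativeBinomial("obs", mu=r / alpha, alpha=r, observed=frequency)
    # Potential adds this term to the joint log-probability, so the negative
    # L2 penalty shrinks r and alpha toward zero during fitting:
    pm.Potential("l2_penalty", -penalizer_coef * (r**2 + alpha**2))
    map_fit = pm.find_MAP()
```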

twiecki commented 1 year ago

I agree that we shouldn't add MLE methods; MLE with penalty terms is mathematically equivalent to MAP with the corresponding priors, so the two should give identical results. We could also explore adding something like pathfinder. More CLV distributions sounds like a good direction to go in.
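
To make the equivalence concrete, a toy sketch (a 1-D Gaussian model, not one of the CLV likelihoods): an L2 penalty `lam * theta**2` on the negative log-likelihood corresponds to a zero-centered Normal prior with `sigma = 1 / sqrt(2 * lam)`, so the penalized MLE and the MAP estimate coincide.

```python
import numpy as np
import pymc as pm
from scipy.optimize import minimize_scalar

data = np.random.default_rng(0).normal(loc=1.0, size=20)
lam = 0.5  # hypothetical penalizer coefficient

# Penalized MLE: Gaussian NLL (sigma=1, up to constants) plus lam * theta**2
penalized_mle = minimize_scalar(
    lambda theta: 0.5 * np.sum((data - theta) ** 2) + lam * theta**2
).x

# MAP with the matching prior theta ~ Normal(0, 1 / sqrt(2 * lam))
with pm.Model():
    theta = pm.Normal("theta", mu=0.0, sigma=1.0 / np.sqrt(2.0 * lam))
    pm.Normal("obs", mu=theta, sigma=1.0, observed=data)
    map_estimate = pm.find_MAP()["theta"]

print(penalized_mle, map_estimate)  # agree up to optimizer tolerance
```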

ColtAllen commented 1 year ago

> We could also explore adding something like pathfinder.

I'm curious - what is pathfinder?

twiecki commented 1 year ago

https://www.pymc.io/projects/examples/en/latest/variational_inference/pathfinder.html
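
(For later readers: at the time of this thread, Pathfinder lived in `pymc-experimental`, and the linked notebook fits it roughly like the sketch below; the API may have moved since.)

```python
import pymc as pm
import pymc_experimental as pmx

with pm.Model() as model:
    theta = pm.Normal("theta", 0.0, 1.0)
    pm.Normal("obs", mu=theta, sigma=1.0, observed=[0.5, 1.2, 0.3])
    # Pathfinder: a fast variational approximation that follows the
    # optimizer's trajectory toward the posterior typical set
    idata = pmx.fit(method="pathfinder")
```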