Closed by ricardoV94 1 year ago
About API: should we rename the current `model.fit()` to `model.sample()` and leave `fit()` for the MLE/MAP approach? Or give the latter a separate name, `fit_MAP()`?
I would go with `fit_MAP()`, just because our 95% default will be sampling, and `.fit()` is what people would expect as the main fitting method.
Would `model.sample()` be confused with `pm.sample()`? Even though we're calling the latter implicitly.
> Would `model.sample()` be confused with `pm.sample()`? Even though we're calling the latter implicitly.

I don't see a problem with that.
> I would go with `fit_MAP()`, just because our 95% default will be sampling, and `.fit()` is what people would expect as the main fitting method.

This gets my vote as well. When in doubt, go with the scikit-learn conventions.
I do wonder if MAP warrants its own separate fitting method, or if it could be specified as a parameter in `.fit()` instead.
> I do wonder if MAP warrants its own separate fitting method, or if it could be specified as a parameter in `.fit()` instead.

The only thing I dislike about this approach is that the return type will be different depending on the input argument.
We can convert the MAP result to an xarray so it works seamlessly with other methods that expect a posterior array. It just has 0 chains and samples.
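To make the "MAP as a degenerate posterior" idea concrete, here's a minimal sketch, not the library's actual conversion code. The parameter names echo the BG/NBD estimates quoted later in this thread, and the size-1 `(chain, draw)` convention is an assumption: each scalar gets a leading axis so code written against posterior draws can still broadcast over it.

```python
import numpy as np

# Hypothetical MAP point estimate for a BG/NBD model (values are illustrative)
map_estimate = {"a": 0.48, "alpha": 41.74, "b": 2.09, "r": 0.27}

# Give every scalar a (chain, draw) axis of size 1 so it looks like a
# (degenerate) posterior; something like arviz.from_dict could then wrap
# this into an InferenceData object.
posterior = {name: np.asarray(value)[None, None] for name, value in map_estimate.items()}

print(posterior["a"].shape)  # (1, 1)
```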
In conversation with @ColtAllen I realized the reason `find_MAP` gives worse results than lifetimes is (might be?) that lifetimes is often doing penalized MLE. We should investigate whether we can add better priors so that `find_MAP` is more competitive with the old lifetimes results in terms of stability/speed. This is important, as we will probably want to provide users access to MAP/MLE fits.
When working with larger datasets (with over a million customers) I've noticed both MAP and MLE give identical results. However, the smaller the dataset, the more informative the prior needs to be. I've had good luck with Weibull priors.
I've been reconsidering the Datasets PR I've submitted: although the CDNOW_sample dataset is the benchmark used in research, it's only about 2.3k customers, and I can't get MAP to match MLE with a dataset that small. It seems MLE is the most reliable approach for small datasets.
I just tested the BG/NBD model with the full CDNOW_master dataset of 23,570 customers, and here's how the parameters compare with MAP vs. MLE:
```
# MAP
'a': 0.48014318,
'alpha': 41.74321353,
'b': 2.08961333,
'r': 0.26666963

# MLE
<btyd.BetaGeoFitter: fitted with 23570 subjects,
 a: 0.4835169148279603,
 alpha: 41.905447622667,
 b: 2.111608102582983,
 r: 0.267127774557681>
```
Here are the priors I used:

```
"alpha_prior_alpha": 1.0,
"alpha_prior_beta": 6.0,
"r_prior_alpha": 1.0,
"r_prior_beta": 1.0,
"phi_prior_lower": 0.0,
"phi_prior_upper": 1.0,
"kappa_prior_alpha": 1.0,
"kappa_prior_m": 1.5,
```
The penalizer in Lifetimes applies an L2-esque parameter shrinkage. Here's a snippet of the code in question:

```python
import numpy as np

# Penalty grows with the squared magnitude of the parameter vector,
# and the summed log-likelihood is scaled down by it
penalizer_term = penalizer_coef * sum(np.asarray(params) ** 2)
regularized_log_like = log_likelihood.sum() / (1 + penalizer_term)
```
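For intuition, here's a small numeric sketch of how that scaling differs from a textbook L2 penalty. The parameter values echo the BG/NBD fits above, while the per-customer log-likelihoods are made up for illustration: dividing a negative summed log-likelihood by `1 + penalty` pulls it toward zero, whereas subtracting the penalty pushes it further down.

```python
import numpy as np

params = np.array([0.48, 41.74, 2.09, 0.27])   # BG/NBD-style parameter vector
log_likelihood = np.array([-2.3, -1.7, -3.1])  # made-up per-customer log-likelihoods
penalizer_coef = 0.01

penalizer_term = penalizer_coef * np.sum(params ** 2)

# lifetimes-style "L2-esque" shrinkage: scale the summed log-likelihood
lifetimes_style = log_likelihood.sum() / (1 + penalizer_term)

# textbook L2 regularization: subtract the penalty instead
textbook_l2 = log_likelihood.sum() - penalizer_term

print(lifetimes_style > log_likelihood.sum() > textbook_l2)  # True
```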
I've noticed this almost always improves the performance when predicting estimated future transactions, but it also almost always makes expected monetary value predictions worse for the Gamma-Gamma model.
Thinking more on this, I've realized it's a fallacious assumption that MLE will represent the ground truth for small amounts of data. If MAP and MLE are yielding identical results with sufficiently large datasets, I'm not sure if adding MLE to this library is a worthwhile effort.
Adding more clv model distribution classes would enable further evaluation of prior recommendations for small datasets, and suggest the volume of customers at which prior selection becomes trivial. These findings would be a valuable addition to the docs in my mind.
As for parameter shrinkage, it'd have to be applied during training iterations. Is this doable in `pymc`?
I agree that we shouldn't add MLE methods. There is a mathematical equivalence between MLE + penalty terms and MAP + priors anyway, so the two should give identical results. We could also explore adding something like pathfinder. More CLV distributions sounds like a good direction to go in.
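That equivalence is easy to check in a toy conjugate case (all numbers here are illustrative): with a unit-variance Gaussian likelihood, the penalized-MLE solution with L2 coefficient λ matches the MAP estimate under a Normal(0, σ²) prior exactly when σ² = 1/(2λ).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(1.0, 1.0, size=50)  # toy data, unit-variance Gaussian likelihood
lam = 3.0                          # L2 penalty coefficient

# Penalized MLE: argmax over theta of sum(log N(x_i | theta, 1)) - lam * theta**2
theta_penalized = x.sum() / (len(x) + 2 * lam)

# MAP under a Normal(0, sigma^2) prior with sigma^2 = 1 / (2 * lam)
prior_var = 1.0 / (2 * lam)
theta_map = x.sum() / (len(x) + 1.0 / prior_var)

print(np.isclose(theta_penalized, theta_map))  # True
```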
> We could also explore adding something like pathfinder.

I'm curious - what is pathfinder?