Open ColtAllen opened 1 year ago
We should add those variants to generate data. Do you want to assign this issue to yourself?
I consider this a prerequisite for https://github.com/pymc-labs/pymc-marketing/issues/127, so I'll get started on adding a ParetoNBD distribution sometime next week.
The existing distributions should also be revised to reflect being written with the BG/NBD model in mind. I can also look into vectorization for `sim_data`.
I'll be creating a PR for this soon. @larryshamalama I see you made the original commit for the `ContContract` distribution class. Is there a research citation you can provide for me to add to the docstring?
> I'll be creating a PR for this soon. @larryshamalama I see you made the original commit for the `ContContract` distribution class. Is there a research citation you can provide for me to add to the docstring?
Sorry, this message slipped my attention. I did not use any research articles, since I did not find any when I was writing out the likelihood... Perhaps there is one out there that I'm unaware of. I can write out the likelihood derivation again if it would help.
@larryshamalama Let's refactor `ContNonContract` and `ContContract` into `BetaGeoNBD` and `BetaGeoNBDAggregate`, respectively, so we can close this out:
- [ ] Update docstring of `ContContract`/`BetaGeoNBDAggregate`
- [ ] If we change the `logp` in `ContContract`/`BetaGeoNBDAggregate` per https://github.com/pymc-labs/pymc-marketing/issues/98, we can close that issue as well.
- [ ] The `T0` param should be removed from both distribution classes because it will left-censor customer data. Functionality to select study start times would be a good addition to `utils.clv_summary()` if you want to create an issue for it.
- [ ] The RVs for both classes will have identical `sim_data` methods, which should be refactored like so:
```python
import numpy as np

rng = np.random.default_rng()

def sim_data(lam, p, T):
    t = 0.0    # recency: time of last repeat purchase
    n = 0      # frequency: number of repeat purchases
    churn = 0  # BG/NBD assumes all non-repeat customers are active
    wait = rng.exponential(scale=1 / lam)
    while t + wait < T and not churn:
        n += 1
        churn = rng.binomial(n=1, p=p)
        t += wait
        wait = rng.exponential(scale=1 / lam)
    return np.array([t, n])
```
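As an aside on the vectorization idea mentioned earlier in the thread, one way a batched version of this loop could look is sketched below. This is illustrative only, assuming NumPy's `Generator` API; the `max_purchases` truncation parameter is my own device for pre-drawing waits, not part of the existing code:

```python
import numpy as np

rng = np.random.default_rng(123)

def sim_data_vectorized(lam, p, T, max_purchases=200):
    """Draw (recency, frequency) for many customers at once.

    Pre-draws up to ``max_purchases`` exponential waits and churn
    coin-flips per customer, then masks out purchases falling after
    the study period T or after the customer has churned. Intended
    to match the scalar loop semantics; illustrative sketch only.
    """
    lam = np.atleast_1d(lam)
    p = np.atleast_1d(p)
    N = lam.shape[0]
    waits = rng.exponential(scale=1 / lam[:, None], size=(N, max_purchases))
    arrivals = np.cumsum(waits, axis=1)  # candidate purchase times
    churned = rng.binomial(n=1, p=p[:, None], size=(N, max_purchases))
    # The churn flip after purchase k cancels purchases k+1, k+2, ...
    alive = np.cumprod(1 - churned, axis=1)
    alive = np.concatenate([np.ones((N, 1), dtype=int), alive[:, :-1]], axis=1)
    active = (arrivals < T) & (alive == 1)
    freq = active.sum(axis=1)
    # Recency is the latest active purchase time (0.0 if no repeats)
    recency = np.max(np.where(active, arrivals, 0.0), axis=1)
    return np.stack([recency, freq], axis=1)

out = sim_data_vectorized(lam=np.array([1.0, 2.0]), p=np.array([0.3, 0.5]), T=10.0)
```

The trade-off is memory for speed: churn-terminated while loops don't vectorize directly, so this draws a fixed-size matrix of waits and masks rather than iterating per customer.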
Sounds like a good plan, thanks for laying out a bullet-point action plan. I'm away until early March; we can chat once I'm back at work.
> @larryshamalama Let's refactor `ContNonContract` and `ContContract` into `BetaGeoNBD` and `BetaGeoNBDAggregate`, respectively, so we can close this out:
Hi @ColtAllen, I am just getting back to work and browsing the current progress that has been made. My understanding is that your focus is, for now, #177 and #176. I can start with #98 and we can see from there. How does that sound?
The reason why I initially added `ContContract` and `ContNonContract` is because I felt like those were the primary building blocks for continuous CLVs. In other words, my understanding (at the time) was that models, including BG/NBD, stem from having the same or a very similar data-generating process but marginalizing over different priors. Admittedly, how useful these distribution classes will be is unclear to me. We can converse about these ideas some time soon. What do you think?
Edit: Re-reading your original comment in opening this issue, I see where you are coming from. I'm still wondering if there's a better way of generalizing model building blocks and making them robust for all (or at least most) model types.
> Hi @ColtAllen, I am just getting back to work and browsing the current progress that has been made. My understanding is that your focus is, for now, #177 and #176. I can start with #98 and we can see from there. How does that sound?
Sounds great 👍
> Edit: Re-reading your original comment in opening this issue, I see where you are coming from. I'm still wondering if there's a better way of generalizing model building blocks and making them robust for all (or at least most) model types.
My main interest in model-specific distribution blocks is for use within the model like I'm doing in https://github.com/pymc-labs/pymc-marketing/pull/177, unlocking additional functionality. That said, it could be interesting to test how well the ParetoNBD model converges on data generated from a BG/NBD process, and vice-versa. Alternatively, we don't have a means of generating raw transaction data yet, so if there isn't interest in adding an individual-level BG/NBD model, that could be a better way to repurpose that particular distribution block.
Shall we modify the building blocks to be specific to CLV models? E.g. `BGNBDRV` akin to `ParetoNBD`. This would entail reworking the `ContContract` and `ContNonContract` that we currently have.

IIRC, we opted against this because all we needed was the `logp` method, which could be provided via `pm.Potential`. Adding these as distribution classes would have `rng_fn`s available for use. What do people think?
> Shall we modify the building blocks to be specific to CLV models? E.g. `BGNBDRV` akin to `ParetoNBD`. This would entail reworking the `ContContract` and `ContNonContract` that we currently have.
>
> IIRC, we opted against this because all we needed was the `logp` method, which could be provided via `pm.Potential`. Adding these as distribution classes would have `rng_fn`s available for use. What do people think?
@larryshamalama let's rework `ContNonContract` into a distribution block for raw transaction data, because the other two blocks generate data in recency/frequency summary format. You can work off of the corresponding `lifetimes` function here:
https://github.com/ColtAllen/btyd/blob/main/btyd/generate_data.py#L75
The reason I suggest this is because, if you recall our last weekly project meeting, @twiecki wants all `lifetimes` functionality in this notebook added to `pymc-marketing`:
https://github.com/ColtAllen/marketing-case-study/blob/main/case-study.ipynb
And in time, the notebook itself added to the docs. The first thing we need is a raw transaction block to generate the synthetic data.
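For illustration, a minimal individual-level transaction generator under BG/NBD-style assumptions might look like the sketch below. This is not the `btyd` implementation linked above (which handles dropout, covariates, and dataframe output); the function and parameter names are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(7)

def transaction_times(lam, p, T):
    """Raw purchase timestamps for one customer: exponential
    inter-purchase waits at rate lam, with dropout probability p
    flipped after each repeat purchase (BG/NBD-style). Sketch only;
    not the btyd implementation."""
    times = [0.0]  # initial purchase at t = 0
    t = rng.exponential(scale=1 / lam)
    while t < T:
        times.append(t)  # repeat purchase observed
        if rng.binomial(n=1, p=p):  # customer churns after this purchase
            break
        t += rng.exponential(scale=1 / lam)
    return np.array(times)

txns = transaction_times(lam=2.0, p=0.3, T=10.0)
```

Summary statistics like recency and frequency can then be computed from the raw timestamps, rather than being simulated directly.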
We should create issues for the other `lifetimes` utility and plotting functions in that notebook as well.
I'm considering using the RVs in the CLV Distributions module to generate synthetic data for testing the Pareto/NBD model. However, after looking at the `rng_fn` for both classes, I'm concerned the RVs may not be robust across all model types, and the distribution classes could have similar pathologies.

As currently defined, the `sim_data` method in `rng_fn` uses a binomial RV within a while loop for the dropout probability. This works well for the Modified BG/NBD model, but I do not see a provision for the BG/NBD assumption that all non-repeat customers are alive with probability 1. The Pareto/NBD also does not use a binomial RV at all; instead it uses an exponential RV to draw the dropout time prior to the while loop.

The data generation functions in Lifetimes/BTYD are a useful reference:

https://github.com/ColtAllen/btyd/blob/main/btyd/generate_data.py
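To make the contrast concrete, here is a minimal sketch of the Pareto/NBD-style process described above, with the dropout time drawn up front from an exponential rather than via a per-purchase binomial flip. The parameter names `lam`/`mu` and the function itself are illustrative, not the module's API:

```python
import numpy as np

rng = np.random.default_rng(0)

def sim_pareto_nbd(lam, mu, T):
    """One customer under a Pareto/NBD-style process: the latent
    dropout time tau ~ Exponential(mu) is drawn before the purchase
    loop, unlike the per-purchase binomial flip of (M)BG/NBD.
    Illustrative sketch only."""
    tau = rng.exponential(scale=1 / mu)  # latent dropout time
    horizon = min(tau, T)                # purchasing stops at dropout or T
    t, n = 0.0, 0                        # recency, frequency
    wait = rng.exponential(scale=1 / lam)
    while t + wait < horizon:
        t += wait
        n += 1
        wait = rng.exponential(scale=1 / lam)
    return np.array([t, n])

recency, freq = sim_pareto_nbd(lam=2.0, mu=0.1, T=10.0)
```

The structural difference is exactly the concern raised above: a binomial-in-loop generator bakes in (M)BG/NBD dropout assumptions that the Pareto/NBD does not share.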