pymc-labs / pymc-marketing

Bayesian marketing toolbox in PyMC. Media Mix (MMM), customer lifetime value (CLV), buy-till-you-die (BTYD) models and more.
https://www.pymc-marketing.io/
Apache License 2.0

CLV Distribution RVs not Model-Specific #128

Open ColtAllen opened 1 year ago

ColtAllen commented 1 year ago

I'm considering using the RVs in the CLV Distributions module to generate synthetic data for testing the Pareto/NBD model. However, after looking at the rng_fn for both classes, I'm concerned the RVs may not be robust across all model types, and the distribution classes could have similar pathologies.

As currently defined, the sim_data function inside rng_fn uses a binomial RV within a while loop for the dropout probability. This works well for the Modified BG/NBD model, but I do not see a provision for the BG/NBD assumption that all non-repeat customers are alive with probability 1. The Pareto/NBD does not use a binomial RV at all; instead, it uses an exponential RV to draw the dropout time prior to the while loop.

The data generation functions in Lifetimes/BTYD are a useful reference:

https://github.com/ColtAllen/btyd/blob/main/btyd/generate_data.py
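
For contrast, here is a rough sketch (not the actual pymc-marketing code; the function name is a placeholder) of what a Pareto/NBD-style sim_data could look like, with the dropout time drawn up front rather than via a binomial coin flip inside the loop:

    import numpy as np

    rng = np.random.default_rng()

    def pareto_nbd_sim_data(lam, mu, T):
        # Pareto/NBD draws the latent dropout time tau up front from an
        # exponential with rate mu, rather than flipping a binomial coin
        # after each purchase.
        tau = rng.exponential(scale=1 / mu)
        horizon = min(tau, T)  # purchases can only occur while alive

        t = 0.0  # recency
        n = 0    # frequency
        wait = rng.exponential(scale=1 / lam)
        while t + wait < horizon:
            t += wait
            n += 1
            wait = rng.exponential(scale=1 / lam)
        return np.array([t, n])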

ricardoV94 commented 1 year ago

We should add those variants to generate data. Do you want to assign this issue to yourself?

ColtAllen commented 1 year ago

I consider this a prerequisite for https://github.com/pymc-labs/pymc-marketing/issues/127, so I'll get started on adding a ParetoNBD distribution sometime next week.

The existing distributions should also be revised to make it explicit that they were written with the BG/NBD model in mind. I can also look into vectorizing sim_data.

ColtAllen commented 1 year ago

I'll be creating a PR for this soon. @larryshamalama I see you made the original commit for the ContContract distribution class. Is there a research citation you can provide for me to add to the docstring?

larryshamalama commented 1 year ago

> I'll be creating a PR for this soon. @larryshamalama I see you made the original commit for the ContContract distribution class. Is there a research citation you can provide for me to add to the docstring?

Sorry, this message slipped my attention. I did not use any research articles, since I did not find any when I was writing out the likelihood... Perhaps there is one out there that I'm unaware of. I can write out the likelihood derivation again if it would help.

ColtAllen commented 1 year ago

@larryshamalama Let's refactor ContNonContract and ContContract into BetaGeoNBD and BetaGeoNBDAggregate, respectively, so we can close this out:

    import numpy as np

    # rng would normally be supplied by the distribution's rng_fn
    rng = np.random.default_rng()

    def sim_data(lam, p, T):
        t = 0  # recency
        n = 0  # frequency

        churn = 0  # BG/NBD assumes all non-repeat customers are active
        wait = rng.exponential(scale=1 / lam)

        while t + wait < T and not churn:
            n += 1
            churn = rng.binomial(n=1, p=p)
            t += wait
            wait = rng.exponential(scale=1 / lam)

        return np.array([t, n])
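
Until sim_data itself is vectorized, a hypothetical helper along these lines (the name is illustrative) could generate a full synthetic dataset for testing:

    def sim_customers(lam, p, T, size):
        # Naive loop-based stand-in; a vectorized sim_data would replace this.
        draws = [sim_data(lam, p, T) for _ in range(size)]
        return np.stack(draws)  # shape (size, 2): (recency, frequency) rows

    rfm_data = sim_customers(lam=2.0, p=0.3, T=10.0, size=1000)
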
larryshamalama commented 1 year ago

Sounds like a good plan; thanks for laying out a bullet-point-style action plan. I'm away until early March; we can chat once I'm back at work.

larryshamalama commented 1 year ago

> @larryshamalama Let's refactor ContNonContract and ContContract into BetaGeoNBD and BetaGeoNBDAggregate, respectively, so we can close this out:

Hi @ColtAllen, I am just getting back to work and browsing the current progress that has been made. My understanding is that your focus is, for now, #177 and #176. I can start with #98 and we can see from there. How does that sound?

The reason why I initially added ContContract and ContNonContract is that I felt they were the primary building blocks for continuous CLVs. In other words, my understanding (at the time) was that models, including BG/NBD, stem from the same or a very similar data-generating process while marginalizing over different priors. Admittedly, how useful these distribution classes will be is unclear to me. We can converse about these ideas some time soon. What do you think?

Edit: Re-reading your original comment in opening this issue, I see where you are coming from. I'm still wondering if there's a better way of generalizing model building blocks and making them robust across all (or at least most) model types.

ColtAllen commented 1 year ago

> Hi @ColtAllen, I am just getting back to work and browsing the current progress that has been made. My understanding is that your focus is, for now, #177 and #176. I can start with #98 and we can see from there. How does that sound?

Sounds great 👍

> Edit: Re-reading your original comment in opening this issue, I see where you are coming from. I'm still wondering if there's a better way of generalizing model building blocks and making them robust across all (or at least most) model types.

My main interest in model-specific distribution blocks is for use within the model itself, like I'm doing in https://github.com/pymc-labs/pymc-marketing/pull/177, which unlocks additional functionality. That said, it could be interesting to test how well the ParetoNBD model converges on data generated from a BG/NBD process, and vice versa. Even if there isn't interest in adding an individual-level BG/NBD model, we don't yet have a means of generating raw transaction data, so that could be a better way to repurpose that particular distribution block.

larryshamalama commented 1 year ago

Shall we modify the building blocks to be specific to CLV models? E.g. BGNBDRV akin to ParetoNBD. This would entail reworking ContContract and ContNonContract that we currently have.

IIRC, we opted against this because all we needed was the logp method, which could be provided via pm.Potential. Adding these as distribution classes would make rng_fns available for use. What do people think?
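
For context, a minimal sketch of the pm.Potential approach: the likelihood term below is a toy Poisson purchase-count logp, purely to show the mechanism (the real CLV logp expressions are the ones in the distribution classes under discussion):

    import numpy as np
    import pymc as pm
    import pytensor.tensor as pt

    frequency = np.array([0, 2, 5, 1, 3])  # toy purchase counts
    T = 10.0

    with pm.Model():
        lam = pm.Gamma("lam", alpha=1.0, beta=1.0)
        # Adding the log-likelihood via pm.Potential is enough for MCMC,
        # but there is no rng_fn, so no prior/posterior predictive draws.
        logp = pt.sum(
            frequency * pt.log(lam * T) - lam * T - pt.gammaln(frequency + 1)
        )
        pm.Potential("purchase_loglike", logp)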

ColtAllen commented 1 year ago

> Shall we modify the building blocks to be specific to CLV models? E.g. BGNBDRV akin to ParetoNBD. This would entail reworking ContContract and ContNonContract that we currently have.
>
> IIRC, we opted against this because all we needed was the logp method, which could be provided via pm.Potential. Adding these as distribution classes would make rng_fns available for use. What do people think?

@larryshamalama let's rework ContNonContract into a distribution block for raw transaction data, because the other two blocks generate data in recency/frequency summary format. You can work off of the corresponding lifetimes function here:

https://github.com/ColtAllen/btyd/blob/main/btyd/generate_data.py#L75
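
Something along these lines (a rough sketch under BG/NBD assumptions, not the btyd implementation; the function name is a placeholder):

    import numpy as np

    rng = np.random.default_rng()

    def sim_transactions(lam, p, T):
        # BG/NBD: every customer makes an initial purchase at t=0 and cannot
        # churn before their first repeat purchase.
        times = [0.0]
        t = rng.exponential(scale=1 / lam)
        while t < T:
            times.append(t)
            if rng.binomial(n=1, p=p):  # churn after each repeat purchase
                break
            t += rng.exponential(scale=1 / lam)
        return np.array(times)  # raw transaction timestamps, not summaries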

The reason I suggest this is that, as you may recall from our last weekly project meeting, @twiecki wants all lifetimes functionality in this notebook added to pymc-marketing:

https://github.com/ColtAllen/marketing-case-study/blob/main/case-study.ipynb

And, in time, the notebook itself will be added to the docs. The first thing we need is a raw transaction block to generate the synthetic data.

We should create issues for the other lifetimes utility and plotting functions in that notebook as well.