Closed drbenvincent closed 1 year ago
@drbenvincent my idea was for these datasets to be stored in the github repo. Then you can use pd.read_csv(url)
to load them. But I was doing those examples in the same PR where I added the datasets, so I couldn't use the github urls yet.
I previously opened a PR for CLV datasets, but afterward was unsure of the value-add. I can recreate it if there's interest in using them for the documentation.
Do the CausalPy datasets not require a MANIFEST.in file?
Do the CausalPy datasets not require a MANIFEST.in file?
I believe because we use pyproject.toml
exclusively (no setup.py
) then MANIFEST.in
isn't needed. But aspect of Python is new territory for me
I think it makes sense to go with @ricardoV94 's intended approach of downloading datasets from the repo. It's certainly lower effort and involves less code. So how about we go with that approach and potentially revisit this idea if it becomes necessary in the future?
Ad the moment, the live MMM example builds a simulated dataset and the CLV examples load dataset from the lifetimes package.
But at some point we will likely want to include example datasets in
pymc-marketing
. In fact this is done by @ricardoV94 in a demo notebook in a PR (both MMM and CLV).These notebooks import data like this, for example
The problem is that this will only work in the context of a local development environment. It won't work for a regular user who has pip installed the package.
So I propose that we add in dataset loading functionality like we do in CausalPy here. This was inspired by, and is a more minimal implementation, of the functionality in Bambi. Using
pathlib
also helps avoid OS specific headaches that could arise.Doing this will make the docs / example notebooks easier to write. It will make the importing of datasets easier and more robust for a user. It's important to avoid failure points for first time users, so I propose doing this as a priority before the release announcement.