build dataset loading functionality

drbenvincent commented 1 year ago

Ad the moment, the live MMM example builds a simulated dataset and the CLV examples load dataset from the lifetimes package.

But at some point we will likely want to include example datasets in pymc-marketing. In fact this is done by @ricardoV94 in a demo notebook in a PR (both MMM and CLV).

These notebooks import data like this, for example

data = pandas.read_csv("../../../datasets/clv_quickstart.csv")

The problem is that this will only work in the context of a local development environment. It won't work for a regular user who has pip installed the package.

So I propose that we add in dataset loading functionality like we do in CausalPy here. This was inspired by, and is a more minimal implementation, of the functionality in Bambi. Using pathlib also helps avoid OS specific headaches that could arise.

Doing this will make the docs / example notebooks easier to write. It will make the importing of datasets easier and more robust for a user. It's important to avoid failure points for first time users, so I propose doing this as a priority before the release announcement.

ricardoV94 commented 1 year ago

@drbenvincent my idea was for these datasets to be stored in the github repo. Then you can use pd.read_csv(url) to load them. But I was doing those examples in the same PR where I added the datasets, so I couldn't use the github urls yet.

ColtAllen commented 1 year ago

I previously opened a PR for CLV datasets, but afterward was unsure of the value-add. I can recreate it if there's interest in using them for the documentation.

Do the CausalPy datasets not require a MANIFEST.in file?

drbenvincent commented 1 year ago

Do the CausalPy datasets not require a MANIFEST.in file?

I believe because we use pyproject.toml exclusively (no setup.py) then MANIFEST.in isn't needed. But aspect of Python is new territory for me

drbenvincent commented 1 year ago

I think it makes sense to go with @ricardoV94 's intended approach of downloading datasets from the repo. It's certainly lower effort and involves less code. So how about we go with that approach and potentially revisit this idea if it becomes necessary in the future?

pymc-labs / pymc-marketing

build dataset loading functionality #206