Decreasing base dependencies

guybuk commented 1 week ago

Currently we only have pip install gataset and pip install gataset[dev]. The base dependencies are very large and include torch, torchvision, datasets, holoviews, panel, imageio, and more.

In practice, our goal is to provide a "core" dataset library that depends only on basic Python libraries like numpy and pandas, plus optional per-modality ("modal") extras such as:

gataset[vision]
gataset[text]
gataset[audio] etc.

The question remains, then, what should be included in such an extra?

For example, let's look at vision:

Should we include libraries like albumentations, holoviews, and panel just because our favorite downstream implementations of rendering engines and transforms use them? What if someone wants to implement transforms using torchvision?
Should we include imageio? What if someone wants to implement their own image reading?
What about pycocotools or torchvision, which mainly exist for dataset providers? Many users won't even use the providers, since they'll implement their own. And if we add torchvision, we are forced to depend on torch, which makes the problem even worse.
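
One common way to keep provider-only libraries out of the base install is to import them lazily, inside the provider that needs them. The following is a minimal sketch under assumptions: `optional_import` and `load_coco_annotations` are hypothetical names used for illustration and are not part of the library, and the "vision" extra in the error message is only an example.

```python
import importlib


def optional_import(module_name: str, extra: str):
    """Import module_name, or raise an error that names the extra to install.

    Hypothetical helper for illustration; not part of the library.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f"Install it directly or via the '{extra}' extra."
        ) from exc


def load_coco_annotations(path: str):
    # pycocotools is imported only when a COCO provider is actually used,
    # so the core install does not need it.
    coco = optional_import("pycocotools.coco", extra="vision")
    return coco.COCO(path)
```

With this pattern, users who write their own providers never hit the import, and users who call the stock provider without the dependency get an actionable message instead of a bare ImportError.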

guybuk commented 1 week ago

Use cases

Looking at it from the user's POV, I can think of four types:

A user that knows what they're doing, but wants to do their own thing:
- they'll download the core library and likely not engage with any tutorials or notebooks
A user that knows what they're doing, and also likes the stock implementations (e.g. albumentations, holoviews, pytorch):
- will not engage with the tutorials, but will download bridge-ds[vision/text/...]
A user that doesn't know what they're doing:
- will engage with the tutorials of the domain they like (for starters we are doing only vision tutorials)
- they'll either engage with the notebooks (requires downloading the opinionated extras) or read the docs (requires nothing) and become a user that knows what they're doing (see above)
A developer for bridge-ds:
- will need to download testing tools, in addition to whatever they're working on (vision, text, pytorch, etc.)

Proposition:

The core library should be as lean as possible
If users want to use the notebooks, they should download our opinionated extras, or manage with the printed docs
DL engine libraries like PyTorch or TF, as well as Jupyter, are expected to be installed separately from bridge-ds
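
Since the engine and the opinionated extras would be installed separately, a notebook could start with a small preflight check that reports what is missing. This is only a sketch: `missing_packages` and the `NOTEBOOK_REQUIREMENTS` list are assumptions made for illustration, not something the thread specifies.

```python
from importlib.util import find_spec


def missing_packages(import_names):
    """Return the top-level import names that are not importable here."""
    return [name for name in import_names if find_spec(name) is None]


# Hypothetical requirements for a vision notebook. Import names can differ
# from PyPI names (scikit-image is imported as "skimage").
NOTEBOOK_REQUIREMENTS = ["torch", "albumentations", "holoviews", "skimage"]

missing = missing_packages(NOTEBOOK_REQUIREMENTS)
if missing:
    print("Missing packages for this notebook:", ", ".join(missing))
```

`find_spec` only looks the module up without importing it, so the check stays cheap even for heavy packages like torch.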

Given our current situation, this works out as follows:

core - pandas, numpy
vision - skimage, holoviews, panel, hvplot, pycocotools, albumentations, tabulate (NOTE: pycocotools is excluded and will be downloaded for tutorials only)
dev - pytest, pytest-mock, pre-commit, testbook
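
The split listed above could be expressed as plain data that a setup script passes to setuptools via `install_requires` and `extras_require`. This is a sketch under assumptions: the `resolve` helper is hypothetical, pycocotools is left out of the vision extra as the note says, skimage is listed under its PyPI name scikit-image, and no version pins are shown because the thread gives none.

```python
# Base dependencies of the core library.
CORE = ["pandas", "numpy"]

# Optional extras, keyed by the name used in `pip install <package>[<extra>]`.
EXTRAS = {
    "vision": [
        "scikit-image",  # imported as `skimage`
        "holoviews",
        "panel",
        "hvplot",
        "albumentations",
        "tabulate",
    ],
    "dev": ["pytest", "pytest-mock", "pre-commit", "testbook"],
}


def resolve(extras=()):
    """Return the sorted package names installed for the given extras."""
    packages = set(CORE)
    for extra in extras:
        packages.update(EXTRAS[extra])
    return sorted(packages)
```

For example, `resolve()` yields only the core packages, while `resolve(["vision"])` adds the opinionated vision stack on top of them.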

guybuk commented 1 week ago

Closed with #8

guybuk / bridge-ds

Decreasing base dependencies #1
