`example-dvc-experiments`: update and make it runnable on macOS

iesahin commented 2 years ago

It looks like running example-dvc-experiments on macOS is not as smooth as it should be. We would like to update the project with the following goals:

We need a project that is somewhat realistic, looks cool visually in VS Code, and is very simple to install and run everywhere.
We need live metrics and checkpoints.
We need it to be as simple as possible w/o destroying the meaning completely.

I think we should first decide on the library, and then create a problem around it. The library should have a footprint small enough to run on Katacoda, be easy to maintain, and be supported by DVCLive.

Possible ML frameworks supported by DVCLive:

Catalyst: Being a PyTorch framework, I assume it has the same requirements as PyTorch.
Fast.ai says "You can install fastai on your own machines with conda (highly recommended), as long as you're running Linux or Windows (NB: Mac is not supported)." 🙅
LightGBM: Although it can be installed with brew install lightgbm on macOS, installation on Linux and Windows seems unnecessarily complicated.
MMCV also depends on PyTorch and probably has the same requirements.
PyTorch says Apple M1 support is "wheel only." It requires Anaconda. I'll test to see how comfortable this is.
XGBoost can be installed on M1 with pip install xgboost, although it lacks GPU support.

I believe that instead of a specialized PyTorch framework like Catalyst or MMCV, we can use PyTorch directly. In this case, we have two major candidates for libraries: PyTorch and XGBoost.

The other issue is the dataset: We can continue to use Fashion-MNIST or MNIST, but from the feedback, these seem a bit boring. Some more interesting alternatives:

Leaf classification: https://www.kaggle.com/datasets/amandam1/healthy-vs-diseased-leaf-image-dataset
Aircraft classification: https://www.robots.ox.ac.uk/~vgg/data/fgvc-aircraft/
Flowers: https://www.robots.ox.ac.uk/~vgg/data/flowers/102/
Food: https://data.vision.ee.ethz.ch/cvl/datasets_extra/food-101/
Traffic signs: https://benchmark.ini.rub.de/gtsrb_dataset.html

These are rather large datasets, but we can either use PyTorch datasets to get them, or use a subset (e.g. 5-10 classes) to reduce the amount of data required.
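The subset idea could look roughly like the sketch below. It is only an illustration, assuming the dataset is available as a list of (path, label) pairs; the function name `subset_by_class` and the `per_class` cap are hypothetical and not part of any existing project.

```python
import random
from collections import defaultdict


def subset_by_class(samples, keep_classes, per_class=None, seed=0):
    """Keep only samples whose label is in keep_classes.

    samples: iterable of (path, label) pairs.
    per_class: optional cap on the number of samples kept per class.
    """
    keep = set(keep_classes)
    grouped = defaultdict(list)
    for path, label in samples:
        if label in keep:
            grouped[label].append(path)

    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    subset = []
    for label in sorted(grouped):
        paths = sorted(grouped[label])
        if per_class is not None and len(paths) > per_class:
            paths = rng.sample(paths, per_class)
        subset.extend((p, label) for p in sorted(paths))
    return subset
```

Fixing the seed means everyone gets the same subset, which matters if the reduced dataset is going to be tracked with DVC.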

Also, the problem can be something other than classification but IMO, classification is easier to explain. "We have these images, we have to tell which one is which..."

iesahin commented 2 years ago

We also have a 2,800-image Cats & Dogs dataset and a related project. The project uses TF but it could be updated to use PyTorch.

iesahin commented 2 years ago

I'd like your input here: @dberenbaum @daavoo @shcheklein

daavoo commented 2 years ago

Framework

I would go with https://lightning-flash.readthedocs.io/en/latest/reference/image_classification.html. Based on PyTorch, so installation support is the same. Supported in DVCLive. Simplest training code. Set 3 intuitive parameters: backbone, image_size and epochs.

Dataset

I vote for cats vs dogs.

They are cute.
We already have it and "own" an extended version.
Can include a "data-centric" experiment (initial vs extended version).
Easy for people to test the resulting model with a random image of their pet.
The size of the dataset is similar to most real-world computer vision projects.

A completely different option could be to use HuggingFace for text classification (https://huggingface.co/docs/transformers/tasks/sequence_classification). The project can be set up to support TensorFlow and PyTorch with the same code.

For the dataset, we can use GitHub issues and their associated labels from the DVC repo. I used this in my workshop (https://github.com/iterative/workshop-uncool-mlops/blob/main/src/get_data.py).
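As a rough sketch of what "issues plus labels" could mean as a text classification dataset (this is not the workshop's get_data.py, just an illustration): assuming the issues have already been downloaded as JSON objects in the shape the GitHub REST API returns (`title`, `body`, `labels`, and a `pull_request` key on pull requests), they can be turned into text/label rows. The function name `issues_to_rows` and the rule of keeping only issues with exactly one allowed label are assumptions.

```python
def issues_to_rows(issues, allowed_labels):
    """Convert GitHub issue dicts into {"text", "label"} rows.

    Issues with zero or several allowed labels are skipped so that each
    row has exactly one class (an assumption made for this sketch).
    """
    allowed = set(allowed_labels)
    rows = []
    for issue in issues:
        if "pull_request" in issue:  # the issues endpoint also returns PRs
            continue
        names = [
            label["name"]
            for label in issue.get("labels", [])
            if label["name"] in allowed
        ]
        if len(names) != 1:
            continue
        # body can be null in the API response, hence the "or" fallback
        text = f"{issue.get('title', '')}\n{issue.get('body') or ''}".strip()
        rows.append({"text": text, "label": names[0]})
    return rows
```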

iesahin commented 2 years ago

The current prototype project is in https://github.com/iterative/example-dvc-prototype

It uses pytorch-lightning for transfer learning with ResNet-18 on the Cats & Dogs dataset.

iterative / example-repos-dev

`example-dvc-experiments`: update and make it runnable on macOS #113