iterative / example-repos-dev

Source code and generator scripts for example DVC projects
https://dvc.org/doc

`example-dvc-experiments`: update and make it runnable on macOS #113

Closed iesahin closed 1 year ago

iesahin commented 2 years ago

It looks like running example-dvc-experiments on macOS is not as smooth as it should be. We would like to update the project with the following goals.

I think we should first decide on the library, and then create a problem around it. The library should have a footprint small enough to run on Katacoda, be easy to maintain, and be supported by DVCLive.

Possible ML frameworks supported by DVCLive include:

I believe that instead of a specialized PyTorch framework like Catalyst or MMCV, we can use PyTorch directly. In this case, we have two major candidates for libraries: PyTorch and XGBoost.

The other issue is the dataset: we can continue to use Fashion-MNIST or MNIST, but from the feedback, these seem a bit boring. Some more interesting alternatives:

These are rather large datasets, but we can either use PyTorch datasets to get them, or use a subset (e.g. 5-10 classes) to reduce the amount of data required.
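A rough sketch of the subsetting idea, assuming the dataset is available as a list of `(item, label)` pairs. The function name `subset_by_class`, the `max_per_class` cap, and the toy file names are hypothetical and only there for illustration; they are not part of any existing project code.

```python
import random
from collections import defaultdict


def subset_by_class(samples, keep_classes, max_per_class=None, seed=0):
    """Keep only samples whose label is in keep_classes, optionally capping each class."""
    keep = set(keep_classes)
    by_label = defaultdict(list)
    for item, label in samples:
        if label in keep:
            by_label[label].append((item, label))

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = []
    for label in sorted(by_label):
        group = by_label[label]
        if max_per_class is not None and len(group) > max_per_class:
            group = rng.sample(group, max_per_class)
        subset.extend(group)
    return subset


# Hypothetical toy data: 20 classes with 50 file names each.
samples = [(f"img_{c}_{i}.jpg", c) for c in range(20) for i in range(50)]

# Keep 5 of the 20 classes and at most 10 images per class.
small = subset_by_class(samples, keep_classes=range(5), max_per_class=10)
```

Fixing the seed keeps the subset stable between runs, which matters if the reduced dataset is going to be tracked and versioned.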

Also, the problem can be something other than classification, but IMO classification is easier to explain: "We have these images, we have to tell which one is which..."

iesahin commented 2 years ago

We also have a 2,800-image Cats & Dogs dataset and a related project. The project uses TF, but it could be updated to use PyTorch.

iesahin commented 2 years ago

I'd like your input here: @dberenbaum @daavoo @shcheklein

daavoo commented 2 years ago

I would go with https://lightning-flash.readthedocs.io/en/latest/reference/image_classification.html. It is based on PyTorch, so installation support is the same. It is supported in DVCLive. It has the simplest training code: set 3 intuitive parameters (backbone, image_size and epochs).

I vote for cats vs dogs. They are cute. We already have it and "own" an extended version. We can include a "data-centric" experiment (initial vs extended version). It is easy for people to test the resulting model with a random image of their pet. The size of the dataset is similar to most real-world computer vision projects.


A completely different option could be to use HuggingFace for text classification (https://huggingface.co/docs/transformers/tasks/sequence_classification). The project can be set up to support TensorFlow and PyTorch with the same code.

For the dataset, we can use GitHub issues and their associated labels from the DVC repo. I used this in my workshop (https://github.com/iterative/workshop-uncool-mlops/blob/main/src/get_data.py).
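To make the issues-and-labels idea concrete, here is a minimal sketch of turning issue records into `(text, label)` pairs. It assumes the records are shaped like GitHub REST API issue objects (`title`, `body`, a `labels` list of objects with a `name`, and a `pull_request` key on pull requests). The function name `issues_to_examples`, the sample records, and the label names are hypothetical, and this is not necessarily what the workshop script linked above does.

```python
def issues_to_examples(issues, allowed_labels):
    """Turn GitHub issue dicts into (text, label) pairs for text classification."""
    allowed = set(allowed_labels)
    examples = []
    for issue in issues:
        # The REST API lists pull requests alongside issues; skip them.
        if "pull_request" in issue:
            continue
        names = [lab["name"] for lab in issue.get("labels", [])]
        matches = [n for n in names if n in allowed]
        # Keep only issues with exactly one label of interest (single-label task).
        if len(matches) != 1:
            continue
        title = issue.get("title") or ""
        body = issue.get("body") or ""  # body can be None for empty issues
        text = f"{title}\n\n{body}".strip()
        examples.append((text, matches[0]))
    return examples


# Hypothetical records shaped like GitHub REST API issue objects.
issues = [
    {"title": "dvc push fails", "body": "Traceback ...", "labels": [{"name": "bug"}]},
    {"title": "Add S3 option", "body": None,
     "labels": [{"name": "feature request"}, {"name": "p2"}]},
    {"title": "Fix typo", "body": "", "labels": [{"name": "bug"}], "pull_request": {}},
    {"title": "Question", "body": "How do I ...", "labels": []},
]

examples = issues_to_examples(issues, allowed_labels={"bug", "feature request"})
```

Dropping issues with zero or several matching labels keeps the task single-label, which is the simpler setup to explain. A multi-label variant would keep all matches instead.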

iesahin commented 2 years ago

The current prototype project is at https://github.com/iterative/example-dvc-prototype

It uses pytorch-lightning for transfer learning with ResNet-18 on the Cats & Dogs dataset.