iterative / katacoda-scenarios

Interactive Katacoda Scenarios
https://www.katacoda.com/dvc

mnist tutorial failing #15

Open sp7412 opened 3 years ago

sp7412 commented 3 years ago

In Step 1, `pip install -r requirements.txt` fails to run.

shcheklein commented 3 years ago

That probably means the scenario is out of date as well. It needs some care.

jorgeorpinel commented 3 years ago

pip fails to install pandas 0.23.4, which is pretty old, yes.

iesahin commented 3 years ago

I don't think it's the age of the packages; the container is limited in CPU and memory, so compiling pandas from source takes practically forever.

[Screenshot, 2021-03-14 16:37:58]

I don't think a newer version will solve the problem. A precompiled version from apt may run. (And even in that case, I wonder how long it would take to train a model.)

I think we can remove this scenario completely.

I installed python3-pandas and python3-sklearn from apt packages. It looks like the SVM part of the tutorial can be run. I'm not sure about the torch/CNN part. (At some point it asks to install torch==1.0.0, which cannot be found. torch-1.4.0 can be installed, but the training script cannot find it. I need to look into that.)

We can have some SVM parameterization and use this as an experiments tutorial. The downloaded data is not images, though; it's a CSV file extracted from the images. I can add featurization as well.
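
Roughly what I have in mind, as a sketch; the file names and parameter keys below are made up:

```python
# Hypothetical train.py: an SVM whose hyperparameters come from
# params.yaml, so DVC experiments can vary them.
import yaml
import pandas as pd
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

with open("params.yaml") as f:
    params = yaml.safe_load(f)["train"]

# The scenario's MNIST data is a CSV of pixel values, not raw images.
df = pd.read_csv("data/mnist.csv")
X, y = df.drop(columns=["label"]), df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = SVC(C=params["C"], kernel=params["kernel"])
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```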

All the content and commands would need to change, though; `dvc run` has `-f` parameters, and there are many parts like this:

[Screenshot, 2021-03-14 17:08:36]

Actually, removing it may be OK. It needs a total rewrite. I can create a new one using https://github.com/iterative/dvc-checkpoints-mnist

WDYT? @jorgeorpinel @shcheklein @dberenbaum

shcheklein commented 3 years ago

I'm fine with removing it and starting from Dave's one.

iesahin commented 3 years ago

I tested the basic branch of dvc-checkpoints-mnist and it gets killed in the training step due to the memory limits in Katacoda. Do you have a preference for which deep learning architecture/library/technique is used in the examples? It may be possible to use TFLite and download pre-built models directly in the examples. I'm asking because it may take more time to adjust to low-memory environments.
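
By "download models directly" I mean something like this; the model path is an assumption:

```python
# Sketch: run inference with a pre-built TFLite model instead of
# training on Katacoda.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# One blank input of whatever shape the model expects (e.g. a 28x28 digit).
x = np.zeros(inp["shape"], dtype=inp["dtype"])
interpreter.set_tensor(inp["index"], x)
interpreter.invoke()
print(interpreter.get_tensor(out["index"]))
```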

Katacoda has 1.5GB of RAM. I can create a Docker environment to simulate this.

BTW, I read the Docker discussions in https://github.com/iterative/dvc.org/pull/811 and https://github.com/iterative/dvc/pull/2844

Instead of a general-purpose Docker image, the docs could provide a container that downloads the data and sets up the example project. We can use it for the tests, and if they like, people can build on top of it or create their own Docker environments.

@shcheklein @dberenbaum @jorgeorpinel

shcheklein commented 3 years ago

@iesahin do we know what takes all the memory? It's a bit unexpected that MNIST requires that much RAM.

iesahin commented 3 years ago

@shcheklein I didn't profile it thoroughly, but the line in training that builds the prediction, `y_pred = model(x)` or something like that, causes the kill. (I'm writing from the phone.) The data itself is downloaded and loaded into memory, but the model may take that much RAM.

There may be some engineering, like increasing the swap space or manual GC, to reduce the required memory. But Torch itself is a rather expensive library to run with 1-1.5 GB RAM + 1 GB swap.

There may be different versions of the classifiers, like random forest, SVM, NB, CNN, MLP, etc., to test and experiment with (selected via parameters in DVC). We can use the modest ones in Katacoda, but users may try all of them in their own environments.
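
For example, something like this; the classifier names and defaults are placeholders:

```python
# Sketch: pick the classifier from a DVC parameter so Katacoda can default
# to a modest model while users try heavier ones locally.
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

CLASSIFIERS = {
    "svm": SVC(),
    "random_forest": RandomForestClassifier(),
    "naive_bayes": GaussianNB(),
    "mlp": MLPClassifier(),
}

def build_model(name):
    # `name` would come from params.yaml, e.g. model: svm
    return CLASSIFIERS[name]
```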

dberenbaum commented 3 years ago

I can (and probably should) load mini-batches of data in the example, which could help, but maybe not if PyTorch itself already uses almost all the available memory. We could also try a more lightweight deep learning framework. Also curious which branch you are using, @iesahin?
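
Something like this is what I mean; the stand-in model and batch size are just for illustration:

```python
# Sketch: stream MNIST in mini-batches so only one batch is in memory
# at a time, instead of the full dataset tensor.
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_data = datasets.MNIST("data", train=True, download=True,
                            transform=transforms.ToTensor())
loader = DataLoader(train_data, batch_size=64, shuffle=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # stand-in
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for x, y in loader:              # 64 images per step instead of all 60k
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
```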

iesahin commented 3 years ago

Let me first profile the script. I doubt mini-batches will solve the memory problem (the PyTorch download alone is around 700 MB), but they may help it converge faster. It takes around 100 epochs to reach >0.90 accuracy.
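
Something like this should show where the resident memory jumps; the model and data below are stand-ins, not the actual script:

```python
# Sketch: print peak RSS around the suspected line. On Linux,
# ru_maxrss is reported in kilobytes.
import resource
import torch
from torch import nn

def peak_rss_mb():
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
x = torch.zeros(60000, 1, 28, 28)   # whole-dataset tensor, worst case
print("before forward:", peak_rss_mb(), "MB")
y_pred = model(x)                   # the line suspected of the OOM kill
print("after forward:", peak_rss_mb(), "MB")
```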

I tested on several branches but traced it on the basic branch. Thank you.

iesahin commented 3 years ago

If libtorch.so really takes up 1.2GB of RAM as discussed here, there's not much we can do about it.

iesahin commented 3 years ago

I tested the dogs-and-cats data and model versioning tutorial on Katacoda in a Docker container: https://dvc.org/doc/use-cases/versioning-data-and-model-files/tutorial

Tensorflow runs, but creating the model takes a long time: python train.py takes around 30 minutes, and most of that time is spent before the epoch progress bars appear. It may be possible to load the model all at once and reduce this considerably.
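
My guess (not verified) is that the pre-epoch time is the VGG16 weight download. Keras caches weights under ~/.keras, so triggering the download once, e.g. while building the Docker image, should make later runs start much faster; the input shape here is an assumption:

```python
# Sketch: trigger the one-time VGG16 weight download so it lands in the
# Keras cache; subsequent runs reuse it instead of downloading again.
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False,
             input_shape=(150, 150, 3))
base.summary()
```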

But I'm not sure if it can be done near-instantly. We may still need smaller datasets/models for Katacoda.

I'll also test the MNIST dataset with TF on Katacoda. TF seems more suitable for low-memory environments, and MNIST is better known; it's like the Hello World of ML tutorials.

@shcheklein @dberenbaum

iesahin commented 3 years ago

I tested the MNIST example in TF site.

https://gist.github.com/iesahin/f3a22ebca5b52579748dc7d724047c8d

It takes less than 1 minute for the whole script to finish on Katacoda. The model is quite simple: no CNN, a single Dense(128) layer (97% validation accuracy). But at least now we know it's possible to use MNIST on Katacoda.
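
For reference, the model in the gist follows TF's beginner quickstart, roughly:

```python
import tensorflow as tf

# Load and normalize MNIST.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# No CNN: flatten, one Dense(128) hidden layer, logits output.
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10),
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5,
          validation_data=(x_test, y_test))
```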