[Open] sp7412 opened this issue 3 years ago
Probably it means that it's out of date as well. Needs some care.
`pip` fails to install pandas 0.23.4, which is pretty old, yes.
I think it's not the age of the packages; the container is limited in CPU and memory, and compiling pandas takes practically forever.
I don't think a newer version will solve the problem. A precompiled version from `apt` may run. (And even in that case, I wonder how long it would need to train a model.)
I think we can remove this scenario completely.
I installed the `python3-pandas` and `python3-sklearn` packages from `apt`. It looks like the SVM part of the tutorial can be run. I'm not sure about the `torch`/CNN part. (At some point it asks to install `torch==1.0.0`, but it cannot be found. `torch-1.4.0` can be installed, but the training script cannot find it. I need to look into that.)
We can add some SVM parameterization and use this as an experiments tutorial. The downloaded data isn't images, though; it's a CSV file extracted from images. I can add featurization as well.
All the content and commands should change, though: `dvc run` has `-f` parameters, and there are many parts like this:
Actually removing may be OK. It needs a total rewrite. I can create a new one using https://github.com/iterative/dvc-checkpoints-mnist
WDYT? @jorgeorpinel @shcheklein @dberenbaum
I'm fine with removing it and starting from Dave's one.
I tested the `basic` branch of `dvc-checkpoints-mnist`, and it's killed in the training step due to the memory limits in Katacoda. Do you have a preference for which deep learning architecture / library / technique is used in the examples? It may be possible to use TFLite and download models directly in the examples. I'm asking because it may take more time to adjust to low-memory environments.
Katacoda has 1.5GB of RAM. I can create a Docker environment to simulate this.
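Simulating the limit locally could look something like this; the base image and the exact limits are illustrative, not a tested setup:

```shell
# Simulate Katacoda's ~1.5 GB RAM limit in a local container.
# --memory caps RAM; --memory-swap caps RAM + swap (here, 1 GB of swap on top).
docker run -it --memory=1.5g --memory-swap=2.5g ubuntu:20.04 bash
```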
BTW I read Docker discussions in https://github.com/iterative/dvc.org/pull/811 and https://github.com/iterative/dvc/pull/2844
Instead of a general-purpose Docker image, the docs could provide a container that downloads the data and sets up the example project. We can use it for the tests, and if they like, people may build on top of it or create their own Docker environments.
@shcheklein @dberenbaum @jorgeorpinel
@iesahin do we know what takes all the memory? It's a bit unexpected that MNIST requires that much RAM.
@shcheklein I didn't profile it thoroughly, but the line in training that builds the prediction, `y_pred = model(x)` or something like that, causes the kill. (I'm writing from my phone.) The data itself is downloaded and loaded into memory, but the model may take that much RAM.
There may be some engineering, like increasing the swap space or manual GC, to reduce the required memory. But Torch itself is a rather expensive library to run with 1–1.5 GB RAM + 1 GB swap.
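For the swap-space idea, the usual Linux recipe is something like the following; it assumes root access and a writable root filesystem, which Katacoda's containers may not grant:

```shell
# Add 1 GB of swap (standard Linux steps; requires root).
fallocate -l 1G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
free -h   # verify the new swap shows up
```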
There may be different versions of the classifiers, like random forest, SVM, NB, CNN, MLP, etc., to test and experiment with (selected via parameters in DVC). We can use the modest ones in Katacoda, but users may try all of them in their own environments.
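A sketch of how the parameter-driven selection might look; the `build_model` helper, the model names, and the lazy imports are all illustrative, not from any existing tutorial code. The `model_type` value would come from a DVC `params.yaml`.

```python
# Illustrative sketch: pick a classifier family from a DVC parameter.
# Heavy libraries are imported lazily, so a Katacoda run only pays for
# the model it actually selects.

def build_model(model_type, **kwargs):
    if model_type == "svm":
        from sklearn.svm import SVC
        return SVC(**kwargs)
    if model_type == "rf":
        from sklearn.ensemble import RandomForestClassifier
        return RandomForestClassifier(**kwargs)
    if model_type == "nb":
        from sklearn.naive_bayes import GaussianNB
        return GaussianNB(**kwargs)
    raise ValueError(f"unknown model type: {model_type!r}")
```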
I can (and probably should) load mini-batches of data in the example, which could help, but maybe not if PyTorch already uses almost all the available memory. We could also try a more lightweight deep learning framework. Also curious which branch you are using @iesahin?
Let me first profile the script. I doubt mini-batches will solve the memory problem (PyTorch's download size alone is around 700 MB), but they may help it converge faster. It takes around 100 epochs to reach >0.90 accuracy.
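For the Python side of that profiling, the stdlib `tracemalloc` module can report peak allocations without extra dependencies; note it only sees Python-level allocations, so libtorch's native buffers won't show up. The `train_step` here is just a stand-in:

```python
import tracemalloc

def train_step():
    # Stand-in for the real training step: allocate a large list
    # so there is something to measure (~8 MB of pointers).
    data = [0.0] * 1_000_000
    return sum(data)

tracemalloc.start()
train_step()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak Python-side allocation: {peak / 1e6:.1f} MB")
```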
I tested on several branches but traced on the `basic` branch. Thank you.
If `libtorch.so` really takes up 1.2 GB of RAM as discussed here, there's not much we can do about it.
I tested the dogs-and-cats data and model versioning tutorial on Katacoda in a Docker container: https://dvc.org/doc/use-cases/versioning-data-and-model-files/tutorial
TensorFlow runs, but creating the model takes a long time. `python train.py` takes around 30 minutes, and most of that time is spent before the epoch progress bars appear. It may be possible to load the model at once and reduce this considerably.
But I'm not sure it can be done near-instantly. We may still need smaller datasets/models for Katacoda.
I'll also test the MNIST dataset with TF on Katacoda. TF seems more suitable for low-memory environments, and MNIST is better known; it's like the Hello World of ML tutorials.
@shcheklein @dberenbaum
I tested the MNIST example from the TF site.
https://gist.github.com/iesahin/f3a22ebca5b52579748dc7d724047c8d
It takes less than a minute for the whole script to finish on Katacoda. The model is a bit simple: no CNN, one Dense/128 layer (97% val. acc.). But at least now we know it's possible to use MNIST on Katacoda.
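That model really is tiny. A back-of-the-envelope parameter count for Flatten(28×28) → Dense(128) → Dense(10), plain arithmetic with no TF needed:

```python
# Parameter count of the single-Dense-128 MNIST model.
hidden = 28 * 28 * 128 + 128  # Dense(128): weights + biases
output = 128 * 10 + 10        # Dense(10): weights + biases
total = hidden + output
print(total, "parameters ->", round(total * 4 / 1e6, 2), "MB at float32")
```

So the weights themselves are well under a megabyte; the memory cost on Katacoda is dominated by the framework, not the model.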
In Step 1, `pip install -r requirements.txt` fails to run.