mapbox / robosat

Semantic segmentation on aerial and satellite imagery. Extracts features such as: buildings, parking lots, roads, water, clouds
MIT License
2.02k stars 382 forks

"Out of memory" error. Can anything be done? #131

Closed amandasaurus closed 5 years ago

amandasaurus commented 5 years ago

`rs train` and `rs predict` fail with `CUDA error: out of memory`.

I have an Nvidia GeForce GT 710, which is admittedly pretty low end. The specs say it has 2GB of memory, but nvidia-smi only reports 978MiB (~1 "giga" byte) 🤔.
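As a quick sanity check on the units (a generic sketch, not robosat-specific): spec sheets use decimal gigabytes (10^9 bytes) while nvidia-smi reports MiB (2^20 bytes), but that conversion alone cannot explain the gap here:

```python
# Marketing "2 GB" vs the MiB that nvidia-smi reports.
advertised_bytes = 2 * 10**9            # "2 GB" on the spec sheet (decimal)
advertised_mib = advertised_bytes / 2**20  # convert to binary mebibytes

print(round(advertised_mib))  # 1907 MiB, still nearly double the 978 MiB shown
```

So the discrepancy is not just a units issue; a card that exposes roughly 1 GiB to CUDA is the more plausible reading of the nvidia-smi output.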

When I set `batch_size = 1` and `image_size = 256` (and in `rs download` I download the 256x256 tiles, i.e. no `@2x` suffix), I still get the same error. It takes a few seconds before I get a Python error of `RuntimeError: CUDA error: out of memory`, rather than ~1 second, so it feels like it's lasting longer before OOMing. But it still fails. This happens on `rs train` and `rs predict`.

Is there any way to make robosat use less memory so that I can at least run this on my GPU rather than my CPU? Or must I just accept that my hardware isn't good enough? I know very little about graphics cards, cuda, or torch, or computer vision stuff.


I can run it on my CPU by installing torch with `pip install --upgrade https://download.pytorch.org/whl/cpu/torch-0.4.0-cp36-cp36m-linux_x86_64.whl` and setting `cuda = false` in model.toml. It works, and for tiny areas I can get results in a few hours. But I'd love it if it could be faster.
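For reference, the relevant knobs in model.toml look something like this (a sketch based only on the settings mentioned in this thread; exact key names and section layout may differ between robosat versions):

```toml
[model]
cuda       = false  # fall back to CPU (pairs with the CPU-only torch wheel)
batch_size = 1      # smallest possible batch to minimise memory use
image_size = 256    # should match the tile size you download or split to
```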


$ nvidia-smi
Wed Oct 10 18:21:07 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.87                 Driver Version: 390.87                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GT 710      Off  | 00000000:01:00.0 N/A |                  N/A |
| 40%   38C    P8    N/A /  N/A |    177MiB /   978MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0                    Not Supported                                       |
+-----------------------------------------------------------------------------+

and

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85

My OS is Ubuntu 18.04.

daniel-j-h commented 5 years ago

Looks like you only have 978 MiB (~1GB) CUDA memory available on your GPU. That's quite low indeed.

Here's what you can try:

- Reduce `batch_size` in model.toml (you are already at 1).
- Reduce the image size: the models we use in `rs train` should be able to go as low as 32x32 pixels. Cut a downloaded 512x512 tile into its 4 256x256 tiles. Repeat to get 128x128. Repeat to get 64x64.
- Reduce the model's size, e.g. lower `num_filters` or swap out the `resnet50` encoder for a smaller one.
- Try the FPN branch in https://github.com/mapbox/robosat/pull/75.

If it turns out it still doesn't fit into your GPU memory, or you learn that you would have to trade off too much model performance to get it to work (both of which might be the case, since we optimize robosat for server usage and not for mobile devices), I strongly recommend getting either an AWS p2 instance or a similar GPU, e.g. at paperspace.com.

You should be able to run and prepare every single step in the pipeline except `rs train` reasonably fast on your laptop, and if you then need to spend 10-20 bucks to train on a cloud GPU for a day or two, I think that's a fair investment we can expect from users.

amandasaurus commented 5 years ago

Thanks for your help, I'll look into those things. Although I previously mentioned using a laptop, I've since bought a desktop, looks like I didn't spend enough! 😛

It looks like AWS p2 instances start at 61GB of GPU RAM, and the Nvidia GeForce GTX 1080 Ti has 11GB. Is that the sort of "minimum memory" you're designing robosat for?

daniel-j-h commented 5 years ago

The p2's K80 and the GTX 1080 Ti both have around 12 GB of GPU memory. That's what we are using for prototyping, experimentation, and training. But as I outlined above, there's plenty of room to make it work on smaller GPUs; we only fully utilize the 12 GB right now with large batch sizes and 512x512 images.

Keep me posted how it works out for you!

amandasaurus commented 5 years ago

Thanks for your reply. I'm looking into changing `num_filters` and the `resnet50` encoder; nothing to report yet. The FPN PR doesn't apply cleanly to the master branch, so I can't test that.

> The models we use in `rs train` should be able to go as low as 32x32 pixels ... cut a downloaded 512x512 tile into its 4 256x256 tiles. Repeat to get 128x128. Repeat to get 64x64.

Can you explain how to lay out this data? Should I turn one 512x512 z18 tile into 4 256x256 z19 tiles, then into 16 128x128 z20 tiles? (I can do this with a bash script + ImageMagick.) Then work on these new z20 tiles? Or do I split a 512x512 z18 tile into smaller tiles and leave them in the same directory? But you mention Python code (I've used Python professionally for years); what would I change? Would that involve changing BufferedSlippyMapDirectory to somehow think there were more (but smaller) tiles? 😕

daniel-j-h commented 5 years ago

For training the tile's z doesn't matter so you can take your first approach.
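The parent/child relation behind this splitting is simple: a slippy-map tile (x, y) at zoom z is covered by the four tiles (2x+dx, 2y+dy) at zoom z+1, with y growing downward. A minimal sketch (a hypothetical helper, not part of robosat itself):

```python
def children(x, y, z):
    """Return the four slippy-map tiles at zoom z+1 that cover tile (x, y, z).

    Order is row-major: top-left, top-right, bottom-left, bottom-right,
    matching how an image quadrant split is usually numbered.
    """
    return [(2 * x + dx, 2 * y + dy, z + 1) for dy in (0, 1) for dx in (0, 1)]

print(children(0, 0, 18))  # [(0, 0, 19), (1, 0, 19), (0, 1, 19), (1, 1, 19)]
```

So a 512x512 z18 tile maps onto exactly those four 256x256 z19 tiles, and applying the function recursively gives the 128x128 z20 layer and so on.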

amandasaurus commented 5 years ago

That's great. Do I need to change any settings? `image_size` is in the model toml file; I presume I should change that to whatever size I use?

daniel-j-h commented 5 years ago

Yeah the image size should be e.g. 256 if your slippy map tile images are of size 256x256.

In addition, I just rebased the FPN branch (https://github.com/mapbox/robosat/pull/75); feel free to give it a go.

amandasaurus commented 5 years ago

I had "success" with splitting the tiles up. With batch=1, I was able to run `rs train` on my CPU at about 6 sec per 512×512 tile. By splitting it 4 times, I got 32×32 images, and was just about able to squeeze that into 1GB of memory; it was ~930MB IIRC (batch=1 ofc). I got 0.6s per tile then, 10 times faster per tile! But there are 64 times more images... so ~6½ times slower overall! 🤦
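That trade-off can be sanity-checked with quick arithmetic (a sketch using only the 6 s and 0.6 s figures quoted above; note that going from 512 all the way to 32 is four halvings, i.e. 256 subtiles, so the quoted "64 times more images" corresponds to three halvings, down to 64×64):

```python
t_full = 6.0  # seconds per full 512x512 tile (CPU figure quoted above)
t_sub = 0.6   # seconds per small subtile (GPU figure quoted above)

def overall_slowdown(halvings):
    # Each halving of the tile edge multiplies the tile count by 4,
    # so total time scales with 4**halvings despite the per-tile speedup.
    subtiles = 4 ** halvings
    return subtiles * t_sub / t_full

print(overall_slowdown(3))  # ≈ 6.4: 64 subtiles make it ~6.4x slower overall
```

In other words, a 10x per-tile speedup is wiped out once the tile count grows faster than the per-tile time shrinks.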

I can run `rs predict` using the graphics card, and that's faster than the CPU; it only takes ~300MB of memory on 512×512 images and does ~20 tiles per second. So I can train on the CPU but predict on the GPU.

I've realised that I have a laptop with a 2GB Nvidia card. I'll try that over the Karlsruhe hack weekend.

amandasaurus commented 5 years ago

Progress: I'm running it on a laptop with 2004MiB of GPU memory. With batch=1 and tiles split once into 256×256, it works! nvidia-smi tells me it's using 1329MiB of memory. `rs train` runs at 1.32 batches/sec (i.e. 0.75 sec per batch). That's still about 10× faster than the CPU approach.

amandasaurus commented 5 years ago

For those interested, here is the script to split tiles. Call it like `./split_tiles.sh ./tiles/ 18`.


#! /bin/bash

set -o nounset
set -o errexit

TILEDIR=$(realpath "$1")
ORIG_ZOOM=$2
NEW_ZOOM=$(( $ORIG_ZOOM + 1 ))

if [[ ! -d $TILEDIR/$ORIG_ZOOM ]] ; then
    exit
fi

NUM_FILES=$(find $TILEDIR/$ORIG_ZOOM -mindepth 2 -maxdepth 2 -type f | wc -l)
find $TILEDIR/$ORIG_ZOOM -mindepth 2 -maxdepth 2 ! -name '*sub*' -type f -printf "%P\n" | while read TILE ; do
    X=${TILE%%/*}
    Y=${TILE##*/}
    Y=${Y%%.*}
    mkdir -p $TILEDIR/$NEW_ZOOM/$(($X*2))/
    mkdir -p $TILEDIR/$NEW_ZOOM/$(($X*2+1))/
    # Imagemagick can do -crop 50%x50% image_%d.png which writes 4 files, but
    # that doesn't work with webp.
    # -crop numbers the quadrants row-major: sub0=top-left, sub1=top-right,
    # sub2=bottom-left, sub3=bottom-right. Slippy-map y grows downward, so
    # top-right is (2x+1, 2y) and bottom-left is (2x, 2y+1).
    convert $TILEDIR/$ORIG_ZOOM/$TILE +repage -crop 50%x50% $TILEDIR/$ORIG_ZOOM/$X/${Y}_sub%d.png 2>/dev/null
    mv $TILEDIR/$ORIG_ZOOM/$X/${Y}_sub0.png $TILEDIR/$NEW_ZOOM/$(($X*2))/$(($Y*2)).png || (ls -lh $TILEDIR/$ORIG_ZOOM/$X/ ; exit 1)
    mv $TILEDIR/$ORIG_ZOOM/$X/${Y}_sub1.png $TILEDIR/$NEW_ZOOM/$(($X*2+1))/$(($Y*2)).png || (ls -lh $TILEDIR/$ORIG_ZOOM/$X/ ; exit 1)
    mv $TILEDIR/$ORIG_ZOOM/$X/${Y}_sub2.png $TILEDIR/$NEW_ZOOM/$(($X*2))/$(($Y*2+1)).png || (ls -lh $TILEDIR/$ORIG_ZOOM/$X/ ; exit 1)
    mv $TILEDIR/$ORIG_ZOOM/$X/${Y}_sub3.png $TILEDIR/$NEW_ZOOM/$(($X*2+1))/$(($Y*2+1)).png || (ls -lh $TILEDIR/$ORIG_ZOOM/$X/ ; exit 1)
    rm -v $TILEDIR/$ORIG_ZOOM/$TILE
done | pv -l -s $NUM_FILES -N "splitting $ORIG_ZOOM->$NEW_ZOOM" >/dev/null
find $TILEDIR/$ORIG_ZOOM -type d -empty -delete