samjabrahams / tensorflow-on-raspberry-pi

TensorFlow for Raspberry Pi

Inception-v3 works, AlexNet causes OOM #14

Open ArnoXf opened 8 years ago

ArnoXf commented 8 years ago

Hi! First of all, many thanks for your work! Installation with pip on Pi 3 with Jessie worked without any issues. I first tried the Inception-v3 classification you provided and it worked very well. Now I am trying to get AlexNet working on the Oxford 17 flowers dataset. I have the following configuration:

# TFLearn imports required for the layers below
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.conv import conv_2d, max_pool_2d
from tflearn.layers.normalization import local_response_normalization
from tflearn.layers.estimator import regression

input_layer = input_data(shape=[None, 224, 224, 3])

conv1 = conv_2d(input_layer, 96, 11, strides=4, activation='relu')
pool1 = max_pool_2d(conv1, 3, strides=2)
network = local_response_normalization(pool1)

conv2 = conv_2d(network, 256, 5, activation='relu')
pool2 = max_pool_2d(conv2, 3, strides=2)
network = local_response_normalization(pool2)

conv3 = conv_2d(network, 384, 3, activation='relu')
conv4 = conv_2d(conv3, 384, 3, activation='relu')
conv5 = conv_2d(conv4, 256, 3, activation='relu')
pool3 = max_pool_2d(conv5, 3, strides=2)
network = local_response_normalization(pool3)

fc1 = fully_connected(network, 4096, activation='tanh')
dropout1 = dropout(fc1, 0.5)
fc2 = fully_connected(dropout1, 4096, activation='tanh')
dropout2 = dropout(fc2, 0.5)
fc3 = fully_connected(dropout2, 2, activation='softmax')
network = regression(fc3, optimizer='momentum', loss='categorical_crossentropy',
                     learning_rate=0.01)

This was written using the TFLearn API, but I think it gives a good overview of the layers and configuration. The code works on my desktop computer but fails with an OOM error in the fully connected layers on the Pi. Reducing the fully connected layers from 4096 to 1024 units avoids the OOM error, but unfortunately it still doesn't get to training; it just quits after building up the network.

Any ideas how to solve this? Isn't the loaded Inception graph bigger than AlexNet?

samjabrahams commented 8 years ago

Thanks for posting this! I'll try to take a look at these memory issues sometime over the weekend. I believe that the issue is that you're actually training the AlexNet model on the Raspberry Pi, whereas the Inception model is pretrained.

When you train a model, the machine has to store the values of each node along the way in order to compute the gradients, which means training requires much more memory than just running the forward pass.
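As a rough back-of-the-envelope illustration (an assumption-laden sketch: float32 weights and TFLearn's default 'same' padding, which would make the feature map feeding fc1 about 7x7x256), the first fully connected layer alone is already a large chunk of the Pi's RAM, and a momentum optimizer plus gradients adds more on top during training:

pool3_units = 7 * 7 * 256             # flattened size feeding fc1, assuming 'same' padding throughout
fc1_params = pool3_units * 4096       # roughly 51 million weights in fc1
fc1_megabytes = fc1_params * 4 / 1e6  # about 206 MB just for fc1's float32 weights
print(fc1_megabytes)                  # training keeps momentum slots and gradients on top of this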

samjabrahams commented 8 years ago

If the goal is to use the Raspberry Pi to classify the flowers, I would suggest training the model on your desktop computer, saving/exporting it, and then loading that trained model onto the RPi. Check out the official how-to here.
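A minimal sketch of that save-then-restore flow using plain TensorFlow's tf.train.Saver (the variable, path, and tensor names here are placeholders, and the real graph has to be rebuilt identically on the Pi before restoring):

import tensorflow as tf

# toy stand-in for the real network: any trainable variables will do
weights = tf.Variable(tf.zeros([10, 2]), name="weights")
saver = tf.train.Saver()

# --- on the desktop, after training ---
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ... training loop would go here ...
    saver.save(sess, "/tmp/model.ckpt")

# --- on the Raspberry Pi ---
# rebuild the exact same graph definition, then restore the trained values
with tf.Session() as sess:
    saver.restore(sess, "/tmp/model.ckpt")
    # run only the forward pass here, e.g. sess.run(predictions, feed_dict={...})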

Maybe I'll make some sort of baby TensorFlow Serving server for running pre-trained models on RPi at some point.

ArnoXf commented 8 years ago

Oh no, actually I already trained the model on my desktop machine using a GTX 750 Ti. I saved the model there using TFLearn's model.save("my_model"), then transferred the saved weights file to the Pi, built the network architecture there (as described in my first post), and loaded the weights using model.load("my_model"). I don't want to train on the Pi, just load the model and predict single images (which already works on my desktop machine).
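For reference, the flow being described is roughly the following (a sketch that assumes the network definition from the first post, plus your own training data X, Y and a preprocessed 224x224x3 array single_image):

import tflearn

# on the desktop: wrap the graph, train, and save the weights
model = tflearn.DNN(network)
model.fit(X, Y, n_epoch=10)
model.save("my_model")

# on the Pi: rebuild the same architecture, load the saved weights, and predict
model = tflearn.DNN(network)
model.load("my_model")
print(model.predict([single_image]))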

samjabrahams commented 8 years ago

Great- glad to hear you're already doing that! Next question: when you use model.save() and model.load(), are you including the last line of your code?

network = regression(fc3, optimizer='momentum', loss='categorical_crossentropy',
                     learning_rate=0.01)

Even if you pre-trained your weights, that line is going to cause your model to continue training when you run it on your Pi and not just feed values forward.

Apologies if you've already tried the things I'm suggesting: I don't know what you've previously attempted, so I'm trying to get a better understanding of where we stand.

mrubashkin-svds commented 8 years ago

Hey Sam, thanks for answering @ArnoXf's questions. Have you been able to successfully train any model, in part or in whole, on the Pi 3? If not, do you know of any other places where things like "learning_rate" need to be turned off to avoid errors while building the model?

samjabrahams commented 8 years ago

Hi @mrubashkin-svds - are you referring to AlexNet, or any model in general? I've done toy training on the RPi to make sure that the TensorFlow binaries work properly, but I haven't done any significant training on large models with the Raspberry Pi.

I don't have much experience with TFLearn, so I'm not sure how it runs Sessions, but the main thing is not to pass any optimizer operations to Session.run(); otherwise it has to store a huge amount of data in memory.
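In plain TensorFlow terms, the difference looks roughly like this (a toy TF 1.x-style sketch; the tensors and shapes here are stand-ins, not the actual AlexNet graph):

import tensorflow as tf

# tiny stand-in graph
inputs = tf.placeholder(tf.float32, [None, 4])
labels = tf.placeholder(tf.float32, [None, 2])
logits = tf.layers.dense(inputs, 2)
loss = tf.losses.softmax_cross_entropy(labels, logits)
train_op = tf.train.MomentumOptimizer(0.01, 0.9).minimize(loss)
predictions = tf.nn.softmax(logits)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # training step: fetching train_op forces activations to be kept for gradients
    sess.run(train_op, feed_dict={inputs: [[1., 2., 3., 4.]], labels: [[1., 0.]]})
    # inference: fetch only the output tensor, so it's just a forward pass
    print(sess.run(predictions, feed_dict={inputs: [[1., 2., 3., 4.]]}))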

mrubashkin-svds commented 8 years ago

Hey @samjabrahams, thank you for the input! I've been working with Inception-v3 specifically (no luck with AlexNet) over the past few days, and while I was unable to build any model on the Pi, I was able to persist an 85 MB model in memory and evaluate single images against it in near real-time (~7 seconds of processing per image).

One more question if you have the time: do you have any suggestions for speeding up the processing time? The time seems to be independent of the picture size (i.e. the same amount of time for a 24x24 image as for a 240x240 one). Thanks again @samjabrahams!!

samjabrahams commented 8 years ago

No problem! I believe that the Inception model resizes images automatically, which is why tiny images have the same compute time as huge ones. Getting the model to run faster is something that a fair number of people are currently working on. Here's a short list of things that may be causing the slowdown on the RPi compared to other computers running Inception on CPU:

I'm probably forgetting several important factors, but that's something to start from. Here are some ways that one might try to alleviate these issues:

danbri commented 7 years ago

@samjabrahams on that last point, did you have any suggestions regarding quantization?