
CNN training is slow on large inputs #798

Closed mmmikael closed 8 years ago

mmmikael commented 8 years ago

There is a noticeable speed difference for training VGG-like networks between Keras and Caffe on large inputs.

Here is a toy network I am using for benchmarking (e.g. the first layers of VGG nets):

from keras.models import Sequential
from keras.layers.convolutional import Convolution2D
from keras.layers.core import Permute, Activation

s = (850, 649)  # input image size
layers = [
    Convolution2D(64, 3, 3, activation='relu', border_mode='same', input_shape=(1, s[0], s[1])),
    Convolution2D(64, 3, 3, activation='relu', border_mode='same'),
    Convolution2D(21, 3, 3, activation='relu', border_mode='same'),
    Permute((2, 3, 1)),
    Activation('softmax')
]
model = Sequential()
for l in layers:
    model.add(l)

In this example, the outputs are images of the same size as the inputs, and I use a slightly modified categorical_crossentropy loss that sums over the spatial dimensions.
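The modified loss itself is in the full script linked below; as a rough sketch only (the name and exact reduction are illustrative, not the author's code), with the Theano backend and tensors of shape (batch, rows, cols, nb_classes) after the Permute, it could look like:

import theano.tensor as T

def categorical_crossentropy_2d(y_true, y_pred):
    # Per-pixel cross-entropy over the class axis, summed over the spatial dims.
    y_pred = T.clip(y_pred, 1e-7, 1.0 - 1e-7)     # avoid log(0)
    cce = -T.sum(y_true * T.log(y_pred), axis=3)  # -> (batch, rows, cols)
    return T.sum(cce, axis=[1, 2])                # one value per sample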

The benchmark is run as follows:

# `now()` is presumably datetime.datetime.now; x, y, n_images and
# categorical_crossentropy_2d are defined in the full script linked below.

# compile
t = now()
model.compile(optimizer='adagrad', loss=categorical_crossentropy_2d)
print('Compilation time %s' % (now() - t))

# train
t = now()
model.fit(x, y, batch_size=1, nb_epoch=1)
print('Training time for %d images (batch_size=1): %s' % (n_images, (now() - t)))
t = now()
model.fit(x, y, batch_size=n_images, nb_epoch=1)
print('Training time for %d images (batch_size=%s): %s' % (n_images, n_images, (now() - t)))

# predict
t = now()
model.predict(x)
print('Prediction time for %d images: %s' % (n_images, (now() - t)))

with the following output:

Compilation time 0:00:18.592734
Epoch 1/1
10/10 [==============================] - 15s - loss: 17846727.2000    
Training time for 10 images (batch_size=1): 0:00:15.480930
Epoch 1/1
10/10 [==============================] - 9s - loss: 176363904.0000
Training time for 10 images (batch_size=10): 0:00:09.365067
Prediction time for 10 images: 0:00:00.774903

For some reason, there is a huge difference between training and prediction time.

The complete Python file is available here.

The same training with Caffe is much faster, around 2.5 seconds for 10 images:

Average Forward pass: 35.1625 ms.
Average Backward pass: 171.409 ms.
Average Forward-Backward: 206.647 ms.
fchollet commented 8 years ago

The times reported do not represent the same thing and are not comparable. If you want to do a side-by-side comparison, time the full training end-to-end (and make sure it takes at least a minute or two).

If you're running on GPU, you will also want to use reasonable batch sizes. In Keras, data is loaded onto the GPU one batch at a time, so you should size your batches so that 1) the computation time of a batch is much larger than the time it takes to load it onto the GPU, i.e. the batch is large enough, and 2) the batch fits in GPU memory.

For a VGG network, a batch size of 64 or 128 generally works well.

mmmikael commented 8 years ago

The times reported do not represent the same thing and are not comparable. If you want to do a side-by-side comparison, time the full training end-to-end (and make sure it takes at least a minute or two).

The 2.5 seconds is actually the full training time in Caffe. I noticed the speed difference on bigger networks that take much more time. I put a simpler and shorter example here to make it easier to reproduce.

If you're running on GPU, you will also want to use reasonable batch sizes. In Keras, data is loaded onto the GPU one batch at a time, so you should size your batches so that 1) the computation time of a batch is much larger than the time it takes to load it onto the GPU, i.e. the batch is large enough, and 2) the batch fits in GPU memory.

For a VGG network, a batch size of 64 or 128 generally works well.

I am indeed using the GPU, but with the size of the input images in this example (850x649), a batch of 10 images is already pretty big. VGG networks are trained on 224x224 inputs. The transfer of the input images to GPU memory does not seem to be the reason, since prediction time is comparatively low.

Could it be that some intermediate feature maps are transferred back and forth between CPU & GPU memory during training? Those feature maps are relatively big: 850x649x64.

fchollet commented 8 years ago

For a regular VGG net, the entire computation graph will be running on GPU end-to-end.

I do not believe that 10 images is a reasonable batch size in this case. Most likely that is your problem right there.

You can try profiling your code to see which Theano operations are taking the most time: http://deeplearning.net/software/theano/tutorial/profiling.html
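A minimal way to switch the profiler on for the benchmark above (a sketch assuming the Theano backend; the flag has to be in place before Theano is imported):

import os

# Equivalent to running the script with `THEANO_FLAGS=profile=True python ...`;
# Theano prints a per-Op timing summary when the process exits.
os.environ['THEANO_FLAGS'] = 'profile=True'

import theano  # must come after the flag is set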


mmmikael commented 8 years ago

I do not believe that 10 images is a reasonable batch size in this case. Most likely that is your problem right there.

Training a 10-image batch already takes 7GB on the GPU.
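As a rough back-of-the-envelope check (illustrative numbers only), the forward activations alone already account for a few GB at this resolution, so several GB during training, which also keeps gradients and intermediate copies, is plausible:

# float32 activations kept by the three conv layers for one 850x649 image
rows, cols = 850, 649
feature_maps = 64 + 64 + 21
bytes_per_image = rows * cols * feature_maps * 4
print('%.2f GB for a batch of 10' % (10 * bytes_per_image / 1024.0 ** 3))
# ~3 GB of forward activations; backprop buffers and workspace add the rest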

You can try profiling your code to see which Theano operations are taking the most time: http://deeplearning.net/software/theano/tutorial/profiling.html

Thanks, I will give this a try.

Do you have any idea how to explain the difference between training and prediction times: 9.4 sec. vs 0.8 sec. for the same number of images?

mmmikael commented 8 years ago

Here is the profiling https://gist.github.com/mmmikael/73b62a911748a33f17ef.

Nothing suspicious to me at first sight.

lemuriandezapada commented 8 years ago

Well, training also has to backpropagate the errors and apply the gradient updates. Maybe there's something in how that gets done.

mmmikael commented 8 years ago

I was able to get the time down from 10 to 4 seconds by flattening the tensor before the loss and disabling CuDNN.
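For reference, a minimal sketch of the "flatten before the loss" idea (illustrative only, not the author's exact change): collapsing everything except the class axis lets the loss work on a plain 2D tensor instead of indexing into a 4D one.

import theano.tensor as T

def categorical_crossentropy_2d_flat(y_true, y_pred):
    # Collapse (batch, rows, cols, nb_classes) -> (batch*rows*cols, nb_classes)
    # so the standard 2D cross-entropy can be applied directly.
    y_true_flat = y_true.reshape((-1, y_true.shape[3]))
    y_pred_flat = T.clip(y_pred.reshape((-1, y_pred.shape[3])), 1e-7, 1.0 - 1e-7)
    return T.mean(T.nnet.categorical_crossentropy(y_pred_flat, y_true_flat))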

In some cases CuDNN can be slower, in particular for small batches: https://devtalk.nvidia.com/default/topic/875666/cudnn-may-be-slower-/

nouiz commented 8 years ago

Just a small explanation of why the flatten helped.

The profile contains two lines that have a CPU-only implementation:

16.1%   68.6%   1.635s   1.63e+00s   Py   1   1   theano.tensor.subtensor.AdvancedIncSubtensor
 5.9%   74.5%   0.600s   2.00e-01s   Py   3   3   theano.tensor.subtensor.AdvancedSubtensor

Theano does not support every type of advanced subtensor operation on the GPU. The flatten would have changed the code to use a version that is supported on the GPU.


EderSantana commented 8 years ago

@mmmikael how did you disable cuDNN? Programmatically or by changing .theanorc?

mmmikael commented 8 years ago

@nouiz thanks for the details!

@EderSantana I used THEANO_FLAGS=optimizer_excluding=cudnn and changed the test in convolution.py to use T.nnet.conv.conv2d.
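For completeness, a sketch of how that flag can be set from Python rather than the shell (the convolution.py edit is a manual change to the Keras source and is not shown here):

import os

# THEANO_FLAGS is read when Theano is first imported, so set it before
# importing keras/theano to exclude the cuDNN graph optimizations.
os.environ['THEANO_FLAGS'] = 'optimizer_excluding=cudnn'

import theano  # must come after the flag is set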