Closed: mmmikael closed this issue 8 years ago.
The times reported do not represent the same thing and are not comparable. If you want to do a side-by-side comparison, time the full training end-to-end (and make sure it takes at least a minute or two).
If you're running on GPU, you will also want to use reasonable batch sizes. In Keras, data is loaded onto the GPU one batch at a time, so you should size your batch such that 1) the computation time of the batch is much larger than the time it takes to load it onto the GPU, i.e. the batch is large enough, and 2) the batch fits in GPU memory.
For a VGG network, a batch size of 64 or 128 generally works well.
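One simple way to follow this advice is to wrap the whole training loop in a wall-clock timer; a minimal sketch, where `train_one_epoch` is a hypothetical stand-in for the real training call (e.g. `model.fit`):

```python
import time

def train_one_epoch():
    # Hypothetical stand-in for the real training step;
    # here it just sleeps to simulate work.
    time.sleep(0.01)

start = time.perf_counter()
for _ in range(10):          # run long enough that startup/compilation
    train_one_epoch()        # cost is amortized away
elapsed = time.perf_counter() - start
print(f"end-to-end training time: {elapsed:.2f} s")
```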
> The times reported do not represent the same thing and are not comparable. If you want to do a side-by-side comparison, time the full training end-to-end (and make sure it takes at least a minute or two).
The 2.5 seconds is actually the full training time in Caffe. I noticed the speed difference on bigger networks that take much more time. I put a simpler and shorter example here to make it easier to reproduce.
> If you're running on GPU, you will also want to use reasonable batch sizes. In Keras, data is loaded onto the GPU one batch at a time, so you should size your batch such that 1) the computation time of the batch is much larger than the time it takes to load it onto the GPU, i.e. the batch is large enough, and 2) the batch fits in GPU memory.
> For a VGG network, a batch size of 64 or 128 generally works well.
I am indeed using the GPU but with the size of input images in this example (850x649), a batch of 10 images is already pretty big. VGG networks are trained on 224x224 inputs. The transfer of the input images to GPU memory does not seem to be the reason since prediction time is quite low comparatively.
Could it be that some intermediate feature maps are transferred back and forth between CPU and GPU memory during training? Those feature maps are relatively big: 850x649x64.
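A quick back-of-envelope check supports this concern: a single 850x649x64 feature map in float32 (an assumption about the dtype) is already about 141 MB, so a batch of 10 occupies over 1.4 GB for that one layer's activations alone, before counting the other layers' activations kept around for backprop:

```python
# Back-of-envelope activation-memory estimate (assumes float32, i.e.
# 4 bytes per value, and ignores framework overhead and weights).
bytes_per_value = 4
h, w, channels = 649, 850, 64
batch = 10

per_image = h * w * channels * bytes_per_value   # one feature map
per_batch = per_image * batch                    # that layer, whole batch

print(f"one feature map: {per_image / 1e6:.1f} MB")   # ~141.2 MB
print(f"batch of 10:     {per_batch / 1e9:.2f} GB")   # ~1.41 GB
```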
For a regular VGG net, the entire computation graph will be running on GPU end-to-end.
I do not believe that 10 images is a reasonable batch size in this case. Most likely that is your problem right there.
You can try profiling your code to see which Theano operations are taking the most time: http://deeplearning.net/software/theano/tutorial/profiling.html
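For reference, the profiler can be switched on from the environment without code changes (a sketch; `train.py` is a placeholder for the actual script):

```shell
# Enable Theano's profiler; a per-op timing summary is printed
# when the process exits. "train.py" is a placeholder name.
THEANO_FLAGS=profile=True python train.py
```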
> I do not believe that 10 images is a reasonable batch size in this case. Most likely that is your problem right there.
Training a 10-image batch already takes 7GB on the GPU.
> You can try profiling your code to see which Theano operations are taking the most time: http://deeplearning.net/software/theano/tutorial/profiling.html
Thanks, I will give this a try.
Do you have any idea how to explain the difference between training and prediction times: 9.4 sec. vs 0.8 sec. for the same number of images?
Here is the profiling https://gist.github.com/mmmikael/73b62a911748a33f17ef.
Nothing suspicious to me at first sight.
Well, training also has to propagate the errors and update the gradients. Maybe there's something in how that gets done.
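A rough rule of thumb for conv nets is that the backward pass costs about twice the forward pass (gradients with respect to activations plus gradients with respect to weights), so training a batch should take roughly 3x prediction. The numbers reported in this thread are far above that, which is consistent with something other than raw GPU compute (e.g. a CPU fallback) eating the time; a sketch of the arithmetic:

```python
# Rule of thumb: backward pass ~2x forward, so training ~3x prediction.
prediction = 0.8            # measured prediction time (s), from the thread
training = 9.4              # measured training time (s), from the thread

expected = 3 * prediction   # what the rule of thumb predicts
ratio = training / prediction

print(f"expected ~{expected:.1f} s, measured {training} s "
      f"({ratio:.1f}x prediction)")   # ~11.8x, far above ~3x
```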
I was able to get the time down from 10 to 4 seconds by flattening the tensor before the loss and disabling CuDNN.
In some cases CuDNN can be slower, in particular for small batches: https://devtalk.nvidia.com/default/topic/875666/cudnn-may-be-slower-/
Just a small explanation of why the flatten helped.
The profile contains two lines that have a CPU implementation:
16.1%  68.6%  1.635s  1.63e+00s  Py  1  1  theano.tensor.subtensor.AdvancedIncSubtensor
 5.9%  74.5%  0.600s  2.00e-01s  Py  3  3  theano.tensor.subtensor.AdvancedSubtensor
Theano does not support all types of advanced subtensor operations on the GPU. The flatten changed the code to use a version that is supported on the GPU.
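A small numpy illustration (not the actual Theano graph) of why the flatten is safe: collapsing the spatial dims into the batch axis before a categorical crossentropy leaves the loss value unchanged; it only changes the indexing pattern the framework has to compile:

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, w, c = 2, 4, 4, 3                     # toy sizes (hypothetical)

# Softmax-like predictions and one-hot targets, shape (batch, h, w, classes).
p = rng.random((n, h, w, c))
p /= p.sum(axis=-1, keepdims=True)
t = np.eye(c)[rng.integers(0, c, size=(n, h, w))]

# Crossentropy on the 4D tensors, summed over the spatial dims.
loss_4d = -(t * np.log(p)).sum()

# Same loss after flattening the spatial dims into the batch axis.
loss_flat = -(t.reshape(-1, c) * np.log(p.reshape(-1, c))).sum()

assert np.allclose(loss_4d, loss_flat)      # identical value either way
```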
@mmmikael how did you disable cuDNN? Programmatically or changing theanorc?
@nouiz thanks for the details!
@EderSantana I used `THEANO_FLAGS=optimizer_excluding=cudnn` and changed the test in convolution.py to use `T.nnet.conv.conv2d`.
There is a noticeable speed difference for training VGG-like networks between Keras and Caffe on large inputs.
Here is a toy network I am using for benchmarking (e.g. the first layers of VGG nets):
In this example, outputs are images of the same size as the inputs, and I use a slightly modified `categorical_crossentropy` loss to sum over the spatial dims. The benchmark is run as follows:
with outputs:
For some reason, there is a huge difference between training and prediction time.
The complete Python file is available here.
The same training with Caffe is much faster, around 2.5 seconds for 10 images: