vlfeat / matconvnet

MatConvNet: CNNs for MATLAB

Remove the training bottleneck by reading images directly into CUDA memory #424

Closed: jxwuyi closed this issue 8 years ago

jxwuyi commented 8 years ago

The training speed of MatConvNet is limited by the getBatch function; more specifically, by the data transfer from CPU memory to GPU memory.
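For context, here is a minimal sketch of the kind of getBatch function in question (hypothetical field names, and it assumes the images have already been resized to a common shape). The explicit gpuArray call is where the CPU-to-GPU copy happens:

```matlab
% Hypothetical getBatch sketch: images are decoded on the CPU and only
% then copied to the GPU, once per batch.
function [im, labels] = getBatch(imdb, batch)
  files = imdb.images.name(batch) ;   % paths of this batch's images
  ims = vl_imreadjpeg(files) ;        % CPU-side JPEG decoding (returns a cell array)
  im = cat(4, ims{:}) ;               % pack into an H x W x 3 x N array
  im = gpuArray(im) ;                 % CPU -> GPU copy: the suspected bottleneck
  labels = imdb.images.label(batch) ;
end
```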

On a K40 GPU, Caffe trains AlexNet at 300 images/second (cuDNN v4), while MatConvNet manages only 185 images/second. I ran some experiments to find the bottleneck, and it turns out that the data transfer after vl_imreadjpeg is the main cost.

I suggest changing vl_imreadjpeg so that it can read images directly into GPU memory. I ran a simulated experiment: replacing the whole getBatch function with a gpuArray.rand(227,227,3,256) random-number generator raises MatConvNet's speed to 260 images/second. That shows the potential speedup could be huge.
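A sketch of that simulated experiment (hypothetical function name and label range, purely to illustrate the idea):

```matlab
% Hypothetical stand-in for getBatch: the batch is fabricated directly in
% GPU memory, so no CPU -> GPU transfer takes place at all.
function [im, labels] = getBatchFake(imdb, batch)
  im = gpuArray.rand(227, 227, 3, numel(batch), 'single') ;  % data born on the GPU
  labels = randi(1000, 1, numel(batch)) ;                    % random labels
end
```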

okvol commented 8 years ago

Good suggestion. Looking forward to a solution.

baiyancheng20 commented 8 years ago

@jxwuyi Could you share a detailed implementation? When I used vl_imreadjpeg, I found it was even slower than imread (the built-in MATLAB function) inside a parfor loop.
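One rough way to check such a comparison (hypothetical file list; actual timings depend heavily on disk speed, thread count, and image sizes):

```matlab
% Hypothetical benchmark comparing the two readers on the same files.
files = arrayfun(@(i) sprintf('images/%06d.jpg', i), 1:256, ...
                 'UniformOutput', false) ;

tic ;
ims = cell(1, numel(files)) ;
parfor i = 1:numel(files)
  ims{i} = imread(files{i}) ;   % MATLAB's built-in decoder, one file per worker
end
t1 = toc ;

tic ;
ims = vl_imreadjpeg(files) ;    % MatConvNet's multi-threaded decoder
t2 = toc ;

fprintf('imread + parfor: %.2f s, vl_imreadjpeg: %.2f s\n', t1, t2) ;
```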

pxnguyen commented 8 years ago

@baiyancheng20 Did you compile vl_imreadjpeg correctly, with GPU support? It is a lot faster than imread for me.

baiyancheng20 commented 8 years ago

@pxnguyen I used the following command. Could you help me find the error?

```matlab
vl_compilenn('enableGpu', true, ...
             'cudaMethod', 'nvcc', ...
             'cudaRoot', '/usr/local/cuda-7.0', ...
             'enableCudnn', true, ...
             'cudnnRoot', '/usr/local/cuda-7.0') ;
```

lenck commented 8 years ago

Hi, the newest version of vl_imreadjpeg does this (since beta-21).
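For reference, a hedged sketch of how the newer API can be used (option names taken from the vl_imreadjpeg documentation; check `help vl_imreadjpeg` for the exact set supported by your version):

```matlab
% Prefetch the next batch asynchronously while the current step runs,
% then collect it straight into GPU memory.
files = imdb.images.name(batch) ;                    % hypothetical file list
vl_imreadjpeg(files, 'Prefetch', 'Gpu', ...
              'Pack', 'Resize', [227 227]) ;         % starts background decoding
% ... run the previous training step here ...
data = vl_imreadjpeg(files, 'Gpu', ...
                     'Pack', 'Resize', [227 227]) ;  % collects the decoded batch
im = data{1} ;  % a single 227 x 227 x 3 x N gpuArray; no explicit gpuArray() copy
```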