cnuernber opened this issue:
One thing I have found so far: if your net is running in floating-point mode, make sure your dataset produces floating-point arrays. This can be a dramatic speed increase (roughly a factor of 100).
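As a concrete illustration, here is a minimal Java sketch of doing that conversion once up front rather than per batch; `FloatDataset` and `toFloats` are hypothetical names, not part of any existing API:

```java
/**
 * Hypothetical helper: convert a dataset element to float[] once, up front,
 * so the training loop never has to widen/narrow per batch.
 */
public final class FloatDataset {
    private FloatDataset() {}

    public static float[] toFloats(double[] element) {
        float[] out = new float[element.length];
        for (int i = 0; i < element.length; i++) {
            out[i] = (float) element[i];
        }
        return out;
    }
}
```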
@calder @charlesg3
The intent of this issue is to characterize the problem while leaving the choice of implementation strategies wide open. If the problem is characterized in a way that arbitrarily narrows the choice of implementations, then it is mis-characterized.
We would like to reduce the time it takes to put data onto the GPU and pull it off. We would also like a set of standard automatic augmentations that can ideally be performed inline with loading the image (crop, flip, translate, scale, rotate, potentially color space transformation). Inline means during training and not as a preprocess step; we would like our networks to never see the exact same image twice during training.
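A minimal sketch of what inline, random loss-invariant augmentation could look like using plain `java.awt` image operations (OpenCV may well do the same thing faster); the class and method names are made up for illustration, and the crop size is assumed to be no larger than the source image:

```java
import java.awt.geom.AffineTransform;
import java.awt.image.AffineTransformOp;
import java.awt.image.BufferedImage;
import java.util.Random;

/** Hypothetical inline augmenter: each call returns a different random view of the image. */
public final class RandomAugment {
    private final Random rng = new Random();

    public BufferedImage apply(BufferedImage src, int cropWidth, int cropHeight) {
        // Random crop: pick a random top-left corner that keeps the crop inside the image.
        int x = rng.nextInt(src.getWidth() - cropWidth + 1);
        int y = rng.nextInt(src.getHeight() - cropHeight + 1);
        BufferedImage crop = src.getSubimage(x, y, cropWidth, cropHeight);

        // Random horizontal flip half the time.
        if (rng.nextBoolean()) {
            AffineTransform flip = AffineTransform.getScaleInstance(-1, 1);
            flip.translate(-crop.getWidth(), 0);
            AffineTransformOp op =
                new AffineTransformOp(flip, AffineTransformOp.TYPE_NEAREST_NEIGHBOR);
            crop = op.filter(crop, null);
        }
        return crop;
    }
}
```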
Most GPU-based neural networks tend not to reach full utilization of the GPU, at least in part because getting the data onto and off of the GPU effectively throttles training/inference.
Because we have a few people working on networks that do image analysis, and it seems this will continue for the near future, it would be good to invest some time building out tools and a system for this.
Setting some baselines: assume 10,000 images of 256 by 256, with an output of 1000 float/double numbers.
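For scale, assuming 3-channel RGB, that is 10,000 × 256 × 256 × 3 ≈ 2.0 GB of raw bytes, or roughly 7.9 GB if the pixels are stored as single-precision floats.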
If we can get inline loading of images (meaning we do not need to write out a specialized file) working fast enough on a normal computer to get through 10,000 images, we should be able to avoid the specialized file. So the first step is: can we load 10,000 images in under roughly 10 seconds on the CPU? Ideally under 5, because we would also like to apply some elementary operations to augment datasets, so having another 5 seconds to apply random loss-invariant transformations would be ideal.
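A rough benchmark sketch for that first question, assuming a directory of images and using `ImageIO` plus a thread pool; timing will obviously vary with image format and disk:

```java
import javax.imageio.ImageIO;
import java.io.File;
import java.util.concurrent.*;

/** Rough benchmark: how long does it take to decode every image in a directory on the CPU? */
public final class LoadBenchmark {
    public static void main(String[] args) throws Exception {
        File[] files = new File(args[0]).listFiles();   // e.g. a directory of ~10,000 images
        ExecutorService pool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

        long start = System.nanoTime();
        CompletionService<Void> cs = new ExecutorCompletionService<>(pool);
        for (File f : files) {
            cs.submit(() -> { ImageIO.read(f); return null; });
        }
        for (int i = 0; i < files.length; i++) {
            cs.take().get();                            // propagate any decode errors
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println("Decoded " + files.length + " images in " + elapsedMs + " ms");
        pool.shutdown();
    }
}
```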
You could also write these images into a memory-mapped file (of bytes or floats) and load that file. There is a solid chance that OpenCV implements the desired transformations in considerably less time than we could implement them in Java, but there is also a chance that is false.
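If we do go the memory-mapped route, here is a minimal sketch of a fixed-record-size float file using `FileChannel.map`; note that a single Java mapping is capped at 2 GB, so a full float dataset would need multiple mapped regions. The layout and class name are assumptions:

```java
import java.io.IOException;
import java.nio.FloatBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical layout: one fixed-size float record per image, back to back in a single file. */
public final class MappedImageStore {
    private final FloatBuffer floats;
    private final int floatsPerImage;

    public MappedImageStore(Path file, int imageCount, int floatsPerImage) throws IOException {
        this.floatsPerImage = floatsPerImage;
        long bytes = (long) imageCount * floatsPerImage * Float.BYTES;   // must stay under 2 GB per mapping
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Map the whole region; the OS pages it in lazily as images are touched.
            this.floats = ch.map(FileChannel.MapMode.READ_WRITE, 0, bytes).asFloatBuffer();
        }
    }

    public void write(int index, float[] pixels) {
        FloatBuffer view = floats.duplicate();   // independent position, shared storage
        view.position(index * floatsPerImage);
        view.put(pixels, 0, floatsPerImage);
    }

    public void read(int index, float[] dest) {
        FloatBuffer view = floats.duplicate();
        view.position(index * floatsPerImage);
        view.get(dest, 0, floatsPerImage);
    }
}
```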
The worst scenario would be to write out a binary file post-transformation. That would mean our nets could potentially learn the specific transformations, which we certainly do not want.
Then we need to shuffle data onto the GPU into a coalesced buffer for a batch size of, say, 64-128. We also need a similar system to shuffle the 1000 doubles off the GPU to the CPU with the same batch size and perform some analysis on those vectors (e.g. generate loss, softmax accuracy, etc.).
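A host-side sketch of that coalescing step: pack a batch into one contiguous direct buffer so the host-to-device transfer is a single large copy instead of 64-128 small ones. The actual device copy depends on whichever GPU binding we end up with, so it is left as a comment; the class name is made up:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

/** Coalesce a batch of per-image float arrays into one direct buffer before the host->device copy. */
public final class BatchStaging {
    private final FloatBuffer staging;
    private final int floatsPerImage;

    public BatchStaging(int batchSize, int floatsPerImage) {
        this.floatsPerImage = floatsPerImage;
        // One contiguous direct buffer in native byte order so the GPU sees raw floats.
        this.staging = ByteBuffer.allocateDirect(batchSize * floatsPerImage * Float.BYTES)
                                 .order(ByteOrder.nativeOrder())
                                 .asFloatBuffer();
    }

    /** Pack one batch (batchSize arrays of floatsPerImage) into the staging buffer. */
    public FloatBuffer pack(float[][] batch) {
        staging.clear();
        for (float[] image : batch) {
            staging.put(image, 0, floatsPerImage);
        }
        staging.flip();
        // At this point the whole batch is one contiguous region; hand `staging` to the
        // GPU binding's host->device copy (a single memcpy-style call) rather than
        // issuing one small transfer per image. The reverse path (pulling the batch of
        // 1000-float outputs back) can reuse the same idea with a second staging buffer.
        return staging;
    }
}
```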