mapbox / robosat

Semantic segmentation on aerial and satellite imagery. Extracts features such as: buildings, parking lots, roads, water, clouds
MIT License

Benchmark using pinned CUDA memory in data loaders #21

Open daniel-j-h opened 6 years ago

daniel-j-h commented 6 years ago

At the moment the data loaders load images from the dataset, do pre-processing (like normalization), and then convert the images into tensors. Then we copy the data from CPU memory to GPU memory. This can be made more efficient by putting the data into page-locked (pinned) memory and using DMA to copy it onto the GPU asynchronously.

Look into functionality for pinning memory and asynchronous, non-blocking data transfers:

Note: the last time we used this we ran into some PyTorch-internal deadlocks. We need to carefully evaluate this, benchmark it, and figure out if it makes sense to go this route.
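For context, a minimal sketch (not robosat's actual loader code; the dataset here is a hypothetical stand-in) of what the pinned, non-blocking transfer path looks like in PyTorch:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for robosat's tile dataset.
images = torch.randn(64, 3, 128, 128)
masks = torch.randint(0, 2, (64, 128, 128))
dataset = TensorDataset(images, masks)

# pin_memory=True makes the loader place each batch in page-locked memory,
# which is what allows the GPU to DMA it in without an extra staging copy.
loader = DataLoader(dataset, batch_size=8, num_workers=4, pin_memory=True)

device = torch.device("cuda")

for image_batch, mask_batch in loader:
    # non_blocking=True only overlaps the copy with compute when the source
    # tensor is actually pinned; otherwise it degrades to a blocking copy.
    image_batch = image_batch.to(device, non_blocking=True)
    mask_batch = mask_batch.to(device, non_blocking=True)
    # ... forward/backward pass on image_batch and mask_batch ...
```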

Tasks:

daniel-j-h commented 5 years ago

I tested pinned memory with DMA copies in non-blocking mode, but I did not see any improvements on my 6x GTX 1080 Ti rig, where the bus seems to be the limiting factor.

Leaving this open in case anyone has a sandbox environment to see if it improves things.
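For anyone with such an environment, here is a rough, hypothetical timing sketch (not the exact benchmark used above) comparing pageable vs. pinned host-to-device copies:

```python
import torch

def time_h2d_copy(pinned: bool, n_iters: int = 50) -> float:
    """Return the average milliseconds for a host-to-device copy of one batch."""
    batch = torch.randn(8, 3, 512, 512)
    if pinned:
        batch = batch.pin_memory()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(n_iters):
        batch.to("cuda", non_blocking=True)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / n_iters

print("pageable:", time_h2d_copy(pinned=False), "ms")
print("pinned:  ", time_h2d_copy(pinned=True), "ms")
```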

To reproduce

ocourtin commented 5 years ago

@daniel-j-h I confirm I didn't see any significant performance improvement through the PyTorch CUDA pinned-memory setting either.

On the other hand, among the points identified as mattering for training:

HTH,

daniel-j-h commented 5 years ago

Here is why we didn't see any improvement with pin_memory=True:

Per https://pytorch.org/docs/stable/data.html#memory-pinning

The default memory pinning logic only recognizes Tensors and maps and iterables containing Tensors. By default, if the pinning logic sees a batch that is a custom type (which will occur if you have a collate_fn that returns a custom batch type), or if each element of your batch is a custom type, the pinning logic will not recognize them, and it will return that batch (or those elements) without pinning the memory. To enable memory pinning for custom batch or data type(s), define a pin_memory() method on your custom type(s).

Therefore, even if we set pin_memory=True, the loader just returns our batches unpinned; it fails silently.

The solution is to write a custom batch type with its own pin_memory() method that pins its tensors.
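A minimal sketch, closely following the example in the PyTorch docs linked above; the class, field, and collate function names here are illustrative, not robosat's actual code:

```python
import torch
from torch.utils.data import DataLoader

class SegmentationBatch:
    """Hypothetical custom batch type bundling images and masks."""

    def __init__(self, samples):
        images, masks = zip(*samples)
        self.images = torch.stack(images)
        self.masks = torch.stack(masks)

    def pin_memory(self):
        # Called by the DataLoader when pin_memory=True; without this method
        # custom batch types are returned with their memory unpinned.
        self.images = self.images.pin_memory()
        self.masks = self.masks.pin_memory()
        return self

def collate_fn(samples):
    return SegmentationBatch(samples)

# loader = DataLoader(dataset, batch_size=8, collate_fn=collate_fn, pin_memory=True)
```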

daniel-j-h commented 5 years ago

@ocourtin re.

On the other hand, among the points identified as mattering for training:

* Several DataLoader worker processes (to make sure the GPU stays at ~100% utilization)

* A more efficient data augmentation step (switching to Albumentations
  and removing tile buffering keeps accuracy but is about 3x faster)

Adding here: switching to libjpeg-turbo and Pillow-SIMD gave me a huge boost during pre-processing.

See https://github.com/mapbox/robosat/pull/180
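For reference, a hedged sketch of what the Albumentations-based augmentation step can look like; the transform list below is illustrative, not the one used in robosat or in #180:

```python
import albumentations as A
import numpy as np

# Illustrative augmentation pipeline; the actual transforms may differ.
transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomRotate90(p=0.5),
    A.RandomBrightnessContrast(p=0.2),
])

# Albumentations works on numpy arrays and applies the same spatial
# transform to image and mask in a single call.
image = np.zeros((512, 512, 3), dtype=np.uint8)
mask = np.zeros((512, 512), dtype=np.uint8)
augmented = transform(image=image, mask=mask)
image, mask = augmented["image"], augmented["mask"]
```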