twitter-archive / torch-ipc

A set of primitives for parallel computation in Torch
Apache License 2.0

WARN: torch-ipc: CUDA IPC evicting 0x431a37e800 due to ptr reuse #18

Closed: chienlinhuang1116 closed this issue 8 years ago

chienlinhuang1116 commented 8 years ago

Hi,

The following message appeared when running the MNIST dataset with 6 GPUs. Do you have any idea what causes it?

Thanks, Chien-Lin

WARN: torch-ipc: CUDA IPC evicting 0x431a37e800 due to ptr reuse, performance will drastically suffer
WARN: torch-ipc: CUDA IPC evicting 0x431a37e800 due to ptr reuse, performance will drastically suffer
zakattacktwitter commented 8 years ago

Hi,

That means your tensors are not stable between gradient computes. The GPU-to-GPU communication works by mapping one GPU's memory into the other's. If you are constantly destroying and creating tensors, then an overlapping bit of memory will most likely be reused; we need to detect this, unmap the previous tensor, and remap the new tensor. It is very expensive to do this.
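For illustration, here is a minimal sketch (plain Torch, with made-up tensor names) contrasting the allocation pattern that keeps an IPC mapping valid with the one that triggers the eviction warning:

-- Illustrative sketch only; 'grads' and 'freshGrads' are hypothetical names.
local torch = require 'torch'

-- Stable pattern: allocate the gradient buffer once and update it in place.
-- Its storage address never changes, so a CUDA IPC mapping can be reused.
local grads = torch.Tensor(1024):zero()
for step = 1, 3 do
   grads:add(1)  -- in-place update, same underlying storage
end

-- Unstable pattern: a fresh tensor every iteration lands at a new (often
-- recycled) address, forcing the unmap/remap the warning is about.
for step = 1, 3 do
   local freshGrads = torch.Tensor(1024):fill(step)  -- new storage each pass
end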

Are you sure you have the latest version of autograd? It is this call that makes autograd keep the gradients stable (examples/mnist.lua, line 91):

local df = grad(f, {
   optimize = true,              -- Generate fast code
   stableGradients = true,       -- Keep the gradient tensors stable so we can use CUDA IPC
})
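For context, here is a minimal sketch of how a df built this way is typically called; the loss function f, params, input, and target below are placeholders I made up, not names taken from examples/mnist.lua:

-- Illustrative sketch only; the loss function and parameter names are made up.
local torch = require 'torch'
local grad = require 'autograd'

-- A simple squared-error loss over a linear model.
local function f(params, input, target)
   local prediction = params.W * input + params.b
   return torch.sum(torch.pow(prediction - target, 2))
end

local df = grad(f, {
   optimize = true,              -- Generate fast code
   stableGradients = true,       -- Keep the gradient tensors stable so we can use CUDA IPC
})

local params = { W = torch.randn(1, 10), b = torch.randn(1) }
-- df returns the gradients w.r.t. params plus f's own return value; with
-- stableGradients the returned gradient tensors keep the same storage across
-- calls, which is what the CUDA IPC mapping relies on.
local grads, loss = df(params, torch.randn(10), torch.randn(1))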

It is also possible that there is some other problem. Do you only see the issue when using 6 GPUs? Are 1, 4, or 8 GPUs OK (i.e. no warnings)?

Hope this helps, Zak

chienlinhuang1116 commented 8 years ago

Thank you, Zak. Sometimes there are warnings when using 2 and 8 GPUs as well. I will try the latest version of autograd and see whether it is still a problem.

Chien-Lin

zakattacktwitter commented 8 years ago

Going to close this since it's been a while. Let me know if you are still experiencing problems.