Closed: chienlinhuang1116 closed this issue 8 years ago
Hi,
That means your tensors are not stable between gradient computations. GPU->GPU communication works by mapping one GPU's memory into the other's address space. If you are constantly destroying and creating tensors, an overlapping bit of memory will most likely be reused; we need to detect this, unmap the previous tensor, and map in the new one. Doing that is very expensive.
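As a rough illustration of the caching behaviour described above (plain Python, not actual CUDA IPC code; the class and method names here are hypothetical), reusing a stable buffer address hits the cached mapping every iteration, while a fresh allocation each iteration forces the expensive unmap/remap path every time:

```python
# Hypothetical sketch: why unstable tensor addresses are expensive for
# GPU->GPU IPC. A mapping is cached per base address; a new address that
# may overlap a stale mapping forces a teardown and remap.

class IpcMappingCache:
    """Caches peer-GPU memory mappings keyed by the buffer's base address."""

    def __init__(self):
        self.mapped = {}      # base_address -> mapping handle
        self.remap_count = 0  # number of expensive unmap/remap cycles

    def map_remote(self, base_address):
        # Fast path: the tensor kept the same address, so the existing
        # mapping is reused with no extra work.
        if base_address in self.mapped:
            return self.mapped[base_address]
        # Slow path: a stale mapping may overlap this address, so tear it
        # down and map in the new buffer (simplified: evict everything).
        self.mapped.clear()
        self.remap_count += 1
        handle = object()  # stands in for a real IPC handle
        self.mapped[base_address] = handle
        return handle

# Stable gradients: same buffer address every iteration -> one remap total.
cache = IpcMappingCache()
for _ in range(100):
    cache.map_remote(0x1000)
stable_remaps = cache.remap_count  # 1

# Unstable gradients: fresh allocation each iteration -> remap every time.
cache = IpcMappingCache()
for i in range(100):
    cache.map_remote(0x1000 + i * 0x100)
unstable_remaps = cache.remap_count  # 100
```

This is why `stableGradients = true` matters: keeping the gradient tensors at fixed addresses lets every iteration take the fast path.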
Are you sure you have the latest version of autograd? It is this line that causes autograd to keep the gradients stable (examples/mnist.lua line 91):
```lua
local df = grad(f, {
   optimize = true,         -- Generate fast code
   stableGradients = true,  -- Keep the gradient tensors stable so we can use CUDA IPC
})
```
It is also possible that there is some other problem. Do you only see the issue when using 6 GPUs, or do 1, 4, or 8 GPUs also produce warnings?
Hope this helps, Zak
Thank you Zak. Sometimes there are warnings when using 2 and 8 GPUs as well. I will try the latest version of autograd and see whether the problem persists.
Chien-Lin
Going to close this since it's been a while. Let me know if you are still experiencing problems.
Hi,
It showed the following message when running the MNIST example with 6 GPUs. Do you have any idea what causes it?
Thanks, Chien-Lin