The lines below with `torch.ones` and `torch.zeros` in the focal loss allocate tensors on the CPU and then transfer them to the GPU with `.cuda()`. This is inefficient, especially when DataParallel spans a larger number of GPUs, because the main process has to copy the data to each GPU individually.
Instead, they should be allocated directly on the GPU, e.g. with `torch.ones_like`, which copies all traits of the original tensor, including its device. In my tests, the change below gives roughly a 4x speedup on 6 GPUs with DataParallel (though I am also using a custom parallel dataloader, so your mileage may vary).
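A minimal sketch of the pattern, assuming the loss builds all-ones/all-zeros target tensors (the name `logits` and the helper function are hypothetical, not the actual focal loss code):

```python
import torch

def make_targets(logits: torch.Tensor):
    # Before: allocated on the CPU, then copied to the GPU with .cuda().
    # Under DataParallel the main process performs this host-to-device
    # copy once per replica, serialising on the CPU.
    # ones = torch.ones(logits.shape).cuda()
    # zeros = torch.zeros(logits.shape).cuda()

    # After: allocated directly on logits' device, also inheriting its
    # dtype, so each DataParallel replica creates its tensors locally.
    ones = torch.ones_like(logits)
    zeros = torch.zeros_like(logits)
    return ones, zeros
```

Because `*_like` inherits the device from the input, each replica's forward pass allocates on its own GPU with no host-to-device traffic at all.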