Open 1374839016 opened 1 year ago
Huge thanks for bringing this up!
Could you provide some more technical details on how this makes a difference? Currently, all the involved tensors will be on the same device as the first input as per:
I am hence a little confused about what the proposed fix would change. :thinking:
Sorry, I don't know, but I guess the code allocates shared memory on the default device (GPU 0).
```python
cupy_launch('kernel_Correlation_updateOutput', cupy_kernel('kernel_Correlation_updateOutput', {
    'rbot0': rbot0,
    'rbot1': rbot1,
    'top': output
}))(
    grid=tuple([ output.shape[3], output.shape[2], output.shape[0] ]),
    block=tuple([ 32, 1, 1 ]),
    shared_mem=one.shape[1] * 4,
    args=[ cupy.int32(n), rbot0.data_ptr(), rbot1.data_ptr(), output.data_ptr() ]
)
```
I fixed the memory access bug described in #55 by forcing cupy to allocate memory on the PyTorch device.
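As a minimal sketch of that idea: CuPy picks its current device independently of PyTorch, so a launch on a tensor living on `cuda:1` can still allocate on GPU 0 unless the launch is wrapped in `cupy.cuda.Device(...)`. The helper below is an assumption for illustration, not part of this package; it only parses a torch-style device string into the integer index that `cupy.cuda.Device` expects.

```python
def torch_device_index(device_str):
    """Parse a torch-style device string like 'cuda:1' into the integer
    index that cupy.cuda.Device() expects; a bare 'cuda' maps to 0."""
    _, _, idx = str(device_str).partition(':')
    return int(idx) if idx else 0

# Hypothetical usage around the kernel launch (requires cupy and a
# CUDA build of PyTorch; `rbot0` is the first input tensor):
#
# with cupy.cuda.Device(torch_device_index(rbot0.device)):
#     cupy_launch(...)(grid=..., block=..., shared_mem=..., args=[...])
```

With the launch inside the `with` block, any memory CuPy allocates (including the dynamic shared-memory setup) is tied to the same GPU as the PyTorch tensors instead of the default device.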