The project uses the PyTorch Lightning framework.
I have verified that the framework successfully moves the model to the correct device (I use cuda:0); I checked each layer's weights at every step right before running F.conv1d(input, weight, bias, self.stride, self.padding, self.dilation, self.groups).
But as soon as F.conv1d(input, weight, bias, self.stride, self.padding, self.dilation, self.groups) runs, the magic appears: I get an error. 👌
My guess is that it is caused by the CUDA memory cache or something similar. I don't know the CUDA internals well, so I didn't dig deeper.
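For context, this is roughly the check I mean (a sketch; MyConv1d below is a hypothetical stand-in for the project's own conv module, not the real code):

```python
import torch
import torch.nn.functional as F


class MyConv1d(torch.nn.Module):
    """Hypothetical stand-in for the project's custom conv layer."""

    def __init__(self):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(32, 16, 3))
        self.bias = torch.nn.Parameter(torch.randn(32))
        self.stride, self.padding, self.dilation, self.groups = 1, 1, 1, 1

    def forward(self, input):
        # These checks pass: Lightning has moved everything to cuda:0 ...
        assert input.device == torch.device("cuda:0")
        assert self.weight.device == torch.device("cuda:0")
        # ... yet this exact call is where the error is raised.
        return F.conv1d(input, self.weight, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)
```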
Thanks for the report. This looks to be an issue that is entirely within pytorch. RMM doesn't reference the cacheInfo symbol at all; all we do is tell pytorch that it should use RMM calls to allocate and deallocate memory (see https://github.com/rapidsai/rmm/blob/branch-23.12/python/rmm/allocators/torch.py).
Unfortunately, it appears that some pytorch algorithms require that the memory allocator in pytorch implement the cacheInfo interface. This interface is provided in pytorch, but there is no way for an external allocator (like RMM) to implement it. I think the reason you don't see the error until the convolution is that the request for the cacheInfo information only happens in convolutional layers: https://github.com/pytorch/pytorch/blob/9af82fa2b86fb71df503082b1960c9392f9dc66d/aten/src/ATen/native/cudnn/Conv_v7.cpp#L212
So I recommend you report an issue to pytorch, since it looks like they don't provide an interface that allows external allocators to work with pytorch programs in all cases.
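For illustration, this is essentially all an external allocator can hand to pytorch through the pluggable-allocator interface; the shared library path and symbol names below are placeholders, not RMM's:

```python
import torch

# An external allocator plugs in by naming an allocate and a deallocate
# entry point in a shared library. The interface has no hook for
# cacheInfo, so a plugin cannot satisfy code paths (like the cuDNN v7
# conv algorithm search linked above) that query it.
my_allocator = torch.cuda.memory.CUDAPluggableAllocator(
    "libmy_allocator.so",  # placeholder .so
    "my_malloc",           # placeholder alloc symbol
    "my_free",             # placeholder free symbol
)
torch.cuda.memory.change_current_allocator(my_allocator)
```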
Cool, it seems this happens in the cuDNN lib.
Because I set torch.backends.cudnn.benchmark = True to speed things up, PyTorch replaces the generic operators with cuDNN's native operators, and it seems cacheInfo is required there.
I set the flag back to False, and everything works normally again. 😊
Thank you very much.
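For anyone else hitting this, the workaround is just the one-liner below; the trade-off is losing cuDNN's per-shape algorithm autotuning:

```python
import torch

# Disable cudnn benchmark mode so PyTorch skips the cuDNN algorithm
# search that queries the allocator's cacheInfo; the RMM-backed
# allocator then works for the convolution.
torch.backends.cudnn.benchmark = False
```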
Describe the bug
I'm integrating RMM as a replacement for the default PyTorch allocator. Everything works fine in simpler scenarios. However, in my project, which involves mixed precision training, some native operators, etc., I'm encountering an error after introducing RMM.
The error seems to originate from this location in the PyTorch library: torch/csrc/cuda/CUDAPluggableAllocator.cpp#L174C30-L174C39
Steps/Code to reproduce bug
Currently, I cannot find an easy way to reliably trigger this error.
Expected behavior
PyTorch ends up calling the cacheInfo function on the RMM allocator, which is not currently supported.
Environment details (please complete the following information):
rmm/print_env.sh script to gather relevant environment details
Additional context
Add any other context about the problem here.
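Given the later finding in this thread about torch.backends.cudnn.benchmark, a candidate reproducer might look like the sketch below (untested, with made-up shapes and model; not a confirmed minimal repro):

```python
import torch
import torch.nn as nn
import rmm
from rmm.allocators.torch import rmm_torch_allocator

# Install RMM as PyTorch's CUDA allocator before any CUDA memory is allocated.
rmm.reinitialize(pool_allocator=True)
torch.cuda.memory.change_current_allocator(rmm_torch_allocator)

# Benchmark mode triggers the cuDNN algorithm search that calls cacheInfo.
torch.backends.cudnn.benchmark = True

model = nn.Conv1d(16, 32, kernel_size=3, padding=1).cuda()
x = torch.randn(8, 16, 128, device="cuda")

# Mixed precision forward pass, as in the original training setup.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    y = model(x)
print(y.shape)
```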