Open macabdul9 opened 1 year ago
@macabdul9 removing the cached pytorch extension works for me.
rm -rf /nfs/users/ext_abdul.waheed/.cache/torch_extensions/py310_cu117
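For anyone whose cache lives somewhere else, the same cleanup can be sketched in Python. This assumes the cache sits under `~/.cache/torch_extensions` unless the `TORCH_EXTENSIONS_DIR` environment variable points elsewhere; the function names are made up for illustration and are not part of PyTorch or DeepSpeed.

```python
import os
import shutil
from pathlib import Path


def torch_extensions_root(env=None):
    """Guess the extension cache root (assumption: env override, else ~/.cache)."""
    env = os.environ if env is None else env
    override = env.get("TORCH_EXTENSIONS_DIR")
    if override:
        return Path(override)
    return Path.home() / ".cache" / "torch_extensions"


def clear_torch_extensions_cache(root=None):
    """Delete every build directory (e.g. py310_cu117) under the cache root.

    Returns the names of the directories that were removed.
    """
    root = Path(root) if root is not None else torch_extensions_root()
    if not root.is_dir():
        return []
    removed = []
    for child in sorted(root.iterdir()):
        if child.is_dir():
            shutil.rmtree(child)
            removed.append(child.name)
    return removed
```

Calling `clear_torch_extensions_cache()` with no argument uses the guessed root. The extensions should then be rebuilt on the next run, so the first start afterwards may take longer.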
The same issue occurs in my program. I also found that it gets stuck because the tensor cannot be moved to the GPU; the same problem happens when I call tensor.cuda(). I don't know how to fix it.
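A quick way to tell this case apart from the cached-extension one is to try the transfer in a separate process with a timeout. The sketch below is only an illustration: `probe` and `CUDA_PROBE` are hypothetical names, and the 60 second default is an arbitrary choice.

```python
import subprocess
import sys

# Snippet run in a child process: it only tries to place one small tensor on the GPU.
CUDA_PROBE = "import torch; torch.ones(1).cuda(); torch.cuda.synchronize()"


def probe(snippet=CUDA_PROBE, timeout_s=60.0):
    """Run `snippet` in a fresh interpreter and report 'ok', 'failed', or 'hung'."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", snippet],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The child did not finish in time and was killed by subprocess.run.
        return "hung"
    return "ok" if result.returncode == 0 else "failed"
```

If `probe()` reports `hung` even though no training code is involved, the problem is likely below DeepSpeed (driver, hardware, or machine configuration) rather than in the extension cache.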
@theblackcat102, @macabdul9 is this issue now resolved?
@newtonysls, it sounds like a different problem. Can you please open a new ticket?
Thanks
Problem solved. My issue was caused by a wrong BIOS setting related to the GPU.
@macabdul9 removing the cached pytorch extension works for me.
rm -rf /nfs/users/ext_abdul.waheed/.cache/torch_extensions/py310_cu117
But how can I do this inside a Docker container?
same question.
thx!!!
saved my day!
same Q
Hi, I ran into the same problem. Removing .cache does not work. Have you solved the problem? @LinB203 @Yiqiu-Zhang
Issue: Training doesn't begin after loading the model.
DS_REPORT
More Details:
Here are the last few lines from the logs:
Nothing happens after this. GPU memory utilization remains the same.
**nvidia-smi output**

CC: @HeyangQin @tjruwase
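When a run goes quiet like this, one generic way to see where it is stuck is to have Python print its stack traces after a delay. The sketch below uses the standard-library `faulthandler` module; the helper names and the 300 second default are assumptions for illustration, not something from the original report.

```python
import faulthandler
import sys


def watch_for_hang(timeout_s=300.0, file=None):
    """Dump all thread tracebacks every `timeout_s` seconds until cancelled."""
    target = sys.stderr if file is None else file
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)


def stop_watching():
    """Cancel the pending traceback dump, e.g. once training has started."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `watch_for_hang()` near the top of the training script, on every rank, should show which call each process is blocked in, which can help tell an extension build lock apart from a stuck GPU transfer.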