roytseng-tw / Detectron.pytorch

A pytorch implementation of Detectron. Both training from scratch and inferring directly from pretrained Detectron weights are available.
MIT License

Could not train on v100 #148

Open qinhaifangpku opened 5 years ago

qinhaifangpku commented 5 years ago


Hi, thanks for sharing this great work! I want to train Mask R-CNN on a V100, but training fails with an error like this:

cudaCheckError() failed : no kernel image is available for execution on the device
THCudaCheck FAIL file=torch/csrc/cuda/Module.cpp line=32 error=29 : driver shutting down
Segmentation fault

Can anyone help me out of this problem?

Thank you in advance!
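For reference, "no kernel image is available for execution on the device" usually means the compiled CUDA code (the repo's CUDA extensions, or PyTorch itself) was not built for the GPU's compute capability, which is sm_70 for a V100. A minimal check of what the environment reports, assuming PyTorch was installed with CUDA support:

```python
import torch

# "no kernel image is available for execution on the device" usually means the
# compiled CUDA code does not target the running GPU's architecture.
# A V100 reports compute capability (7, 0), i.e. sm_70.
print("PyTorch version:", torch.__version__)
print("CUDA toolkit seen by PyTorch:", torch.version.cuda)
print("Device:", torch.cuda.get_device_name(0))
print("Compute capability:", torch.cuda.get_device_capability(0))
```

If this prints (7, 0) but the repo's extensions were compiled without the sm_70 flag, the error above is expected; see the compilation note in the next comment.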

santoshmo commented 5 years ago

Please follow these instructions in the compilation section:

If you are using Volta GPUs, uncomment this line in lib/make.sh and remember to append a backslash at the end of the line above it. CUDA_PATH defaults to /usr/local/cuda. If you want to use a CUDA library on a different path, change this line accordingly.
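For clarity, here is a sketch of what that part of lib/make.sh looks like after the edit; the exact list of -gencode flags may differ in your checkout, so treat this as an illustration rather than the file verbatim:

```sh
# lib/make.sh -- sketch of the relevant block after enabling Volta support;
# the exact set of architectures may differ in your copy of the repo.
CUDA_PATH=/usr/local/cuda/

CUDA_ARCH="-gencode arch=compute_52,code=sm_52 \
           -gencode arch=compute_60,code=sm_60 \
           -gencode arch=compute_61,code=sm_61 \
           -gencode arch=compute_70,code=sm_70"
#          ^ sm_70 line uncommented for the V100; note the trailing backslash
#            added to the end of the previous line so the string continues.
```

After editing, recompile the extensions (cd lib && sh make.sh) so the kernels are rebuilt for sm_70.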

qinhaifangpku commented 5 years ago

Please follow these instructions in the compilation section:

If you are using Volta GPUs, uncomment this line in lib/make.sh and remember to append a backslash at the end of the line above it. CUDA_PATH defaults to /usr/local/cuda. If you want to use a CUDA library on a different path, change this line accordingly.

Yeah, I have done as you suggested, but training still fails after some iterations: the GPU memory is not released, yet GPU utilization stays at 0%.

shinya7y commented 5 years ago

@qinhaifangpku Could you update your nvidia driver?

Gasoonjia commented 5 years ago

@qinhaifangpku I'm facing the same problem when training Mask R-CNN on a V100. The problem has never appeared when using other GPUs, such as a 1080 Ti. Judging from the messages printed after Ctrl-C, I think there is a deadlock when using Volta GPUs, and the project does seem to have known deadlock issues; there are traces of this in the code, e.g. https://github.com/roytseng-tw/Detectron.pytorch/blob/8315af319cd29b8884a7c0382c4700a96bf35bbc/tools/train_net_step.py#L18. I haven't figured out how to solve the problem fundamentally; switching from a Volta GPU to a Pascal GPU is the only workaround I know. Hope this helps.
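For what it's worth, the referenced line appears to be the cv2.setNumThreads(0) workaround for a known OpenCV/DataLoader deadlock (PyTorch issue #1355). A hedged sketch of how one might rule that deadlock in or out when training hangs with 0% GPU utilization; the dataset below is a dummy stand-in, not the repo's actual loader:

```python
import cv2
import torch
from torch.utils.data import DataLoader, TensorDataset

# Known workaround: OpenCV's internal thread pool can deadlock inside forked
# DataLoader worker processes (see pytorch/pytorch#1355).
cv2.setNumThreads(0)

# Dummy dataset standing in for the real one; the knob that matters is num_workers.
# If training stops hanging with num_workers=0 (loading done in the main process),
# the stall is a data-loading deadlock rather than a CUDA / V100 problem.
dataset = TensorDataset(torch.randn(8, 3, 224, 224), torch.zeros(8, dtype=torch.long))
loader = DataLoader(dataset, batch_size=2, shuffle=True, num_workers=0)

for images, labels in loader:
    pass  # one pass over the data to confirm the loader itself runs
```

This only narrows down where the hang comes from; it does not by itself fix a driver- or architecture-related failure.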