❓ Questions and Help
I hit an error when I start training. Could anyone help me solve this problem?

My laptop is a Lenovo Legion 7 with an RTX 3070. I run rotated_maskrcnn in a Docker image with Ubuntu 16.04, CUDA 9.0, and Python 3.6.5:

    nvidia-docker run -it --shm-size 2G -v /home/anthony/dev/r_maskrcnn:/home smilepse/cuda9.0-ubuntu16.04-py3-torch1.0 bash

(The image comes from: docker pull smilepse/cuda9.0-ubuntu16.04-py3-torch1.0:latest)

My pip list and development environment are at the bottom.

Training command:
python tools/train_net.py --config-file "configs/rotated/e2e_ms_rcnn_R_50_FPN_1x.yaml"
Error message:

    2021-09-03 03:19:09,408 maskrcnn_benchmark.trainer INFO: Start training
    THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=405 error=8 : invalid device function
    Traceback (most recent call last):
      File "tools/train_net.py", line 196, in <module>
        main()
      File "tools/train_net.py", line 189, in main
        model = train(cfg, args.local_rank, args.distributed)
      File "tools/train_net.py", line 89, in train
        arguments,
      File "/home/rotated_maskrcnn/maskrcnn_benchmark/engine/trainer.py", line 71, in do_train
        loss_dict = model(images, targets)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 489, in __call__
        result = self.forward(*input, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/_initialize.py", line 197, in new_fwd
        **applier(kwargs, input_caster))
      File "/home/rotated_maskrcnn/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 65, in forward
        features = self.backbone(images.tensors)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 489, in __call__
        result = self.forward(*input, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 92, in forward
        input = module(input)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 489, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/rotated_maskrcnn/maskrcnn_benchmark/modeling/backbone/resnet.py", line 149, in forward
        x = getattr(self, stage_name)(x)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 489, in __call__
        result = self.forward(*input, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 92, in forward
        input = module(input)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 489, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/rotated_maskrcnn/maskrcnn_benchmark/modeling/backbone/resnet.py", line 331, in forward
        out = self.conv2(out)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 489, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/rotated_maskrcnn/maskrcnn_benchmark/layers/misc.py", line 33, in forward
        return super(Conv2d, self).forward(x)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 320, in forward
        self.padding, self.dilation, self.groups)
    RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

pip list

    absl-py             0.3.0
    apex                0.1
    astor               0.7.1
    atari-py            0.1.1
    baselines           0.1.5
    Box2D               2.3.2
    Box2D-kengz         2.3.3
    box2d-py            2.3.4
    certifi             2018.4.16
    cffi                1.11.5
    chardet             3.0.4
    click               6.7
    cloudpickle         0.5.3
    cycler              0.10.0
    Cython              0.28.4
    dill                0.2.8.2
    future              0.16.0
    gast                0.2.0
    glfw                1.7.0
    grpcio              1.14.1
    gym                 0.10.5
    idna                2.7
    imageio             2.3.0
    joblib              0.12.2
    kiwisolver          1.0.1
    Markdown            2.6.11
    maskrcnn-benchmark  0.1                         /home/rotated_maskrcnn
    matplotlib          2.2.2
    mpi4py              3.0.0
    mujoco-py           1.50.1.56
    numpy               1.16.0
    opencv-python       3.4.2.17
    pandas              0.23.3
    Pillow              5.2.0
    pip                 18.0
    progressbar2        3.38.0
    protobuf            3.6.0
    pycocotools         2.0
    pycparser           2.18
    pycurl              7.43.0
    pyglet              1.3.2
    pygobject           3.20.0
    PyOpenGL            3.1.0
    pyparsing           2.2.0
    python-apt          1.1.0b1+ubuntu0.16.4.1
    python-dateutil     2.7.3
    python-utils        2.3.0
    pytz                2018.5
    PyYAML              5.4.1
    pyzmq               17.1.0
    requests            2.19.1
    scikit-learn        0.19.2
    scipy               1.1.0
    setproctitle        1.1.10
    setuptools          39.1.0
    six                 1.11.0
    tensorboard         1.10.0
    tensorflow          1.10.0
    termcolor           1.1.0
    tk                  0.1.0
    torch               1.0.0
    torchvision         0.2.1
    tqdm                4.24.0
    urllib3             1.23
    Werkzeug            0.14.1
    wheel               0.31.1
    yacs                0.1.8
    zmq                 0.0.0

Host nvidia-smi

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  GeForce RTX 307...  Off  | 00000000:01:00.0  On |                  N/A |
    | N/A   45C    P8    18W /  N/A |    846MiB /  7982MiB |      1%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |    0   N/A  N/A      1241      G   /usr/lib/xorg/Xorg                 26MiB |
    |    0   N/A  N/A      2693      G   /usr/bin/gnome-shell               88MiB |
    |    0   N/A  N/A      3281      G   /usr/lib/xorg/Xorg                356MiB |
    |    0   N/A  N/A      3422      G   /usr/bin/gnome-shell              219MiB |
    +-----------------------------------------------------------------------------+

nvcc

    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2017 NVIDIA Corporation
    Built on Fri_Sep__1_21:08:03_CDT_2017
    Cuda compilation tools, release 9.0, V9.0.176
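
One possible lead, offered as a sketch and not a confirmed diagnosis: the RTX 3070 is an Ampere GPU with compute capability 8.6, while the container ships CUDA 9.0 and torch 1.0.0. CUDA 9.0 targets GPUs only up to compute capability 7.0, so the "invalid device function" line may mean the installed binaries contain no kernels for this GPU. The check below compares the two. `MAX_CAPABILITY_BY_TOOLKIT`, `toolkit_supports`, and `parse_cuda_version` are hypothetical helpers written for this post, not part of PyTorch or maskrcnn_benchmark, and the table lists only a few toolkit releases.

```python
# Hypothetical helpers for illustration; not part of PyTorch or maskrcnn_benchmark.

# Highest GPU compute capability each listed CUDA toolkit release can target.
# Deliberately partial: only a few releases are included.
MAX_CAPABILITY_BY_TOOLKIT = {
    (9, 0): (7, 0),   # up to Volta
    (10, 0): (7, 5),  # up to Turing
    (11, 0): (8, 0),  # Ampere (A100)
    (11, 1): (8, 6),  # Ampere (RTX 30 series)
}


def parse_cuda_version(text):
    """Turn a version string such as '9.0.176' into (major, minor)."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def toolkit_supports(toolkit, capability):
    """Return True/False if the toolkit is in the table, else None (unknown)."""
    limit = MAX_CAPABILITY_BY_TOOLKIT.get(toolkit)
    if limit is None:
        return None
    return capability <= limit


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    # Only query the GPU when a CUDA build of PyTorch can actually see one.
    if torch is not None and torch.version.cuda and torch.cuda.is_available():
        toolkit = parse_cuda_version(torch.version.cuda)
        capability = torch.cuda.get_device_capability(0)
        print("PyTorch built with CUDA:", toolkit)
        print("GPU compute capability:", capability)
        print("Toolkit can target this GPU:", toolkit_supports(toolkit, capability))
```

If the last line prints `False` inside the container, an image with CUDA 11.1 or newer and a PyTorch build compiled for it would be the next thing to try. Whether rotated_maskrcnn builds against such a version is an open question here.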