Tried rebooting the workstation; however, the same error was reported as above.
Attempted to reproduce the error by reconfiguring a local workstation to mimic the Environment above:
1. Ubuntu `20.04`
2. NVIDIA Driver `515.48.07`
3. CUDA `11.7`
4. cuDNN `8.4.1`
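For reference, each of these component versions can be confirmed with standard version queries; a minimal sketch (the cuDNN header path is an assumption and varies by install method):

```bash
lsb_release -ds                       # Ubuntu release, e.g. "Ubuntu 20.04.4 LTS"
nvidia-smi --query-gpu=driver_version --format=csv,noheader   # NVIDIA driver
nvcc --version | grep release         # CUDA toolkit
# cuDNN version (header location assumed; differs for .deb vs tarball installs):
grep -A 2 'define CUDNN_MAJOR' /usr/include/cudnn_version.h
```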
The error source can therefore be deduced to not be an NVIDIA driver version discrepancy. Anaconda virtualization is a possible culprit.
Cuda error: no kernel image is available for execution on the device from PyTorch
This may concern the PyTorch dependencies in maskrcnn-benchmark used when setting up P3-TrainFarm and P3-Exporter.
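The `no kernel image is available` error usually means the installed PyTorch build (or a compiled extension such as maskrcnn-benchmark's `_C` module) ships no kernels for this GPU's compute capability. A minimal diagnostic sketch, assuming a working Python environment (`get_arch_list()` requires a reasonably recent PyTorch):

```bash
# Compare the GPU's compute capability against the architectures
# the installed PyTorch build actually ships kernels for.
python3 -c "import torch; \
print('CUDA available:  ', torch.cuda.is_available()); \
print('Built for CUDA:  ', torch.version.cuda); \
print('Device capability:', torch.cuda.get_device_capability(0)); \
print('Arch list:       ', torch.cuda.get_arch_list())"
```

If the device's capability is absent from the arch list, rebuilding the extension with `TORCH_CUDA_ARCH_LIST` set to that capability (e.g. `5.0` for the Quadro M1200 seen in the nvidia-smi output below) is one possible remedy.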
This issue should be resolved with EPD v0.3.0 (Pull Request #56).
Testing is to be done on a fresh workstation, with the above environment factors replicated, before closing.
Still encountering the issue when running `run_test_gui_gpu_local_only.bash`. See below for the terminal output error:
This is extracted from the aforementioned raw terminal log output:
```bash
Error response from daemon: could not select device driver "" with capabilities: [[gpu]]
```
This indicates that Docker cannot find the NVIDIA container runtime. Install `nvidia-docker2` and reboot:
```bash
# Add the libnvidia-container repository and its signing key.
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
  && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
     sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
     sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the NVIDIA Docker runtime and restart the Docker daemon.
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker

# Verify that containers can access the GPU.
sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
```
2. NVIDIA Driver `515.43.04`
```bash
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.43.04    Driver Version: 515.43.04    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro M1200        On   | 00000000:01:00.0 Off |                  N/A |
| N/A   43C    P8    N/A /  N/A |      7MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1111      G   /usr/lib/xorg/Xorg                  2MiB |
|    0   N/A  N/A      1659      G   /usr/lib/xorg/Xorg                  2MiB |
+-----------------------------------------------------------------------------+
```
3. CUDA `11.7`
```bash
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0
```
However, running the training still fails with the same error:

```bash
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=50 error=30 : unknown error
Traceback (most recent call last):
  File "tools/train_net.py", line 201, in <module>
    main()
  File "tools/train_net.py", line 194, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 39, in train
    model.to(device)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 432, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 208, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 208, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 208, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 230, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 430, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 179, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (30) : unknown error at /pytorch/aten/src/THC/THCGeneral.cpp:50
```
Rebooting the workstation cleared this error. Verified to work.
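As a possible alternative to a full reboot, reloading the NVIDIA Unified Memory kernel module has been known to clear this `unknown error (30)` state; a hedged sketch, which fails if any running process still holds a CUDA context:

```bash
# Reload the NVIDIA Unified Memory kernel module instead of rebooting.
# rmmod reports "module is in use" if a process still holds a CUDA context.
sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm
```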
The following error is encountered after training has already started for both the P2 and P3 MaskRCNN models; installation of dependencies succeeded.
```bash
RuntimeError: Not compiled with GPU support (nms at /home/user/p2_trainer/maskrcnn_benchmark/csrc/nms.h:22)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f76f0e97273 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: nms(at::Tensor const&, at::Tensor const&, float) + 0x138 (0x7f76e14a3368 in /usr/local/lib/python3.6/dist-packages/maskrcnn_benchmark-0.1-py3.6-linux-x86_64.egg/maskrcnn_benchmark/_C.cpython-36m-x86_64-linux-gnu.so)
frame #2: <unknown function> + 0x1a9c5 (0x7f76e14b39c5 in /usr/local/lib/python3.6/dist-packages/maskrcnn_benchmark-0.1-py3.6-linux-x86_64.egg/maskrcnn_benchmark/_C.cpython-36m-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x18592 (0x7f76e14b1592 in /usr/local/lib/python3.6/dist-packages/maskrcnn_benchmark-0.1-py3.6-linux-x86_64.egg/maskrcnn_benchmark/_C.cpython-36m-x86_64-linux-gnu.so)
<omitting python frames>
frame #6: python() [0x5755f4]
frame #7: python() [0x57ea7b]
frame #8: python() [0x57da3c]
frame #10: python() [0x57521f]
frame #11: python() [0x57ea7b]
...
```
Pending...
```bash
cd $HOME
git clone https://github.com/cardboardcode/easy_perception_deployment --branch dev --depth 1 public_epd
cd ~/public_epd/easy_perception_deployment/gui
# Comment out incremental GUI Local-Only GPU Training pytests.
bash run_test_gui_gpu_local_only.bash
# Observe any failing pytests.
```
For reference, see Issue #230 under https://github.com/facebookresearch/maskrcnn-benchmark.
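The usual remedy discussed there is to rebuild maskrcnn-benchmark with its CUDA extensions forced on. A sketch, assuming the checkout sits at the path shown in the traceback above and that its `setup.py` honors the `FORCE_CUDA` flag, as upstream maskrcnn-benchmark does for docker builds:

```bash
# Rebuild maskrcnn-benchmark with its CUDA ops (nms, roi_align, ...) compiled in.
# FORCE_CUDA=1 compiles the CUDA extensions even when no GPU is visible at
# build time, which is the usual situation inside `docker build`.
cd /home/user/p2_trainer      # path taken from the traceback above
rm -rf build/                 # discard any previous CPU-only build
FORCE_CUDA=1 python3 setup.py build develop
```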
Closed with EPD v0.3.2. Verified with a repeatable, successful dockerized workflow for P2 and P3 training and exporting.
Issue
Unable to build the P3 TrainFarm to generate `.onnx` model files. This issue will be used to track progress in resolving it.
Environment
1. Ubuntu `20.04` LTS
2. NVIDIA Driver `515.43`
3. CUDA `10.2`
4. cuDNN `7.6.5`
Error Report
The following is the critical error reported in the terminal upon running a training process:
```bash
Done (t=0.00s)
creating index...
index created!
2022-07-12 17:37:43,508 maskrcnn_benchmark.utils.miscellaneous INFO: Saving labels mapping into ./weights/custom/labels.json
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
2022-07-12 17:37:43,540 maskrcnn_benchmark.trainer INFO: Start training
Traceback (most recent call last):
  File "tools/train_net.py", line 201, in <module>
```

Error Abstract
Based on the Error Report, the main error can be condensed to the following: