ros-industrial / easy_perception_deployment

A ROS2 package that accelerates the training and deployment of CV models for industries.
Apache License 2.0

Issue building P3 trainfarm with Nvidia Driver 515 and CUDA 11.7. #47

Closed cardboardcode closed 2 years ago

cardboardcode commented 2 years ago

Issue

Unable to build the P3 TrainFarm to generate .onnx model files. This issue tracks progress in resolving it.

Environment

  1. Ubuntu 20.04 LTS
  2. ROS2 Foxy
  3. Nvidia Driver 515.43
  4. CUDA 10.2
  5. CUDNN 7.6.5

Error Report

The following is the critical error reported in the terminal upon running a training process:

```bash
Done (t=0.00s)
creating index...
index created!
2022-07-12 17:37:43,508 maskrcnn_benchmark.utils.miscellaneous INFO: Saving labels mapping into ./weights/custom/labels.json
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
2022-07-12 17:37:43,540 maskrcnn_benchmark.trainer INFO: Start training
Traceback (most recent call last):
  File "tools/train_net.py", line 201, in <module>
    main()
  File "tools/train_net.py", line 194, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 94, in train
    arguments,
  File "/home/rosi/anaconda3/envs/p3_trainer/lib/python3.6/site-packages/maskrcnn_benchmark-0.1-py3.6-linux-x86_64.egg/maskrcnn_benchmark/engine/trainer.py", line 84, in do_train
    loss_dict = model(images, targets)
  File "/home/rosi/anaconda3/envs/p3_trainer/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rosi/anaconda3/envs/p3_trainer/lib/python3.6/site-packages/apex-0.1-py3.6.egg/apex/amp/_initialize.py", line 197, in new_fwd
    **applier(kwargs, input_caster))
  File "/home/rosi/anaconda3/envs/p3_trainer/lib/python3.6/site-packages/maskrcnn_benchmark-0.1-py3.6-linux-x86_64.egg/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 50, in forward
    proposals, proposal_losses = self.rpn(images, features, targets)
  File "/home/rosi/anaconda3/envs/p3_trainer/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rosi/anaconda3/envs/p3_trainer/lib/python3.6/site-packages/maskrcnn_benchmark-0.1-py3.6-linux-x86_64.egg/maskrcnn_benchmark/modeling/rpn/rpn.py", line 159, in forward
    return self._forward_train(anchors, objectness, rpn_box_regression, targets)
  File "/home/rosi/anaconda3/envs/p3_trainer/lib/python3.6/site-packages/maskrcnn_benchmark-0.1-py3.6-linux-x86_64.egg/maskrcnn_benchmark/modeling/rpn/rpn.py", line 175, in _forward_train
    anchors, objectness, rpn_box_regression, targets
  File "/home/rosi/anaconda3/envs/p3_trainer/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rosi/anaconda3/envs/p3_trainer/lib/python3.6/site-packages/maskrcnn_benchmark-0.1-py3.6-linux-x86_64.egg/maskrcnn_benchmark/modeling/rpn/inference.py", line 140, in forward
    sampled_boxes.append(self.forward_for_single_feature_map(a, o, b))
  File "/home/rosi/anaconda3/envs/p3_trainer/lib/python3.6/site-packages/maskrcnn_benchmark-0.1-py3.6-linux-x86_64.egg/maskrcnn_benchmark/modeling/rpn/inference.py", line 120, in forward_for_single_feature_map
    score_field="objectness",
  File "/home/rosi/anaconda3/envs/p3_trainer/lib/python3.6/site-packages/maskrcnn_benchmark-0.1-py3.6-linux-x86_64.egg/maskrcnn_benchmark/structures/boxlist_ops.py", line 27, in boxlist_nms
    keep = _box_nms(boxes, score, nms_thresh)
  File "/home/rosi/anaconda3/envs/p3_trainer/lib/python3.6/site-packages/apex-0.1-py3.6.egg/apex/amp/amp.py", line 22, in wrapper
    return orig_fn(*args, **kwargs)
RuntimeError: CUDA error: no kernel image is available for execution on the device (launch_kernel at /tmp/pip-req-build-p5q91txh/aten/src/ATen/native/cuda/Loops.cuh:102)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator > const&) + 0x6d (0x7f6ecc7271cd in /home/rosi/anaconda3/envs/p3_trainer/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: void at::native::gpu_index_kernel<__nv_dl_wrapper_t<__nv_dl_tag, c10::ArrayRef), &(void at::native::index_kernel_impl >(at::TensorIterator&, c10::ArrayRef, c10::ArrayRef)), 1u>> >(at::TensorIterator&, c10::ArrayRef, c10::ArrayRef, __nv_dl_wrapper_t<__nv_dl_tag, c10::ArrayRef), &(void at::native::index_kernel_impl >(at::TensorIterator&, c10::ArrayRef, c10::ArrayRef)), 1u>> const&) + 0x85f (0x7f6ed1dc54cf in /home/rosi/anaconda3/envs/p3_trainer/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #2: + 0x565e183 (0x7f6ed1dc8183 in /home/rosi/anaconda3/envs/p3_trainer/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #3: + 0x565e548 (0x7f6ed1dc8548 in /home/rosi/anaconda3/envs/p3_trainer/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #4: + 0x1359cb2 (0x7f6ecdac3cb2 in /home/rosi/anaconda3/envs/p3_trainer/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #5: at::native::index(at::Tensor const&, c10::ArrayRef) + 0x460 (0x7f6ecdac4610 in /home/rosi/anaconda3/envs/p3_trainer/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #6: at::TypeDefault::index(at::Tensor const&, c10::ArrayRef) + 0x89 (0x7f6ece0647a9 in /home/rosi/anaconda3/envs/p3_trainer/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #7: torch::autograd::VariableType::index(at::Tensor const&, c10::ArrayRef) + 0x861 (0x7f6ecfa02081 in /home/rosi/anaconda3/envs/p3_trainer/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #8: at::Tensor::index(c10::ArrayRef) const + 0x81 (0x7f6ecf7cdf11 in /home/rosi/anaconda3/envs/p3_trainer/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #9: nms_cuda(at::Tensor, float) + 0x7c4 (0x7f6e93fe3930 in /home/rosi/anaconda3/envs/p3_trainer/lib/python3.6/site-packages/maskrcnn_benchmark-0.1-py3.6-linux-x86_64.egg/maskrcnn_benchmark/_C.cpython-36m-x86_64-linux-gnu.so)
frame #10: nms(at::Tensor const&, at::Tensor const&, float) + 0x3ff (0x7f6e93f9eb2f in /home/rosi/anaconda3/envs/p3_trainer/lib/python3.6/site-packages/maskrcnn_benchmark-0.1-py3.6-linux-x86_64.egg/maskrcnn_benchmark/_C.cpython-36m-x86_64-linux-gnu.so)
frame #11: + 0x45b93 (0x7f6e93fb0b93 in /home/rosi/anaconda3/envs/p3_trainer/lib/python3.6/site-packages/maskrcnn_benchmark-0.1-py3.6-linux-x86_64.egg/maskrcnn_benchmark/_C.cpython-36m-x86_64-linux-gnu.so)
frame #12: + 0x437b4 (0x7f6e93fae7b4 in /home/rosi/anaconda3/envs/p3_trainer/lib/python3.6/site-packages/maskrcnn_benchmark-0.1-py3.6-linux-x86_64.egg/maskrcnn_benchmark/_C.cpython-36m-x86_64-linux-gnu.so)
```

Error Abstract

Based on the Error Report, the main error reported can be condensed to the following:

```bash
RuntimeError: CUDA error: no kernel image is available for execution on the device (launch_kernel at /tmp/pip-req-build-p5q91txh/aten/src/ATen/native/cuda/Loops.cuh:102)
```

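This error typically means the installed PyTorch build ships no compiled kernels (SASS) or forward-compatible PTX for the GPU's compute capability; CUDA 10.2-era wheels stop at Turing (sm_75), so an Ampere GPU (sm_86) paired with Driver 515 has no kernel image to run. A minimal sketch of that compatibility rule, with illustrative arch lists (not read from any actual wheel):

```python
def kernel_image_available(device_capability, compiled_archs):
    """Return True if a binary built for `compiled_archs` can run on a GPU
    with the given compute capability.

    A CUDA binary runs natively only if it embeds SASS ("sm_XY") for the
    GPU's exact architecture, or PTX ("compute_XY") for an equal-or-lower
    architecture that the driver can JIT-compile forward.
    """
    major, minor = device_capability
    target = major * 10 + minor
    sass = {int(a[3:]) for a in compiled_archs if a.startswith("sm_")}
    ptx = {int(a[8:]) for a in compiled_archs if a.startswith("compute_")}
    return target in sass or any(p <= target for p in ptx)

# An illustrative CUDA 10.2-era arch list: Turing (7,5) is covered,
# Ampere (8,6) is not -- matching the RuntimeError above.
old_wheel = ["sm_35", "sm_50", "sm_60", "sm_70", "sm_75"]
print(kernel_image_available((7, 5), old_wheel))  # True
print(kernel_image_available((8, 6), old_wheel))  # False
```

The practical fix is therefore to install a PyTorch build whose CUDA arch list covers the GPU, rather than to change drivers.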
cardboardcode commented 2 years ago

Solution Attempted

Tried rebooting the workstation; however, the same error was reported.

cardboardcode commented 2 years ago

Attempted to reproduce the error by reconfiguring a local workstation to mimic the Environment above:

Environment

  1. Ubuntu 20.04
  2. ROS2 Foxy Fitzroy
  3. Nvidia Driver 515.48.07
  4. CUDA 11.7
  5. CUDNN 8.4.1

Observations

  1. The unreleased dockerized training workflow appears unaffected. Will seek to integrate it by end of Aug 2022.
  2. The existing native training workflow is also unaffected. Unable to reproduce the error.

Since the error could not be reproduced, it can be deduced that the error source is not an Nvidia driver version discrepancy. Anaconda virtualization is a possible cause.
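This deduction is consistent with Nvidia's compatibility model: a driver can run binaries built against any CUDA toolkit version up to the one it reports (Driver 515 reports CUDA 11.7, so a CUDA 10.2 build inside the conda environment remains runnable). A rough sketch of that check, with the version strings as illustrative inputs:

```python
def driver_supports_toolkit(driver_cuda: str, toolkit_cuda: str) -> bool:
    """Nvidia drivers are backward-compatible: a driver can run binaries
    built against any CUDA toolkit up to the version it reports."""
    as_tuple = lambda v: tuple(int(x) for x in v.split("."))
    return as_tuple(toolkit_cuda) <= as_tuple(driver_cuda)

# Driver 515 reports CUDA 11.7; the conda environment ships CUDA 10.2.
print(driver_supports_toolkit("11.7", "10.2"))  # True: not a driver mismatch
print(driver_supports_toolkit("11.7", "12.0"))  # False: this would be one
```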

cardboardcode commented 2 years ago

Issue Reference

Cuda error: no kernel image is available for execution on the device from PyTorch

This may concern pytorch dependencies found in maskrcnn-benchmark when setting up P3-TrainFarm and P3-Exporter.

cardboardcode commented 2 years ago

This issue should be resolved with EPD v0.3.0 Pull Request #56.

Testing to be done on a fresh workstation, with the above environment factors replicated, before closing.

mercedes149 commented 2 years ago

Training Error [ RESOLVED ] :heavy_check_mark:

Still encountering an issue when running run_test_gui_gpu_local_only.bash. See below for the terminal output error:

```bash
rosi@rosi-Precision-5520:~/easy_perception_deployment/easy_perception_deployment/gui$ bash run_test_gui_gpu_local_only.bash =========================================================================================== test session starts =========================================================================================== platform linux -- Python 3.6.13, pytest-6.2.4, py-1.11.0, pluggy-0.13.1 -- /home/rosi/anaconda3/envs/epd_gui_env/bin/python cachedir: .pytest_cache PySide2 5.15.2.1 -- Qt runtime 5.15.2 -- Qt compiled 5.15.2 rootdir: /home/rosi/easy_perception_deployment/easy_perception_deployment/gui, configfile: pytest.ini plugins: qt-4.0.2 collecting ... [ nvidia-smi ] command - FOUND [ nvcc ] command - FOUND GPU device FOUND. Proceeding... collected 20 items test_gui_gpu_local_only.py::test_P3Trainer_pullTrainFarmDockerImage latest: Pulling from cardboardcode/epd-trainer 92473f7ef455: Pull complete fb52bde70123: Pull complete 64788f86be3f: Pull complete 33f6d5f2e001: Pull complete 00e1b288fcc5: Pull complete 603fa6ac079b: Pull complete 18d7574b18d0: Pull complete c95c9a59aace: Pull complete a4c21b7e8698: Pull complete 38916087d014: Pull complete 5f89304ac2b2: Pull complete 20a528b75c59: Pull complete f56cdc62d3a7: Pull complete f3202a94d18c: Pull complete 25277479eced: Pull complete 93de04b756bd: Pull complete a3447685c288: Pull complete bbd252e7b76d: Pull complete 57d66d7bf1a2: Pull complete 96be7e3ba7ec: Pull complete 6e3a274d2811: Pull complete 5eb47a2eb350: Pull complete ae67e2f7a810: Pull complete Digest: sha256:7f4bfef36cc876f4043b2f59d2a278fd4f859df94e01b0666b79dbe7dbaad944 Status: Downloaded newer image for cardboardcode/epd-trainer:latest docker.io/cardboardcode/epd-trainer:latest PASSED test_gui_gpu_local_only.py::test_P3Trainer_createTrainFarmDockerContainer e1e0a56e0705ce96f52db2f7a6f8e645aec1856fdf960a1a9ad25cf0ec79a8ec PASSED test_gui_gpu_local_only.py::test_P3Trainer_installTrainingDependencies cp: cannot stat
'../data/datasets/custom_dataset': No such file or directory Error response from daemon: could not select device driver "" with capabilities: [[gpu]] Error: failed to start containers: epd_p3_trainer Error response from daemon: Container e1e0a56e0705ce96f52db2f7a6f8e645aec1856fdf960a1a9ad25cf0ec79a8ec is not running FAILED test_gui_gpu_local_only.py::test_P3Trainer_copyTrainingFiles Error response from daemon: could not select device driver "" with capabilities: [[gpu]] Error: failed to start containers: epd_p3_trainer Error response from daemon: Container e1e0a56e0705ce96f52db2f7a6f8e645aec1856fdf960a1a9ad25cf0ec79a8ec is not running FAILED test_gui_gpu_local_only.py::test_P3Trainer_runTraining Error response from daemon: could not select device driver "" with capabilities: [[gpu]] Error: failed to start containers: epd_p3_trainer Error response from daemon: Container e1e0a56e0705ce96f52db2f7a6f8e645aec1856fdf960a1a9ad25cf0ec79a8ec is not running FAILED test_gui_gpu_local_only.py::test_P3Trainer_pullExporterDockerImage Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Post "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/images/create?fromImage=cardboardcode%2Fepd-exporter&tag=latest": dial unix /var/run/docker.sock: connect: permission denied FAILED test_gui_gpu_local_only.py::test_P3Trainer_createExportDockerContainer Unable to find image 'cardboardcode/epd-exporter:latest' locally latest: Pulling from cardboardcode/epd-exporter 92473f7ef455: Already exists fb52bde70123: Already exists 64788f86be3f: Already exists 33f6d5f2e001: Already exists 00e1b288fcc5: Already exists 603fa6ac079b: Already exists 18d7574b18d0: Already exists c95c9a59aace: Already exists a4c21b7e8698: Already exists 38916087d014: Already exists 5f89304ac2b2: Already exists 20a528b75c59: Already exists f56cdc62d3a7: Already exists f3202a94d18c: Already exists 25277479eced: Already exists 93de04b756bd: Already exists e0553b7a150d: Pull complete 
9e44c5eef3a9: Pull complete c3996af73a84: Pull complete 58df4b82d8bb: Pull complete Digest: sha256:adc5f6d990f72bdb7b022f689a6eca58d3174960a8207bcb2a1c07af1d43f4ef Status: Downloaded newer image for cardboardcode/epd-exporter:latest b593dcb581fec70a552b8c7b7370754ac816c33c48ed08436950ea697e252ca0 PASSED test_gui_gpu_local_only.py::test_P3Trainer_installExporterDependencies Error response from daemon: could not select device driver "" with capabilities: [[gpu]] Error: failed to start containers: epd_p3_exporter Error response from daemon: Container b593dcb581fec70a552b8c7b7370754ac816c33c48ed08436950ea697e252ca0 is not running FAILED test_gui_gpu_local_only.py::test_P3Trainer_copyExportFiles Error response from daemon: could not select device driver "" with capabilities: [[gpu]] Error: failed to start containers: epd_p3_exporter Error response from daemon: Container b593dcb581fec70a552b8c7b7370754ac816c33c48ed08436950ea697e252ca0 is not running FAILED test_gui_gpu_local_only.py::test_P3Trainer_runExporter Error response from daemon: could not select device driver "" with capabilities: [[gpu]] Error: failed to start containers: epd_p3_exporter Error response from daemon: Container b593dcb581fec70a552b8c7b7370754ac816c33c48ed08436950ea697e252ca0 is not running FAILED test_gui_gpu_local_only.py::test_P2Trainer_pullTrainFarmDockerImage PASSED test_gui_gpu_local_only.py::test_P2Trainer_createTrainFarmDockerContainer 89ff3e32b53e690f5b81a1b6bbfde1c6280d1edac2af86ab2e9ed3c3f587b840 PASSED test_gui_gpu_local_only.py::test_P2Trainer_installTrainingDependencies cp: cannot stat '../data/datasets/custom_dataset': No such file or directory Error response from daemon: could not select device driver "" with capabilities: [[gpu]] Error: failed to start containers: epd_p2_trainer Error response from daemon: Container 89ff3e32b53e690f5b81a1b6bbfde1c6280d1edac2af86ab2e9ed3c3f587b840 is not running FAILED test_gui_gpu_local_only.py::test_P2Trainer_copyTrainingFiles Error response from 
daemon: could not select device driver "" with capabilities: [[gpu]] Error: failed to start containers: epd_p2_trainer Error response from daemon: Container 89ff3e32b53e690f5b81a1b6bbfde1c6280d1edac2af86ab2e9ed3c3f587b840 is not running FAILED test_gui_gpu_local_only.py::test_P2Trainer_runTraining Error response from daemon: could not select device driver "" with capabilities: [[gpu]] Error: failed to start containers: epd_p2_trainer Error response from daemon: Container 89ff3e32b53e690f5b81a1b6bbfde1c6280d1edac2af86ab2e9ed3c3f587b840 is not running FAILED test_gui_gpu_local_only.py::test_P2Trainer_pullExporterDockerImage PASSED test_gui_gpu_local_only.py::test_P2Trainer_createExportDockerContainer 6887894a0144a38f4af71f24575b957e1929078368be5efcb31359eeaa4479f4 PASSED test_gui_gpu_local_only.py::test_P2Trainer_installExporterDependencies Error response from daemon: could not select device driver "" with capabilities: [[gpu]] Error: failed to start containers: epd_p2_exporter Error response from daemon: Container 6887894a0144a38f4af71f24575b957e1929078368be5efcb31359eeaa4479f4 is not running FAILED test_gui_gpu_local_only.py::test_P2Trainer_copyExportFiles Error response from daemon: could not select device driver "" with capabilities: [[gpu]] Error: failed to start containers: epd_p2_exporter Error response from daemon: Container 6887894a0144a38f4af71f24575b957e1929078368be5efcb31359eeaa4479f4 is not running FAILED test_gui_gpu_local_only.py::test_P2Trainer_runExporter Error response from daemon: could not select device driver "" with capabilities: [[gpu]] Error: failed to start containers: epd_p2_exporter Error response from daemon: Container 6887894a0144a38f4af71f24575b957e1929078368be5efcb31359eeaa4479f4 is not running FAILED ================================================================================================ FAILURES ================================================================================================= 
_______________________________________________________________________________ test_P3Trainer_installTrainingDependencies ________________________________________________________________________________ qtbot = def test_P3Trainer_installTrainingDependencies(qtbot): global PATH_TO_TEST_TRAIN_DATASET path_to_dataset = PATH_TO_TEST_TRAIN_DATASET model_name = 'maskrcnn' label_list = ['__ignore__', '_background_', 'teabox'] _TRAIN_DOCKER_CONTAINER = "epd_p3_trainer" widget = TrainWindow(True) qtbot.addWidget(widget) widget.max_iteration = 100 widget.checkpoint_period = 100 widget.test_period = 100 widget.steps = '(100, 200, 300)' p3_trainer = P3Trainer( path_to_dataset, model_name, label_list, 100, 100, 100, '(100, 200, 300)') p3_trainer.installTrainingDependencies() # Check if _TRAIN_DOCKER_CONTAINER Docker Container # has been successfully created. cmd = [ "sudo", "docker", "inspect", "--type=container", _TRAIN_DOCKER_CONTAINER] docker_inspect_process = subprocess.Popen( cmd, universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, env=None) docker_inspect_process.communicate() assert docker_inspect_process.returncode == 0 > assert os.path.exists("p3_trainer") is True E AssertionError: assert False is True E + where False = ('p3_trainer') E + where = .exists E + where = os.path test_gui_gpu_local_only.py:225: AssertionError ____________________________________________________________________________________ test_P3Trainer_copyTrainingFiles _____________________________________________________________________________________ qtbot = def test_P3Trainer_copyTrainingFiles(qtbot): global PATH_TO_TEST_TRAIN_DATASET path_to_dataset = PATH_TO_TEST_TRAIN_DATASET model_name = 'maskrcnn' label_list = ['__ignore__', '_background_', 'teabox'] _TRAIN_DOCKER_CONTAINER = "epd_p3_trainer" widget = TrainWindow(True) qtbot.addWidget(widget) widget.max_iteration = 100 widget.checkpoint_period = 100 widget.test_period = 100 widget.steps = '(100, 200, 300)' p3_trainer = 
P3Trainer( path_to_dataset, model_name, label_list, 100, 100, 100, '(100, 200, 300)') p3_trainer.copyTrainingFiles() > assert os.path.exists("p3_trainer") is True E AssertionError: assert False is True E + where False = ('p3_trainer') E + where = .exists E + where = os.path test_gui_gpu_local_only.py:255: AssertionError _______________________________________________________________________________________ test_P3Trainer_runTraining ________________________________________________________________________________________ qtbot = def test_P3Trainer_runTraining(qtbot): global PATH_TO_TEST_TRAIN_DATASET path_to_dataset = PATH_TO_TEST_TRAIN_DATASET model_name = 'maskrcnn' label_list = ['__ignore__', '_background_', 'teabox'] widget = TrainWindow(True) qtbot.addWidget(widget) widget.max_iteration = 100 widget.checkpoint_period = 100 widget.test_period = 100 widget.steps = '(100, 200, 300)' p3_trainer = P3Trainer( path_to_dataset, model_name, label_list, 100, 100, 100, '(100, 200, 300)') p3_trainer.runTraining() # Check if trained.pth has been generated in root. 
> assert os.path.exists("trained.pth") is True E AssertionError: assert False is True E + where False = ('trained.pth') E + where = .exists E + where = os.path test_gui_gpu_local_only.py:301: AssertionError _________________________________________________________________________________ test_P3Trainer_pullExporterDockerImage __________________________________________________________________________________ qtbot = def test_P3Trainer_pullExporterDockerImage(qtbot): path_to_dataset = 'path_to_dummy_dataset' model_name = 'maskrcnn' label_list = ['__ignore__', '_background_', 'teabox'] _EXPORT_DOCKER_IMG = "cardboardcode/epd-exporter:latest" widget = TrainWindow(True) qtbot.addWidget(widget) widget.max_iteration = 100 widget.checkpoint_period = 100 widget.test_period = 100 widget.steps = '(100, 200, 300)' p3_trainer = P3Trainer( path_to_dataset, model_name, label_list, 100, 100, 100, '(100, 200, 300)') p3_trainer.pullExporterDockerImage() cmd = ["sudo", "docker", "inspect", "--type=image", _EXPORT_DOCKER_IMG] docker_inspect_process = subprocess.Popen( cmd, universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, env=None) docker_inspect_process.communicate() > assert docker_inspect_process.returncode == 0 E assert 1 == 0 E +1 E -0 test_gui_gpu_local_only.py:340: AssertionError _______________________________________________________________________________ test_P3Trainer_installExporterDependencies ________________________________________________________________________________ qtbot = def test_P3Trainer_installExporterDependencies(qtbot): global PATH_TO_TEST_TRAIN_DATASET path_to_dataset = PATH_TO_TEST_TRAIN_DATASET model_name = 'maskrcnn' label_list = ['__ignore__', '_background_', 'teabox'] _EXPORT_DOCKER_CONTAINER = "epd_p3_exporter" widget = TrainWindow(True) qtbot.addWidget(widget) widget.max_iteration = 100 widget.checkpoint_period = 100 widget.test_period = 100 widget.steps = '(100, 200, 300)' p3_trainer = P3Trainer( path_to_dataset, 
model_name, label_list, 100, 100, 100, '(100, 200, 300)') p3_trainer.installExporterDependencies() # Check if _TRAIN_DOCKER_CONTAINER Docker Container # has been successfully created. cmd = [ "sudo", "docker", "inspect", "--type=container", _EXPORT_DOCKER_CONTAINER] docker_inspect_process = subprocess.Popen( cmd, universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, env=None) docker_inspect_process.communicate() assert docker_inspect_process.returncode == 0 > assert os.path.exists("p3_exporter") is True E AssertionError: assert False is True E + where False = ('p3_exporter') E + where = .exists E + where = os.path test_gui_gpu_local_only.py:427: AssertionError _____________________________________________________________________________________ test_P3Trainer_copyExportFiles ______________________________________________________________________________________ qtbot = def test_P3Trainer_copyExportFiles(qtbot): global PATH_TO_TEST_TRAIN_DATASET path_to_dataset = PATH_TO_TEST_TRAIN_DATASET model_name = 'maskrcnn' label_list = ['__ignore__', '_background_', 'teabox'] _TRAIN_DOCKER_CONTAINER = "epd_p3_trainer" widget = TrainWindow(True) qtbot.addWidget(widget) widget.max_iteration = 100 widget.checkpoint_period = 100 widget.test_period = 100 widget.steps = '(100, 200, 300)' p3_trainer = P3Trainer( path_to_dataset, model_name, label_list, 100, 100, 100, '(100, 200, 300)') p3_trainer.copyExportFiles() > assert os.path.exists("p3_exporter") is True E AssertionError: assert False is True E + where False = ('p3_exporter') E + where = .exists E + where = os.path test_gui_gpu_local_only.py:457: AssertionError _______________________________________________________________________________________ test_P3Trainer_runExporter ________________________________________________________________________________________ qtbot = def test_P3Trainer_runExporter(qtbot): global PATH_TO_TEST_TRAIN_DATASET path_to_dataset = PATH_TO_TEST_TRAIN_DATASET model_name = 
'maskrcnn' label_list = ['__ignore__', '_background_', 'teabox'] widget = TrainWindow(True) qtbot.addWidget(widget) widget.max_iteration = 100 widget.checkpoint_period = 100 widget.test_period = 100 widget.steps = '(100, 200, 300)' p3_trainer = P3Trainer( path_to_dataset, model_name, label_list, 100, 100, 100, '(100, 200, 300)') p3_trainer.runExporter() # Check if trained.pth has been generated in root. > assert os.path.exists("output.onnx") is True E AssertionError: assert False is True E + where False = ('output.onnx') E + where = .exists E + where = os.path test_gui_gpu_local_only.py:499: AssertionError -------------------------------------------------------------------------------------------- Captured log call -------------------------------------------------------------------------------------------- WARNING p3_train:P3Trainer.py:697 [ output.onnx ] - MISSING. Something must have failed before this. _______________________________________________________________________________ test_P2Trainer_installTrainingDependencies ________________________________________________________________________________ qtbot = def test_P2Trainer_installTrainingDependencies(qtbot): global PATH_TO_TEST_TRAIN_DATASET path_to_dataset = PATH_TO_TEST_TRAIN_DATASET model_name = 'fasterrcnn' label_list = ['__ignore__', '_background_', 'teabox'] _TRAIN_DOCKER_CONTAINER = "epd_p2_trainer" widget = TrainWindow(True) qtbot.addWidget(widget) widget.max_iteration = 100 widget.checkpoint_period = 100 widget.test_period = 100 widget.steps = '(100, 200, 300)' p2_trainer = P2Trainer( path_to_dataset, model_name, label_list, 100, 100, 100, '(100, 200, 300)') p2_trainer.installTrainingDependencies() # Check if _TRAIN_DOCKER_CONTAINER Docker Container # has been successfully created. 
cmd = [ "sudo", "docker", "inspect", "--type=container", _TRAIN_DOCKER_CONTAINER] docker_inspect_process = subprocess.Popen( cmd, universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, env=None) docker_inspect_process.communicate() assert docker_inspect_process.returncode == 0 > assert os.path.exists("p2_trainer") is True E AssertionError: assert False is True E + where False = ('p2_trainer') E + where = .exists E + where = os.path test_gui_gpu_local_only.py:625: AssertionError ____________________________________________________________________________________ test_P2Trainer_copyTrainingFiles _____________________________________________________________________________________ qtbot = def test_P2Trainer_copyTrainingFiles(qtbot): global PATH_TO_TEST_TRAIN_DATASET path_to_dataset = PATH_TO_TEST_TRAIN_DATASET model_name = 'fasterrcnn' label_list = ['__ignore__', '_background_', 'teabox'] _TRAIN_DOCKER_CONTAINER = "epd_p2_trainer" widget = TrainWindow(True) qtbot.addWidget(widget) widget.max_iteration = 100 widget.checkpoint_period = 100 widget.test_period = 100 widget.steps = '(100, 200, 300)' p2_trainer = P2Trainer( path_to_dataset, model_name, label_list, 100, 100, 100, '(100, 200, 300)') p2_trainer.copyTrainingFiles() > assert os.path.exists("p2_trainer") is True E AssertionError: assert False is True E + where False = ('p2_trainer') E + where = .exists E + where = os.path test_gui_gpu_local_only.py:655: AssertionError _______________________________________________________________________________________ test_P2Trainer_runTraining ________________________________________________________________________________________ qtbot = def test_P2Trainer_runTraining(qtbot): global PATH_TO_TEST_TRAIN_DATASET path_to_dataset = PATH_TO_TEST_TRAIN_DATASET model_name = 'fasterrcnn' label_list = ['__ignore__', '_background_', 'teabox'] widget = TrainWindow(True) qtbot.addWidget(widget) widget.max_iteration = 100 widget.checkpoint_period = 100 
widget.test_period = 100 widget.steps = '(100, 200, 300)' p2_trainer = P2Trainer( path_to_dataset, model_name, label_list, 100, 100, 100, '(100, 200, 300)') p2_trainer.runTraining() # Check if trained.pth has been generated in root. > assert os.path.exists("trained.pth") is True E AssertionError: assert False is True E + where False = ('trained.pth') E + where = .exists E + where = os.path test_gui_gpu_local_only.py:701: AssertionError _______________________________________________________________________________ test_P2Trainer_installExporterDependencies ________________________________________________________________________________ qtbot = def test_P2Trainer_installExporterDependencies(qtbot): global PATH_TO_TEST_TRAIN_DATASET path_to_dataset = PATH_TO_TEST_TRAIN_DATASET model_name = 'fasterrcnn' label_list = ['__ignore__', '_background_', 'teabox'] _EXPORT_DOCKER_CONTAINER = "epd_p2_exporter" widget = TrainWindow(True) qtbot.addWidget(widget) widget.max_iteration = 100 widget.checkpoint_period = 100 widget.test_period = 100 widget.steps = '(100, 200, 300)' p2_trainer = P2Trainer( path_to_dataset, model_name, label_list, 100, 100, 100, '(100, 200, 300)') p2_trainer.installExporterDependencies() # Check if _TRAIN_DOCKER_CONTAINER Docker Container # has been successfully created. 
            cmd = [
                "sudo",
                "docker",
                "inspect",
                "--type=container",
                _EXPORT_DOCKER_CONTAINER]
            docker_inspect_process = subprocess.Popen(
                cmd,
                universal_newlines=True,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                env=None)
            docker_inspect_process.communicate()
            assert docker_inspect_process.returncode == 0
>           assert os.path.exists("p2_exporter") is True
E           AssertionError: assert False is True
E            +  where False = ('p2_exporter')
E            +    where  = .exists
E            +      where  = os.path

test_gui_gpu_local_only.py:826: AssertionError
_____________________ test_P2Trainer_copyExportFiles _____________________

qtbot =

    def test_P2Trainer_copyExportFiles(qtbot):
        global PATH_TO_TEST_TRAIN_DATASET
        path_to_dataset = PATH_TO_TEST_TRAIN_DATASET
        model_name = 'fasterrcnn'
        label_list = ['__ignore__', '_background_', 'teabox']
        _TRAIN_DOCKER_CONTAINER = "epd_p2_trainer"
        widget = TrainWindow(True)
        qtbot.addWidget(widget)
        widget.max_iteration = 100
        widget.checkpoint_period = 100
        widget.test_period = 100
        widget.steps = '(100, 200, 300)'
        p2_trainer = P2Trainer(
            path_to_dataset,
            model_name,
            label_list,
            100,
            100,
            100,
            '(100, 200, 300)')
        p2_trainer.copyExportFiles()
>       assert os.path.exists("p2_exporter") is True
E       AssertionError: assert False is True
E        +  where False = ('p2_exporter')
E        +    where  = .exists
E        +      where  = os.path

test_gui_gpu_local_only.py:856: AssertionError
______________________ test_P2Trainer_runExporter ________________________

qtbot =

    def test_P2Trainer_runExporter(qtbot):
        global PATH_TO_TEST_TRAIN_DATASET
        path_to_dataset = PATH_TO_TEST_TRAIN_DATASET
        model_name = 'fasterrcnn'
        label_list = ['__ignore__', '_background_', 'teabox']
        widget = TrainWindow(True)
        qtbot.addWidget(widget)
        widget.max_iteration = 100
        widget.checkpoint_period = 100
        widget.test_period = 100
        widget.steps = '(100, 200, 300)'
        p2_trainer = P2Trainer(
            path_to_dataset,
            model_name,
            label_list,
            100,
            100,
            100,
            '(100, 200, 300)')
        p2_trainer.runExporter()
        # Check if trained.pth has been generated in root.
>       assert os.path.exists("output.onnx") is True
E       AssertionError: assert False is True
E        +  where False = ('output.onnx')
E        +    where  = .exists
E        +      where  = os.path

test_gui_gpu_local_only.py:898: AssertionError
--------------------------- Captured log call ----------------------------
WARNING  p2_train:P2Trainer.py:699 [ output.onnx ] - MISSING. Something must have failed before this.
========================= short test summary info ========================
FAILED test_gui_gpu_local_only.py::test_P3Trainer_installTrainingDependencies - AssertionError: assert False is True
FAILED test_gui_gpu_local_only.py::test_P3Trainer_copyTrainingFiles - AssertionError: assert False is True
FAILED test_gui_gpu_local_only.py::test_P3Trainer_runTraining - AssertionError: assert False is True
FAILED test_gui_gpu_local_only.py::test_P3Trainer_pullExporterDockerImage - assert 1 == 0
FAILED test_gui_gpu_local_only.py::test_P3Trainer_installExporterDependencies - AssertionError: assert False is True
FAILED test_gui_gpu_local_only.py::test_P3Trainer_copyExportFiles - AssertionError: assert False is True
FAILED test_gui_gpu_local_only.py::test_P3Trainer_runExporter - AssertionError: assert False is True
FAILED test_gui_gpu_local_only.py::test_P2Trainer_installTrainingDependencies - AssertionError: assert False is True
FAILED test_gui_gpu_local_only.py::test_P2Trainer_copyTrainingFiles - AssertionError: assert False is True
FAILED test_gui_gpu_local_only.py::test_P2Trainer_runTraining - AssertionError: assert False is True
FAILED test_gui_gpu_local_only.py::test_P2Trainer_installExporterDependencies - AssertionError: assert False is True
FAILED test_gui_gpu_local_only.py::test_P2Trainer_copyExportFiles - AssertionError: assert False is True
FAILED test_gui_gpu_local_only.py::test_P2Trainer_runExporter - AssertionError: assert False is True
============== 13 failed, 7 passed in 501.61s (0:08:21) ==============
```

Error Abstraction

This is extracted from the aforementioned raw terminal log output:

Error response from daemon: could not select device driver "" with capabilities: [[gpu]]

Solution :partying_face:

Install nvidia-docker2 and reboot:

```bash
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
      && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
            sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
            sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
```
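For reference, installing nvidia-docker2 registers the `nvidia` runtime with Docker via `/etc/docker/daemon.json`; the file should look roughly like this (exact contents may differ slightly across package versions):

```json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```

If GPU access is also needed during `docker build` (not just `docker run`), a common approach is to additionally set `"default-runtime": "nvidia"` in this file and restart Docker.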

Verify :1st_place_medal:

```bash
sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
```

Reference

  1. https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#installing-on-ubuntu-and-debian
mercedes149 commented 2 years ago

Environment

  1. Ubuntu 20.04
  2. Nvidia Driver 515.43.04
    
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 515.43.04    Driver Version: 515.43.04    CUDA Version: 11.7     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Quadro M1200        On   | 00000000:01:00.0 Off |                  N/A |
    | N/A   43C    P8    N/A /  N/A |      7MiB /  4096MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |    0   N/A  N/A      1111      G   /usr/lib/xorg/Xorg                  2MiB |
    |    0   N/A  N/A      1659      G   /usr/lib/xorg/Xorg                  2MiB |
    +-----------------------------------------------------------------------------+


3. CUDA `11.7`
```bash
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0
```
mercedes149 commented 2 years ago

Training Error [ RESOLVED ] :heavy_check_mark:

```bash
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=50 error=30 : unknown error
Traceback (most recent call last):
  File "tools/train_net.py", line 201, in <module>
    main()
  File "tools/train_net.py", line 194, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 39, in train
    model.to(device)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 432, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 208, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 208, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 208, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 230, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 430, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 179, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (30) : unknown error at /pytorch/aten/src/THC/THCGeneral.cpp:50
```

Error Abstraction

RuntimeError: cuda runtime error (30) : unknown error at /pytorch/aten/src/THC/THCGeneral.cpp:50

Solution :partying_face:

Reboot the machine. Verified to work.
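Before rerunning training after the reboot, a quick check that PyTorch can actually reach the CUDA runtime can save a wasted training cycle. A minimal sketch (the `cuda_status` helper is ours, not part of EPD):

```python
def cuda_status():
    """Report whether the CUDA runtime is reachable from PyTorch.

    Returns a short human-readable string instead of raising, so it is
    safe to call on machines without a GPU or without torch installed.
    """
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if not torch.cuda.is_available():
        # A False here is typically the wedged-driver state that later
        # surfaces as 'cuda runtime error (30)' inside model.to(device).
        return "CUDA unavailable: reboot or reinstall the driver"
    return "CUDA OK: " + torch.cuda.get_device_name(0)
```

Printing `cuda_status()` before launching `tools/train_net.py` makes the wedged-driver state obvious up front instead of failing deep inside `model.to(device)`.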

References

  1. https://discuss.pytorch.org/t/resolved-cuda-runtime-error-30/1116/13
mercedes149 commented 2 years ago

Training Error [ WIP ] :x:

The following error is encountered after training for both the P3 and P2 MaskRCNN models has already started; installation of dependencies succeeded.

```bash
RuntimeError: Not compiled with GPU support (nms at /home/user/p2_trainer/maskrcnn_benchmark/csrc/nms.h:22)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f76f0e97273 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: nms(at::Tensor const&, at::Tensor const&, float) + 0x138 (0x7f76e14a3368 in /usr/local/lib/python3.6/dist-packages/maskrcnn_benchmark-0.1-py3.6-linux-x86_64.egg/maskrcnn_benchmark/_C.cpython-36m-x86_64-linux-gnu.so)
frame #2: <unknown function> + 0x1a9c5 (0x7f76e14b39c5 in /usr/local/lib/python3.6/dist-packages/maskrcnn_benchmark-0.1-py3.6-linux-x86_64.egg/maskrcnn_benchmark/_C.cpython-36m-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x18592 (0x7f76e14b1592 in /usr/local/lib/python3.6/dist-packages/maskrcnn_benchmark-0.1-py3.6-linux-x86_64.egg/maskrcnn_benchmark/_C.cpython-36m-x86_64-linux-gnu.so)
<omitting python frames>
frame #6: python() [0x5755f4]
frame #7: python() [0x57ea7b]
frame #8: python() [0x57da3c]
frame #10: python() [0x57521f]
frame #11: python() [0x57ea7b]
...
```

Error Abstraction

RuntimeError: Not compiled with GPU support (nms at /home/user/p2_trainer/maskrcnn_benchmark/csrc/nms.h:22)

Solution :x:

Pending...

```bash
cd $HOME
git clone https://github.com/cardboardcode/easy_perception_deployment --branch dev --depth 1 public_epd
cd ~/public_epd/easy_perception_deployment/gui
# Comment out incremental GUI Local-Only GPU Training pytests
bash run_test_gui_gpu_local_only.bash
# Observe any failing pytests.
```
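A plausible root cause: maskrcnn_benchmark's `setup.py` only compiles the CUDA sources (including `nms`) when a GPU is visible at build time, so building the egg inside a container started without the nvidia runtime yields a CPU-only `_C` extension that later raises `Not compiled with GPU support`. The build-time decision roughly follows this gate (a sketch; `FORCE_CUDA` is honored by some forks and newer versions, not necessarily the revision pinned here):

```python
import os

def should_compile_cuda(cuda_available, cuda_home, env=os.environ):
    """Sketch of the build-time gate used by maskrcnn_benchmark-style
    setup.py scripts to decide whether the CUDA kernels get compiled.

    If this evaluated to False when the egg was built, every CUDA-only
    op (e.g. nms) raises 'Not compiled with GPU support' at runtime.
    """
    force = env.get("FORCE_CUDA", "0") == "1"
    return (cuda_available and cuda_home is not None) or force
```

If that is the cause here, rebuilding the extension inside a container that does have GPU access at build time (or with `FORCE_CUDA=1`, where supported) should make the error go away.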

References

  1. Search Issue 230 under https://github.com/facebookresearch/maskrcnn-benchmark.

cardboardcode commented 2 years ago

Closed with EPD v0.3.2. Verified with repeatable successful dockerized workflow for P2 and P3 training and exporting.