mlcommons / cm4mlops

A collection of portable, reusable, and cross-platform automation recipes (CM scripts) with a human-friendly interface and minimal dependencies, designed to make it easier to build, run, benchmark, and optimize AI, ML, and other applications and systems across diverse and continuously changing models, datasets, software, and hardware (cloud/edge).
http://docs.mlcommons.org/cm4mlops/
Apache License 2.0

Container build fails for 3D-UNet-99 #78

Closed · WarrenSchultz closed this issue 1 week ago

WarrenSchultz commented 2 weeks ago

Running the command for ResNet50 works correctly: `cm run script --tags=run-mlperf,inference,_performance-only,_full --division=open --category=edge --device=cuda --model=resnet50 --precision=float32 --implementation=nvidia --backend=tensorrt --scenario=Offline --execution_mode=valid --power=no --adr.python.version_min=3.8 --clean --compliance=no --quiet --time --docker --docker_cache=no`

But the same command fails for 3d-unet-99: `cm run script --tags=run-mlperf,inference,_performance-only,_full --division=open --category=edge --device=cuda --model=3d-unet-99 --precision=float32 --implementation=nvidia --backend=tensorrt --scenario=Offline --execution_mode=valid --power=no --adr.python.version_min=3.8 --clean --compliance=no --quiet --time --docker --docker_cache=no`

Error log:

```
Loading TensorRT plugin from build/plugins/conv3D3X3X3C1K32Plugin/libconv3D3X3X3C1K32Plugin.so
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/actionhandler/base.py", line 189, in subprocess_target
    return self.action_handler.handle()
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/actionhandler/generate_engines.py", line 176, in handle
    total_engine_build_time += self.build_engine(job)
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/actionhandler/generate_engines.py", line 159, in build_engine
    builder = get_benchmark(job.config)
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/__init__.py", line 83, in get_benchmark
    cls = get_cls(G_BENCHMARK_CLASS_MAP[benchmark])
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/__init__.py", line 66, in get_cls
    return getattr(import_module(module_loc.module_path), module_loc.cls_name)
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 848, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/3d-unet/tensorrt/3d-unet.py", line 25, in <module>
    import onnx
ModuleNotFoundError: No module named 'onnx'
[2024-06-19 10:30:07,499 generate_engines.py:173 INFO] Building engines for 3d-unet benchmark in Offline scenario...
Loading TensorRT plugin from build/plugins/pixelShuffle3DPlugin/libpixelshuffle3dplugin.so
Loading TensorRT plugin from build/plugins/conv3D1X1X1K4Plugin/libconv3D1X1X1K4Plugin.so
Loading TensorRT plugin from build/plugins/conv3D3X3X3C1K32Plugin/libconv3D3X3X3C1K32Plugin.so
Process Process-2:
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/actionhandler/base.py", line 189, in subprocess_target
    return self.action_handler.handle()
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/actionhandler/generate_engines.py", line 176, in handle
    total_engine_build_time += self.build_engine(job)
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/actionhandler/generate_engines.py", line 159, in build_engine
    builder = get_benchmark(job.config)
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/__init__.py", line 83, in get_benchmark
    cls = get_cls(G_BENCHMARK_CLASS_MAP[benchmark])
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/__init__.py", line 66, in get_cls
    return getattr(import_module(module_loc.module_path), module_loc.cls_name)
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 848, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/3d-unet/tensorrt/3d-unet.py", line 25, in <module>
    import onnx
ModuleNotFoundError: No module named 'onnx'
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/main.py", line 231, in <module>
    main(main_args, DETECTED_SYSTEM)
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/main.py", line 144, in main
    dispatch_action(main_args, config_dict, workload_setting)
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/main.py", line 202, in dispatch_action
    handler.run()
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/actionhandler/base.py", line 82, in run
    self.handle_failure()
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/actionhandler/base.py", line 186, in handle_failure
    self.action_handler.handle_failure()
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/actionhandler/generate_engines.py", line 184, in handle_failure
    raise RuntimeError("Building engines failed!")
RuntimeError: Building engines failed!
make: *** [Makefile:37: generate_engines] Error 1

CM error: Portable CM script failed (name = app-mlperf-inference-nvidia, return code = 256)
```

However, running 3d-unet-99 within the container built for ResNet50 works correctly.
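For anyone hitting the same `ModuleNotFoundError` before picking up a fix, here is a minimal sketch of how one might confirm and temporarily work around the missing package inside the already-built NVIDIA container. The container ID placeholder and the manual `pip install` are assumptions for illustration, not the upstream fix:

```bash
# List running containers to find the MLPerf NVIDIA one (ID is environment-specific).
docker ps --format '{{.ID}}  {{.Image}}'

# Confirm the onnx module really is missing from the image's Python environment.
docker exec -it <container_id> python3 -c "import onnx; print(onnx.__version__)"

# Stopgap only (assumption, not the upstream fix): install onnx by hand inside
# the container, then retry the engine build from there.
docker exec -it <container_id> python3 -m pip install onnx
```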

arjunsuresh commented 1 week ago

Thanks for reporting this. The problem should be fixed now. We typically launch one Docker image for the NVIDIA implementation and run all the benchmarks inside it, so we missed this issue for 3d-unet.
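If an existing checkout still reproduces the failure, a hedged sketch of how one might pick up the fix, assuming the usual `cm pull repo` update flow; the rerun is the same command quoted above, where `--docker_cache=no` forces a fresh image build:

```bash
# Refresh the cm4mlops repository so the rebuilt image includes the dependency fix.
cm pull repo mlcommons@cm4mlops

# Re-run the previously failing 3d-unet-99 benchmark with a fresh Docker build.
cm run script --tags=run-mlperf,inference,_performance-only,_full \
    --division=open --category=edge --device=cuda --model=3d-unet-99 \
    --precision=float32 --implementation=nvidia --backend=tensorrt \
    --scenario=Offline --execution_mode=valid --power=no \
    --adr.python.version_min=3.8 --clean --compliance=no --quiet \
    --time --docker --docker_cache=no
```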

WarrenSchultz commented 1 week ago

Seems to be working now, thanks!