mlcommons / cm4mlops

A collection of portable, reusable, and cross-platform automation recipes (CM scripts) with a human-friendly interface and minimal dependencies, designed to make it easier to build, run, benchmark, and optimize AI, ML, and other applications and systems across diverse and continuously changing models, datasets, software, and hardware (cloud/edge).
http://docs.mlcommons.org/cm4mlops/
Apache License 2.0

Container build fails for 3D-UNet-99 #78

Closed · WarrenSchultz closed this issue 1 week ago

WarrenSchultz commented 2 weeks ago

Running the command for ResNet50 works correctly: `cm run script --tags=run-mlperf,inference,_performance-only,_full --division=open --category=edge --device=cuda --model=resnet50 --precision=float32 --implementation=nvidia --backend=tensorrt --scenario=Offline --execution_mode=valid --power=no --adr.python.version_min=3.8 --clean --compliance=no --quiet --time --docker --docker_cache=no`

But the same command fails for 3d-unet-99: `cm run script --tags=run-mlperf,inference,_performance-only,_full --division=open --category=edge --device=cuda --model=3d-unet-99 --precision=float32 --implementation=nvidia --backend=tensorrt --scenario=Offline --execution_mode=valid --power=no --adr.python.version_min=3.8 --clean --compliance=no --quiet --time --docker --docker_cache=no`

Error log:

```
Loading TensorRT plugin from build/plugins/conv3D3X3X3C1K32Plugin/libconv3D3X3X3C1K32Plugin.so
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/actionhandler/base.py", line 189, in subprocess_target
    return self.action_handler.handle()
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/actionhandler/generate_engines.py", line 176, in handle
    total_engine_build_time += self.build_engine(job)
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/actionhandler/generate_engines.py", line 159, in build_engine
    builder = get_benchmark(job.config)
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/__init__.py", line 83, in get_benchmark
    cls = get_cls(G_BENCHMARK_CLASS_MAP[benchmark])
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/__init__.py", line 66, in get_cls
    return getattr(import_module(module_loc.module_path), module_loc.cls_name)
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 848, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/3d-unet/tensorrt/3d-unet.py", line 25, in <module>
    import onnx
ModuleNotFoundError: No module named 'onnx'
[2024-06-19 10:30:07,499 generate_engines.py:173 INFO] Building engines for 3d-unet benchmark in Offline scenario...
Loading TensorRT plugin from build/plugins/pixelShuffle3DPlugin/libpixelshuffle3dplugin.so
Loading TensorRT plugin from build/plugins/conv3D1X1X1K4Plugin/libconv3D1X1X1K4Plugin.so
Loading TensorRT plugin from build/plugins/conv3D3X3X3C1K32Plugin/libconv3D3X3X3C1K32Plugin.so
Process Process-2:
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/actionhandler/base.py", line 189, in subprocess_target
    return self.action_handler.handle()
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/actionhandler/generate_engines.py", line 176, in handle
    total_engine_build_time += self.build_engine(job)
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/actionhandler/generate_engines.py", line 159, in build_engine
    builder = get_benchmark(job.config)
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/__init__.py", line 83, in get_benchmark
    cls = get_cls(G_BENCHMARK_CLASS_MAP[benchmark])
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/__init__.py", line 66, in get_cls
    return getattr(import_module(module_loc.module_path), module_loc.cls_name)
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 848, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/3d-unet/tensorrt/3d-unet.py", line 25, in <module>
    import onnx
ModuleNotFoundError: No module named 'onnx'
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/main.py", line 231, in <module>
    main(main_args, DETECTED_SYSTEM)
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/main.py", line 144, in main
    dispatch_action(main_args, config_dict, workload_setting)
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/main.py", line 202, in dispatch_action
    handler.run()
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/actionhandler/base.py", line 82, in run
    self.handle_failure()
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/actionhandler/base.py", line 186, in handle_failure
    self.action_handler.handle_failure()
  File "/home/cmuser/CM/repos/local/cache/be4b540d34434756/repo/closed/NVIDIA/code/actionhandler/generate_engines.py", line 184, in handle_failure
    raise RuntimeError("Building engines failed!")
RuntimeError: Building engines failed!
make: *** [Makefile:37: generate_engines] Error 1

CM error: Portable CM script failed (name = app-mlperf-inference-nvidia, return code = 256)
```

However, running 3d-unet-99 within the container built for ResNet50 works correctly.
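For anyone hitting the same `ModuleNotFoundError` before picking up a fix, here is a minimal sketch of how one might confirm and temporarily work around the missing package inside the already-built NVIDIA container. The container ID placeholder and the manual `pip install` are assumptions for illustration, not the upstream fix:

```bash
# List running containers to find the MLPerf NVIDIA one (ID is environment-specific).
docker ps --format '{{.ID}}  {{.Image}}'

# Confirm the onnx module really is missing from the image's Python environment.
docker exec -it <container_id> python3 -c "import onnx; print(onnx.__version__)"

# Stopgap only (assumption, not the upstream fix): install onnx by hand inside
# the container, then retry the engine build from there.
docker exec -it <container_id> python3 -m pip install onnx
```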

arjunsuresh commented 1 week ago

Thanks for reporting this. The problem should be fixed now. We typically launch one Docker image for the NVIDIA implementation and run all the benchmarks inside it, so we missed this issue for 3d-unet.
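If an existing checkout still reproduces the failure, a hedged sketch of how one might pick up the fix, assuming the usual `cm pull repo` update flow; the rerun is the same command quoted above, where `--docker_cache=no` forces a fresh image build:

```bash
# Refresh the cm4mlops repository so the rebuilt image includes the dependency fix.
cm pull repo mlcommons@cm4mlops

# Re-run the previously failing 3d-unet-99 benchmark with a fresh Docker build.
cm run script --tags=run-mlperf,inference,_performance-only,_full \
    --division=open --category=edge --device=cuda --model=3d-unet-99 \
    --precision=float32 --implementation=nvidia --backend=tensorrt \
    --scenario=Offline --execution_mode=valid --power=no \
    --adr.python.version_min=3.8 --clean --compliance=no --quiet \
    --time --docker --docker_cache=no
```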

WarrenSchultz commented 1 week ago

Seems to be working now, thanks!