Resnet50 error while building engines (NVIDIA)

Following NVIDIA readme file and after adding the system configuration and building the workloads, I tried to run resnet50 benchmark but at the beginning of the execution, I get the following error:

(mlperf) mahmood@mlperf-inference-mahmood-x86-64-26486:/work$ make run RUN_ARGS="--benchmarks=resnet50 --scenarios=offline"

make[1]: Entering directory '/work'
[2024-06-28 07:33:14,784 main.py:229 INFO] Detected system ID: KnownSystem.rtx3080_ryzen3700x
[2024-06-28 07:33:15,654 generate_engines.py:173 INFO] Building engines for resnet50 benchmark in Offline scenario...
[06/28/2024-07:33:15] [TRT] [I] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 35, GPU 823 (MiB)
[06/28/2024-07:33:18] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1799, GPU +306, now: CPU 1969, GPU 1135 (MiB)
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/work/code/actionhandler/base.py", line 189, in subprocess_target
    return self.action_handler.handle()
  File "/work/code/actionhandler/generate_engines.py", line 176, in handle
    total_engine_build_time += self.build_engine(job)
  File "/work/code/actionhandler/generate_engines.py", line 159, in build_engine
    builder = get_benchmark(job.config)
  File "/work/code/__init__.py", line 87, in get_benchmark
    return cls(conf)
  File "/work/code/resnet50/tensorrt/ResNet50.py", line 332, in __init__
    super().__init__(ResNet50EngineBuilderOp(**args))
  File "/work/code/resnet50/tensorrt/ResNet50.py", line 148, in __init__
    if self.batch_size % self.gpu_res2res3_loop_count != 0:
ZeroDivisionError: integer division or modulo by zero
[2024-06-28 07:33:19,719 generate_engines.py:173 INFO] Building engines for resnet50 benchmark in Offline scenario...
[06/28/2024-07:33:19] [TRT] [I] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 35, GPU 823 (MiB)
[06/28/2024-07:33:21] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1799, GPU +310, now: CPU 1969, GPU 1139 (MiB)
Process Process-2:
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/work/code/actionhandler/base.py", line 189, in subprocess_target
    return self.action_handler.handle()
  File "/work/code/actionhandler/generate_engines.py", line 176, in handle
    total_engine_build_time += self.build_engine(job)
  File "/work/code/actionhandler/generate_engines.py", line 159, in build_engine
    builder = get_benchmark(job.config)
  File "/work/code/__init__.py", line 87, in get_benchmark
    return cls(conf)
  File "/work/code/resnet50/tensorrt/ResNet50.py", line 332, in __init__
    super().__init__(ResNet50EngineBuilderOp(**args))
  File "/work/code/resnet50/tensorrt/ResNet50.py", line 148, in __init__
    if self.batch_size % self.gpu_res2res3_loop_count != 0:
ZeroDivisionError: integer division or modulo by zero
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/work/code/main.py", line 231, in <module>
    main(main_args, DETECTED_SYSTEM)
  File "/work/code/main.py", line 144, in main
    dispatch_action(main_args, config_dict, workload_setting)
  File "/work/code/main.py", line 202, in dispatch_action
    handler.run()
  File "/work/code/actionhandler/base.py", line 82, in run
    self.handle_failure()
  File "/work/code/actionhandler/base.py", line 186, in handle_failure
    self.action_handler.handle_failure()
  File "/work/code/actionhandler/generate_engines.py", line 184, in handle_failure
    raise RuntimeError("Building engines failed!")
RuntimeError: Building engines failed!
make[1]: *** [Makefile:37: generate_engines] Error 1
make[1]: Leaving directory '/work'
make: *** [Makefile:31: run] Error 2

The default generated configuration in configs/resnet50/Offline is shown below:

# Generated file by scripts/custom_systems/add_custom_system.py
# Contains configs for all custom systems in code/common/systems/custom_list.py

from . import *

@ConfigRegistry.register(HarnessType.LWIS, AccuracyTarget.k_99, PowerSetting.MaxP)
class RTX3080_RYZEN3700X(OfflineGPUBaseConfig):
    system = KnownSystem.rtx3080_ryzen3700x

    # Applicable fields for this benchmark are listed below. Not all of these are necessary, and some may be defined in the BaseConfig already and inherited.
    # Please see NVIDIA's submission config files for example values and which fields to keep.
    # Required fields (Must be set or inherited to run):
    gpu_batch_size: int = 0
    input_dtype: str = ''
    input_format: str = ''
    map_path: str = ''
    precision: str = ''
    tensor_path: str = ''

    # Optional fields:
    active_sms: int = 0
    assume_contiguous: bool = False
    buffer_manager_thread_count: int = 0
    cache_file: str = ''
    complete_threads: int = 0
    deque_timeout_usec: int = 0
    disable_beta1_smallk: bool = False
    energy_aware_kernels: bool = False
    gpu_copy_streams: int = 0
    gpu_inference_streams: int = 0
    gpu_res2res3_loop_count: int = 0
    instance_group_count: int = 0
    model_path: str = ''
    offline_expected_qps: float = 0.0
    performance_sample_count_override: int = 0
    preferred_batch_size: str = ''
    request_timeout_usec: int = 0
    run_infer_on_copy_streams: bool = False
    use_batcher_thread_per_device: bool = False
    use_cuda_thread_per_device: bool = False
    use_deque_limit: bool = False
    use_graphs: bool = False
    use_jemalloc: bool = False
    use_same_context: bool = False
    use_spin_wait: bool = False
    verbose_glog: int = 0
    warmup_duration: float = 0.0
    workspace_size: int = 0

@ConfigRegistry.register(HarnessType.Triton, AccuracyTarget.k_99, PowerSetting.MaxP)
class RTX3080_RYZEN3700X_Triton(RTX3080_RYZEN3700X):
    use_triton = True

    # Applicable fields for this benchmark are listed below. Not all of these are necessary, and some may be defined in the BaseConfig already and inherited.
    # Please see NVIDIA's submission config files for example values and which fields to keep.
    # Required fields (Must be set or inherited to run):
    gpu_batch_size: int = 0
    input_dtype: str = ''
    input_format: str = ''
    map_path: str = ''
    precision: str = ''
    tensor_path: str = ''

    # Optional fields:
    active_sms: int = 0
    assume_contiguous: bool = False
    batch_triton_requests: bool = False
    buffer_manager_thread_count: int = 0
    cache_file: str = ''
    complete_threads: int = 0
    deque_timeout_usec: int = 0
    disable_beta1_smallk: bool = False
    energy_aware_kernels: bool = False
    gather_kernel_buffer_threshold: int = 0
    gpu_copy_streams: int = 0
    gpu_inference_streams: int = 0
    gpu_res2res3_loop_count: int = 0
    instance_group_count: int = 0
    max_queue_delay_usec: int = 0
    model_path: str = ''
    num_concurrent_batchers: int = 0
    num_concurrent_issuers: int = 0
    offline_expected_qps: float = 0.0
    output_pinned_memory: bool = False
    performance_sample_count_override: int = 0
    preferred_batch_size: str = ''
    request_timeout_usec: int = 0
    run_infer_on_copy_streams: bool = False
    use_batcher_thread_per_device: bool = False
    use_concurrent_harness: bool = False
    use_cuda_thread_per_device: bool = False
    use_deque_limit: bool = False
    use_graphs: bool = False
    use_jemalloc: bool = False
    use_same_context: bool = False
    use_spin_wait: bool = False
    verbose_glog: int = 0
    warmup_duration: float = 0.0
    workspace_size: int = 0

I thought that gpu_batch_size: int = 0 is causing the problem, but changing that to 1 resulted in the same error. I also checked that nvidia-smi works as below:

(mlperf) mahmood@mlperf-inference-mahmood-x86-64-26486:/work$ nvidia-smi 
Fri Jun 28 07:42:30 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3080        Off | 00000000:2D:00.0  On |                  N/A |
|  0%   54C    P8              33W / 370W |    239MiB / 10240MiB |      7%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

Any idea about that?

mlcommons / inference_results_v4.0

Resnet50 error while building engines (NVIDIA) #12