mlcommons / inference_results_v2.0

This repository contains the results and code for the MLPerf™ Inference v2.0 benchmark.
https://mlcommons.org/en/inference-datacenter-20/
Apache License 2.0

cache_file must not be empty in NVIDIA config files (?) #5

Closed mahmoodn closed 2 years ago

mahmoodn commented 2 years ago

I followed the instructions in the NVIDIA folder to run the workloads on my RTX 3080 machine. All the setup steps completed without problems, but when I tried to run the first workload, I got an error that I don't understand.

(mlperf) mahmood@mlperf-inference-mahmood-x86_64:/work$ make run RUN_ARGS="--benchmarks=resnet50 --scenarios=offline"
make[1]: Entering directory '/work'
[2022-06-03 08:40:04,363 main.py:770 INFO] Detected System ID: KnownSystem.mahmood2022
[2022-06-03 08:40:05,241 main.py:108 INFO] Building engines for resnet50 benchmark in Offline scenario...
[2022-06-03 08:40:05,242 main.py:128 INFO] Building GPU engine for mahmood2022_resnet50_Offline
[2022-06-03 08:40:05,291 ResNet50.py:39 INFO] Using workspace size: 0
[06/03/2022-08:40:05] [TRT] [I] [MemUsageChange] Init CUDA: CPU +349, GPU +0, now: CPU 381, GPU 782 (MiB)
[06/03/2022-08:40:06] [TRT] [I] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 400 MiB, GPU 782 MiB
[06/03/2022-08:40:06] [TRT] [I] [MemUsageSnapshot] End constructing builder kernel library: CPU 775 MiB, GPU 904 MiB
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/work/code/main.py", line 134, in handle_generate_engine
    b.build_engines()
  File "/work/code/common/builder.py", line 169, in build_engines
    self.initialize()
  File "/work/code/resnet50/tensorrt/ResNet50.py", line 81, in initialize
    rn50_gs = RN50GraphSurgeon(self.model_path,
  File "/work/code/resnet50/tensorrt/rn50_graphsurgeon.py", line 310, in __init__
    if os.path.exists(self.cache_file):
  File "/usr/lib/python3.8/genericpath.py", line 19, in exists
    os.stat(path)
TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType
Traceback (most recent call last):
  File "code/main.py", line 772, in <module>
    main(main_args, DETECTED_SYSTEM)
  File "code/main.py", line 744, in main
    dispatch_action(main_args, config_dict, workload_id, equiv_engine_setting=equiv_engine_setting)
  File "code/main.py", line 553, in dispatch_action
    launch_handle_generate_engine(*_gen_args, **_gen_kwargs)
  File "code/main.py", line 92, in launch_handle_generate_engine
    raise RuntimeError("Building engines failed!")
RuntimeError: Building engines failed!
make[1]: *** [Makefile:691: generate_engines] Error 1
make[1]: Leaving directory '/work'
make: *** [Makefile:685: run] Error 2

I checked the system config which has been created like this:

@ConfigRegistry.register(HarnessType.LWIS, AccuracyTarget.k_99, PowerSetting.MaxP)
class MAHMOOD2022(OfflineGPUBaseConfig):
    system = KnownSystem.mahmood2022

    # Applicable fields for this benchmark are listed below. Not all of these are necessary, and some may be defined in the BaseConfig already and inherited.
    # Please see NVIDIA's submission config files for example values and which fields to keep.
    # Required fields (Must be set or inherited to run):
    input_dtype: str = ''
    map_path: str = ''
    precision: str = ''
    tensor_path: str = ''

    # Optional fields:
    active_sms: int = 0
    assume_contiguous: bool = False
    buffer_manager_thread_count: int = 0
    cache_file: str = ''

So cache_file is set to an empty string. The readme shows the same empty value for the example system A30X4_CUSTOM. How can I narrow down the error and fix it?

nv-alicheng commented 2 years ago

Hi mahmoodn, the script generates a stub rather than a complete configuration. You still need to fill in the fields required to run the benchmark on your specific system. You can use an existing configuration, such as the A100-PCIEx1 configuration, as an example.

The cache_file is the path to the INT8 calibration cache file. You can simply delete this key from the config and it will use the default path.
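For readers who hit the same traceback: it ends in os.path.exists(self.cache_file) receiving None, which os.stat rejects with a TypeError. The snippet below is a minimal sketch of the "fall back to a default path" behaviour described above. It is not NVIDIA's code: resolve_cache_file, cache_exists, and DEFAULT_CACHE_FILE are hypothetical names, and the real default path is defined in the repository, not in this thread.

```python
import os
from typing import Optional

# Hypothetical placeholder. The actual default calibration cache path is
# defined in NVIDIA's code and is not shown in this thread.
DEFAULT_CACHE_FILE = "hypothetical_default_calibration.cache"


def resolve_cache_file(cache_file: Optional[str],
                       default: str = DEFAULT_CACHE_FILE) -> str:
    """Return a usable path string, treating None and '' as 'not set'."""
    if not cache_file:
        return default
    return cache_file


def cache_exists(cache_file: Optional[str]) -> bool:
    """Check for the cache without ever handing None to os.path.exists."""
    return os.path.exists(resolve_cache_file(cache_file))
```

With a guard like this, a missing or empty cache_file resolves to the default instead of reaching os.stat as None. In the actual repository the simpler fix is the one suggested above: remove the cache_file line from the generated stub so the default applies.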