mlcommons / inference_results_v3.0

This repository contains the results and code for the MLPerf™ Inference v3.0 benchmark.
https://mlcommons.org/en/inference-datacenter-30/
Apache License 2.0

NVIDIA make generate_engines Error Code 4: Internal Error Network has dynamic or shape inputs #12

Open wohenniubi opened 1 year ago

wohenniubi commented 1 year ago

The detailed error is as follows:

```
(mlperf) user@mlperf-inference-user-x86_64:/work$ make generate_engines RUN_ARGS="--benchmarks=bert --scenarios=offline"
[2023-07-12 18:56:08,807 main.py:231 INFO] Detected system ID: KnownSystem.H100_PCIe_80GB_Custom
[2023-07-12 18:56:11,032 generate_engines.py:172 INFO] Building engines for bert benchmark in Offline scenario...
[2023-07-12 18:56:11,057 bert_var_seqlen.py:67 INFO] Using workspace size: 0
[07/12/2023-18:56:11] [TRT] [I] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 38, GPU 928 (MiB)
[07/12/2023-18:56:16] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +2981, GPU +750, now: CPU 3096, GPU 1680 (MiB)
[07/12/2023-18:56:18] [TRT] [W] Tensor DataType is determined at build time for tensors not marked as input or output.
[07/12/2023-18:56:18] [TRT] [I] Using default for use_int8_scale_max: true
[07/12/2023-18:56:18] [TRT] [W] Tensor DataType is determined at build time for tensors not marked as input or output.
...
[07/12/2023-18:56:18] [TRT] [W] Tensor DataType is determined at build time for tensors not marked as input or output.
[2023-07-12 18:56:18,733 bert_var_seqlen.py:215 INFO] Building ./build/engines/H100_PCIe_80GB_Custom/bert/Offline/bert-Offline-gpu-_S_384_B_0_P_0_vs.custom_k_99_MaxP.plan
[07/12/2023-18:56:18] [TRT] [E] 4: [network.cpp::validate::3036] Error Code 4: Internal Error (Network has dynamic or shape inputs, but no optimization profile has been defined.)
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/work/code/actionhandler/base.py", line 189, in subprocess_target
    return self.action_handler.handle()
  File "/work/code/actionhandler/generate_engines.py", line 175, in handle
    total_engine_build_time += self.build_engine(job)
  File "/work/code/actionhandler/generate_engines.py", line 166, in build_engine
    builder.build_engines()
  File "/work/code/bert/tensorrt/bert_var_seqlen.py", line 231, in build_engines
    assert engine is not None, "Engine Build Failed!"
AssertionError: Engine Build Failed!
```
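For context on the TensorRT error itself: "Network has dynamic or shape inputs, but no optimization profile has been defined" means the network has at least one input with a dynamic dimension (e.g. a -1 batch dimension), but no optimization profile was attached to the builder config before building. A minimal sketch of what TensorRT normally expects is below; the input name and shapes are illustrative, not the actual ones used by this harness:

```python
# Sketch only: attaching an optimization profile for a dynamic-shape input.
# Requires a TensorRT installation; guarded so the snippet loads without one.
try:
    import tensorrt as trt
except ImportError:
    trt = None  # illustrative sketch; TensorRT not available in this environment


def add_bert_profile(builder, config, max_batch=128, seq_len=384):
    """Attach an optimization profile covering batch sizes 1..max_batch.

    'input_ids' and the (batch, seq_len) layout are assumptions for the
    sake of illustration; the real BERT var-seqlen engine uses packed
    inputs with different names and shapes.
    """
    profile = builder.create_optimization_profile()
    profile.set_shape(
        "input_ids",
        (1, seq_len),          # min
        (max_batch, seq_len),  # opt
        (max_batch, seq_len),  # max
    )
    config.add_optimization_profile(profile)
    return config
```

A batch size of 0 would make it impossible to define a valid min/opt/max range, which is consistent with the failure seen in the log.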


```
============= DETECTED SYSTEM ==============

SystemConfiguration: System ID (Optional Alias): H100_PCIe_80GB_Custom
  CPUConfiguration: 2x CPU (CPUArchitecture.x86_64): Intel(R) Xeon(R) Platinum 8480+ (56 Cores, 2 Threads/Core)
  MemoryConfiguration: 528.08 GB (Matching Tolerance: 0.05)
  AcceleratorConfiguration: 2x GPU (0x233110DE): NVIDIA H100 PCIe
    AcceleratorType: Discrete
    SM Compute Capability: 90
    Memory Capacity: 79.65 GiB
    Max Power Limit: 310.0 W
  NUMA Config String: &
```


![image](https://github.com/mlcommons/inference_results_v3.0/assets/13992754/fd455f8a-7fc2-4a7c-ac7d-e8f8cca4d2c8)

Thanks for any hints on this issue.
arjunsuresh commented 1 year ago

Here, the "B_0" in the engine name means the batch size being used is 0, which is invalid.
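To make the diagnosis concrete, here is a hypothetical reconstruction (not the harness's actual code) of how the engine file name from the log encodes sequence length (S), batch size (B), and profile count (P); the B field showing 0 is the red flag:

```python
# Illustrative helper mirroring the engine-name pattern seen in the log.
# The function name and signature are assumptions for this example only.
def engine_name(seq_len, batch_size, profiles, config_ver="custom_k_99_MaxP"):
    """Reconstruct an engine plan file name in the observed format."""
    return (
        f"bert-Offline-gpu-_S_{seq_len}_B_{batch_size}"
        f"_P_{profiles}_vs.{config_ver}.plan"
    )

# The name from the failing build, with batch size 0:
name = engine_name(384, 0, 0)
```

A batch size of 0 here usually suggests the benchmark configuration for the custom system was not picked up, so the batch size was never set to a valid value.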

We support the NVIDIA implementation inside CM and have currently tested it on L4, T4, A100, and RTX 4090. We'll be very happy to assist you if you can test it on H100. The instructions are here: https://github.com/mlcommons/ck/blob/master/docs/mlperf/inference/bert/README_nvidia.md, and our public Discord channel is open for any queries.