mlcommons / inference_results_v2.0

This repository contains the results and code for the MLPerf™ Inference v2.0 benchmark.
https://mlcommons.org/en/inference-datacenter-20/
Apache License 2.0

Cannot get the performance number for Nvidia Orin #1

Open andyluo7 opened 2 years ago

andyluo7 commented 2 years ago

I have an Orin developer Kit and followed the instructions to run mlcommons inference 2.0 on it. Orin is set to MAXN power mode.

The reported ResNet50 throughput is 6,138.84 fps in Offline, while what I can get is 4,721.85 fps: a 23% gap between the two.
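The quoted gap can be checked directly from the two throughput figures (a quick sketch; the numbers are taken from this comment):

```python
reported = 6138.84  # published Orin MaxP Offline ResNet50 result (samples/s)
measured = 4721.85  # locally measured result (samples/s)

gap = (reported - measured) / reported
print(f"{gap:.1%}")  # → 23.1%
```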

I wonder if there are any settings I should enable to get closer to the reported performance number.

BR, Andy

psyhtest commented 2 years ago

Hi @andyluo7, your result is very similar to the published MaxQ result on the same platform: 4,750.26 samples per second. Have you enabled the MaxQ settings somehow?

psyhtest commented 2 years ago

Reproduced the issue with the following log:

 ======================= Perf harness results: =======================

 Orin_TRT-lwis_k_99_MaxP-Offline:
     resnet50: result_samples_per_second: 4752.66, Result is VALID

@nvpohanh @DilipSequeira

nvpohanh commented 2 years ago

@nv-ananjappa

psyhtest commented 2 years ago

It's even worse for BERT:

 ======================= Perf harness results: =======================

 Orin_TRT-custom_k_99_MaxP-Offline:
     bert: result_samples_per_second: 331.603, Result is VALID

 ======================= Accuracy results: =======================

 Orin_TRT-custom_k_99_MaxP-Offline:
     bert: No accuracy results in PerformanceOnly mode.

 real    16m22.142s
 user    0m7.079s
 sys     0m1.911s

(Expected MaxP: 476; expected MaxQ: 394.)

The system-under-test is Jetson Orin Developer Kit out-of-the-box, set up by running closed/Azure/scripts/install_orin_auto_dependencies.sh.

nv-ananjappa commented 2 years ago

@andyluo7 @psyhtest NVIDIA's Orin submission in MLPerf-Inference 2.0 is a preview submission that uses an internal build of the software that is not yet publicly available. Thus, the results in 2.0 submission are not fully reproducible with software that is publicly available right now from NVIDIA. We appreciate your patience until our available submission in August this year, at which time this software will be publicly released for full reproducibility.

psyhtest commented 2 years ago

Thank you @nv-ananjappa.

Based on this SW note, I expected this might have been the case for ResNet50 and SSD-ResNet34, but not for BERT. All other system information implies no further deviations from released components (e.g. TensorRT 8.4).

nv-ananjappa commented 2 years ago

@psyhtest There will be an updated version of the Jetpack 5.0 software in August that will be publicly available for reproducing all these results. We appreciate your patience until then.

psyhtest commented 2 years ago

Thank you @nv-ananjappa. I would like to corroborate your results in the v2.1 submission round if possible. Let's discuss if this can be arranged separately.

By the way, the Orin installation script misses the onnx package, which appears to be needed for RNNT.
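Until the installation script is fixed, a missing dependency like this can be caught before a long benchmark run. A minimal sketch (the helper name is hypothetical, not part of the harness):

```python
import importlib.util

def have_package(name: str) -> bool:
    """Return True if the package is importable in the current environment."""
    return importlib.util.find_spec(name) is not None

if not have_package("onnx"):
    print("onnx is missing; install it before running RNNT, e.g. `pip install onnx`")
```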

psyhtest commented 2 years ago

Something interesting has happened! I suspect that the JetPack on our Orin Dev Kit has been upgraded from 5.0 DP to 5.0.1 DP. While the release notes say:

JetPack 5.0.1 Developer Preview is a development release and is a minor update to JetPack 5.0 Developer Preview and includes Jetson Linux 34.1.1. JetPack 5.0.1 brings support for DeepStream 6.1. All other features remain the same as JetPack 5.0 Developer Preview.

I got an error trying to run TensorRT engine plans (binaries) built only last week:

[E] [TRT] 6: The engine plan file is not compatible with this version of TensorRT, expecting library version 8.4.0.11 got 8.4.0.9, please rebuild.
[E] [TRT] 4: [runtime.cpp::deserializeCudaEngine::49] Error Code 4: Internal Error (Engine deserialization failed.)

This suggests that TensorRT has been upgraded from 8.4.0.9 to 8.4.0.11, which is confirmed by the headers:

anton@orin:~$ grep NV_TENSORRT /usr/include/aarch64-linux-gnu/NvInferVersion.h
#define NV_TENSORRT_MAJOR 8 //!< TensorRT major version.
#define NV_TENSORRT_MINOR 4 //!< TensorRT minor version.
#define NV_TENSORRT_PATCH 0 //!< TensorRT patch version.
#define NV_TENSORRT_BUILD 11 //!< TensorRT build number.
#define NV_TENSORRT_SONAME_MAJOR 8 //!< Shared object library major version number.
#define NV_TENSORRT_SONAME_MINOR 4 //!< Shared object library minor version number.
#define NV_TENSORRT_SONAME_PATCH 0 //!< Shared object library patch version number.
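The four `#define`s above assemble into the full build string that TensorRT reports in the engine error. A small sketch that extracts it from the header text (using the lines quoted above):

```python
import re

header = """
#define NV_TENSORRT_MAJOR 8 //!< TensorRT major version.
#define NV_TENSORRT_MINOR 4 //!< TensorRT minor version.
#define NV_TENSORRT_PATCH 0 //!< TensorRT patch version.
#define NV_TENSORRT_BUILD 11 //!< TensorRT build number.
"""

def trt_version(text: str) -> str:
    # Pull MAJOR/MINOR/PATCH/BUILD out of NvInferVersion.h-style text
    fields = dict(re.findall(r"#define NV_TENSORRT_(MAJOR|MINOR|PATCH|BUILD) (\d+)", text))
    return ".".join(fields[k] for k in ("MAJOR", "MINOR", "PATCH", "BUILD"))

print(trt_version(header))  # → 8.4.0.11
```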
psyhtest commented 2 years ago

I've rebuilt with TensorRT 8.4.0.11, and things are looking much better!

For BERT Offline, I've got 482.61 QPS vs 476.34 QPS submitted, i.e. +1.3%.

================================================
MLPerf Results Summary
================================================
SUT name : BERT SERVER
Scenario : Offline
Mode     : PerformanceOnly
Samples per second: 482.61
Result is : VALID
  Min duration satisfied : Yes
  Min queries satisfied : Yes
  Early stopping satisfied: Yes
================================================
Additional Stats
================================================
Min latency (ns)                : 1286710393
Max latency (ns)                : 670105831356
Mean latency (ns)               : 407947859481
50.00 percentile latency (ns)   : 434832156292
90.00 percentile latency (ns)   : 638678649328
95.00 percentile latency (ns)   : 657189764504
97.00 percentile latency (ns)   : 663251585386
99.00 percentile latency (ns)   : 668215566704
99.90 percentile latency (ns)   : 670061579278
================================================
Test Parameters Used
================================================
samples_per_query : 323400
target_qps : 490
target_latency (ns): 0
max_async_queries : 1
min_duration (ms): 600000
max_duration (ms): 0
min_query_count : 1
max_query_count : 0
qsl_rng_seed : 6655344265603136530
sample_index_rng_seed : 15863379492028895792
schedule_rng_seed : 12662793979680847247
accuracy_log_rng_seed : 0
accuracy_log_probability : 0
accuracy_log_sampling_target : 0
print_timestamps : 0
performance_issue_unique : 0
performance_issue_same : 0
performance_issue_same_index : 0
performance_sample_count : 10833
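As an aside, the `samples_per_query` values in these Offline logs appear to follow LoadGen's sizing rule of `target_qps` × `min_duration` with a 10% margin; a sketch of that (hedged) relationship, checked against the parameters above:

```python
def offline_samples_per_query(target_qps: float, min_duration_ms: int,
                              margin: float = 1.1) -> int:
    """Apparent LoadGen Offline sizing: enough samples to cover
    min_duration at target_qps, plus a 10% margin."""
    return round(target_qps * (min_duration_ms / 1000) * margin)

print(offline_samples_per_query(490, 600000))  # → 323400, matching the BERT log
```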

For ResNet50, I've got 5,738.38 QPS vs 6,138.84 QPS submitted, i.e. -6.5%.

================================================
MLPerf Results Summary
================================================
SUT name : LWIS_Server
Scenario : Offline
Mode     : PerformanceOnly
Samples per second: 5738.38
Result is : VALID
  Min duration satisfied : Yes
  Min queries satisfied : Yes
  Early stopping satisfied: Yes
================================================
Additional Stats
================================================
Min latency (ns)                : 61897145
Max latency (ns)                : 655585601158
Mean latency (ns)               : 327898632811
50.00 percentile latency (ns)   : 327940353830
90.00 percentile latency (ns)   : 590068967070
95.00 percentile latency (ns)   : 622852286735
97.00 percentile latency (ns)   : 635932675057
99.00 percentile latency (ns)   : 649063742940
99.90 percentile latency (ns)   : 654942756708
================================================
Test Parameters Used
================================================
samples_per_query : 3762000
target_qps : 5700
target_latency (ns): 0
max_async_queries : 1
min_duration (ms): 600000
max_duration (ms): 0
min_query_count : 1
max_query_count : 0
qsl_rng_seed : 6655344265603136530
sample_index_rng_seed : 15863379492028895792
schedule_rng_seed : 12662793979680847247
accuracy_log_rng_seed : 0
accuracy_log_probability : 0
accuracy_log_sampling_target : 0
print_timestamps : 0
performance_issue_unique : 0
performance_issue_same : 0
performance_issue_same_index : 0
performance_sample_count : 2048

As per my previous comment, I expect no material change for SSD-MobileNet and some degradation for SSD-ResNet34.

psyhtest commented 2 years ago

For SSD-MobileNet-v1 Offline, we've got 6,763.58 QPS vs 6,883.49 QPS submitted, i.e. -1.7%.

================================================
MLPerf Results Summary
================================================
SUT name : LWIS_Server
Scenario : Offline
Mode     : PerformanceOnly
Samples per second: 6763.58
Result is : VALID
  Min duration satisfied : Yes
  Min queries satisfied : Yes
  Early stopping satisfied: Yes
================================================
Additional Stats
================================================
Min latency (ns)                : 120927983
Max latency (ns)                : 692827977849
Mean latency (ns)               : 347870241242
50.00 percentile latency (ns)   : 347130061046
90.00 percentile latency (ns)   : 623755650990
95.00 percentile latency (ns)   : 658348103847
97.00 percentile latency (ns)   : 672057411252
99.00 percentile latency (ns)   : 685890157755
99.90 percentile latency (ns)   : 692101057788
================================================
Test Parameters Used
================================================
samples_per_query : 4686000
target_qps : 7100
target_latency (ns): 0
max_async_queries : 1
min_duration (ms): 600000
max_duration (ms): 0
min_query_count : 1
max_query_count : 0
qsl_rng_seed : 6655344265603136530
sample_index_rng_seed : 15863379492028895792
schedule_rng_seed : 12662793979680847247
accuracy_log_rng_seed : 0
accuracy_log_probability : 0
accuracy_log_sampling_target : 0
print_timestamps : 0
performance_issue_unique : 0
performance_issue_same : 0
performance_issue_same_index : 0
performance_sample_count : 1024

For SSD-ResNet34, we've got 177.85 QPS vs 207.66 QPS submitted, i.e. -14.4%.

================================================
MLPerf Results Summary
================================================
SUT name : LWIS_Server
Scenario : Offline
Mode     : PerformanceOnly
Samples per second: 177.848
Result is : VALID
  Min duration satisfied : Yes
  Min queries satisfied : Yes
  Early stopping satisfied: Yes
================================================
Additional Stats
================================================
Min latency (ns)                : 68448063
Max latency (ns)                : 686543022330
Mean latency (ns)               : 342938467611
50.00 percentile latency (ns)   : 342752321756
90.00 percentile latency (ns)   : 617669496881
95.00 percentile latency (ns)   : 652095607692
97.00 percentile latency (ns)   : 665957456573
99.00 percentile latency (ns)   : 679688620847
99.90 percentile latency (ns)   : 685900845499
================================================
Test Parameters Used
================================================
samples_per_query : 122100
target_qps : 185
target_latency (ns): 0
max_async_queries : 1
min_duration (ms): 600000
max_duration (ms): 0
min_query_count : 1
max_query_count : 0
qsl_rng_seed : 6655344265603136530
sample_index_rng_seed : 15863379492028895792
schedule_rng_seed : 12662793979680847247
accuracy_log_rng_seed : 0
accuracy_log_probability : 0
accuracy_log_sampling_target : 0
print_timestamps : 0
performance_issue_unique : 0
performance_issue_same : 0
performance_issue_same_index : 0
performance_sample_count : 64
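For completeness, the relative deltas quoted across these comments (rebuilt with TensorRT 8.4.0.11 vs submitted) can be double-checked in one place:

```python
# (rebuilt, submitted) QPS pairs from the comments in this thread
pairs = {
    "bert":          (482.61, 476.34),
    "resnet50":      (5738.38, 6138.84),
    "ssd-mobilenet": (6763.58, 6883.49),
    "ssd-resnet34":  (177.85, 207.66),
}

deltas = {}
for name, (rebuilt, submitted) in pairs.items():
    deltas[name] = (rebuilt - submitted) / submitted
    print(f"{name}: {deltas[name]:+.1%}")
```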
dyling commented 2 years ago

inference_results_v2.0/closed/NVIDIA$ make run_harness RUN_ARGS="–benchmarks=resnet50 --scenarios=offline --test_mode=PerformanceOnly"
Traceback (most recent call last):
  File "code/main.py", line 63, in <module>
    from code.common.scopedMPS import ScopedMPS, turn_off_mps
  File "/disk_nvme/projects/MDC/4_mlperf/inference_results_v2.0/closed/NVIDIA/code/common/scopedMPS.py", line 20, in <module>
    from code.common.systems.system_list import SystemClassifications
  File "/disk_nvme/projects/MDC/4_mlperf/inference_results_v2.0/closed/NVIDIA/code/common/systems/system_list.py", line 24, in <module>
    from code.common.systems.accelerator import AcceleratorConfiguration, GPU, MIG
  File "/disk_nvme/projects/MDC/4_mlperf/inference_results_v2.0/closed/NVIDIA/code/common/systems/accelerator.py", line 620, in <module>
    ACCELERATOR_INFO_SOURCE: Final[InfoSource] = InfoSource(get_accelerator_info)
  File "/disk_nvme/projects/MDC/4_mlperf/inference_results_v2.0/closed/NVIDIA/code/common/systems/info_source.py", line 38, in __init__
    self.reset(hard=True)
  File "/disk_nvme/projects/MDC/4_mlperf/inference_results_v2.0/closed/NVIDIA/code/common/systems/info_source.py", line 43, in reset
    self.buffer = self.fn()
  File "/disk_nvme/projects/MDC/4_mlperf/inference_results_v2.0/closed/NVIDIA/code/common/systems/accelerator.py", line 108, in get_accelerator_info
    info = run_command(cmd, get_output=True, tee=False, verbose=False)
  File "/disk_nvme/projects/MDC/4_mlperf/inference_results_v2.0/closed/NVIDIA/code/common/__init__.py", line 64, in run_command
    raise subprocess.CalledProcessError(ret, cmd)
subprocess.CalledProcessError: Command 'CUDA_VISIBLE_ORDER=PCI_BUS_ID nvidia-smi -L' returned non-zero exit status 9.
make: *** [Makefile:700: run_harness] Error 1
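Note that the pasted command contains a Unicode en dash (`–benchmarks`) rather than `--benchmarks`, a common artifact of copying commands through rich text (though the traceback itself comes from `nvidia-smi -L` exiting with status 9, i.e. the harness's GPU detection failing). A hypothetical helper to catch such dash substitutions before running:

```python
def fix_dashes(cmd: str) -> str:
    """Replace typographic en dashes (U+2013), which rich-text editors
    often substitute for '--', with a literal double hyphen."""
    return cmd.replace("\u2013", "--")

cmd = 'make run_harness RUN_ARGS="\u2013benchmarks=resnet50 --scenarios=offline"'
print(fix_dashes(cmd))
```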

nv-ananjappa commented 1 year ago

@dyling Since 2.1 code is now available, why not try with that? https://github.com/mlcommons/inference_results_v2.1