mlcommons / inference_results_v2.0

This repository contains the results and code for the MLPerf™ Inference v2.0 benchmark.
https://mlcommons.org/en/inference-datacenter-20/
Apache License 2.0

Orin TensorRT MaxQ configuration #3

Open psyhtest opened 2 years ago

psyhtest commented 2 years ago

I've built two TensorRT engine plans for BERT Offline as follows:

$ make generate_engines RUN_ARGS="--benchmarks=bert --scenarios=offline"
$ make generate_engines RUN_ARGS="--benchmarks=bert --scenarios=offline --config_ver=maxq"
$ find build/engines/Orin/bert/Offline/ -name '*.plan' -exec du -hs {} \;
355M    build/engines/Orin/bert/Offline/bert-Offline-gpu-int8_S_384_B_256_P_1_vs_il.custom_k_99_MaxP.plan
355M    build/engines/Orin/bert/Offline/bert-Offline-gpu-int8_S_384_B_256_P_1_vs_il.custom_k_99_MaxQ.plan

(The files are practically the same size, but their contents differ slightly, perhaps due to embedded timestamps?)
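To pin down how different the two plans really are, one could hash them and count differing bytes. A minimal sketch (the `diff_stats` helper is mine, not part of the repo; for the real check, pass the two `.plan` files read in binary mode):

```python
import hashlib

def diff_stats(a: bytes, b: bytes):
    """Return (sha256_a, sha256_b, count of differing byte positions) for two equal-length blobs."""
    assert len(a) == len(b), "plans of different sizes need a smarter diff"
    n = sum(x != y for x, y in zip(a, b))
    return hashlib.sha256(a).hexdigest(), hashlib.sha256(b).hexdigest(), n

# Synthetic stand-ins for the two plan files; in practice use open(path, "rb").read()
h1, h2, n = diff_stats(b"engine-maxp-0001", b"engine-maxq-0002")
print(n)  # number of bytes that differ
```

If `n` is tiny and localized, a timestamp or build-ID field is the likely culprit rather than a genuinely different engine.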

I've validated that if I run either plan with:

$ time MLPERF_SCRATCH_PATH=/datasets/nvidia_scratch make run_harness \
RUN_ARGS="--benchmarks=bert --scenarios=offline --test_mode=PerformanceOnly"

the performance is up to 489 samples/second and the power drawn is around 78 W (as observed on the front panel of the attached WT310E power meter). (To validate this for the *.MaxQ.plan file in the same setting, I copied it over the *.MaxP.plan file and re-ran the experiment.)

================================================
MLPerf Results Summary
================================================
SUT name : BERT SERVER
Scenario : Offline
Mode     : PerformanceOnly
Samples per second: 488.866
Result is : VALID
  Min duration satisfied : Yes
  Min queries satisfied : Yes
  Early stopping satisfied: Yes
================================================
Additional Stats
================================================
Min latency (ns)                : 1284470274
Max latency (ns)                : 661531233338
Mean latency (ns)               : 403120249345
50.00 percentile latency (ns)   : 429991798154
90.00 percentile latency (ns)   : 630211888645
95.00 percentile latency (ns)   : 648689537962
97.00 percentile latency (ns)   : 654728337926
99.00 percentile latency (ns)   : 659634040609
99.90 percentile latency (ns)   : 661485062456
================================================
Test Parameters Used
================================================
samples_per_query : 323400
target_qps : 490
target_latency (ns): 0
max_async_queries : 1
min_duration (ms): 600000
max_duration (ms): 0
min_query_count : 1
max_query_count : 0
qsl_rng_seed : 6655344265603136530
sample_index_rng_seed : 15863379492028895792
schedule_rng_seed : 12662793979680847247
accuracy_log_rng_seed : 0
accuracy_log_probability : 0
accuracy_log_sampling_target : 0
print_timestamps : 0
performance_issue_unique : 0
performance_issue_same : 0
performance_issue_same_index : 0
performance_sample_count : 10833

No warnings encountered during test.

No errors encountered during test.
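As a side note, the `samples_per_query` values in both logs are consistent with LoadGen sizing the single Offline query as target_qps × min_duration × 1.1. The 1.1 margin is my inference from these two logs, not a quote from the LoadGen documentation:

```python
def offline_samples_per_query(target_qps: float, min_duration_ms: int) -> int:
    # Apparent rule: cover min_duration at target_qps with a ~10% safety margin
    return int(target_qps * (min_duration_ms / 1000) * 1.1)

maxp = offline_samples_per_query(490, 600000)  # matches the MaxP log: 323400
maxq = offline_samples_per_query(280, 600000)  # matches the MaxQ log: 184800
print(maxp, maxq)
```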
psyhtest commented 2 years ago

However, when I launch the MaxQ configuration:

$ time MLPERF_SCRATCH_PATH=/datasets/nvidia_scratch make run_harness \
RUN_ARGS="--benchmarks=bert --scenarios=offline --test_mode=PerformanceOnly --config_ver=maxq"

the performance is only up to 285 samples/second and the power drawn is around 36.5 W.

================================================
MLPerf Results Summary
================================================
SUT name : BERT SERVER
Scenario : Offline
Mode     : PerformanceOnly
Samples per second: 284.539
Result is : VALID
  Min duration satisfied : Yes
  Min queries satisfied : Yes
  Early stopping satisfied: Yes
================================================
Additional Stats
================================================
Min latency (ns)                : 2144816516
Max latency (ns)                : 649472731078
Mean latency (ns)               : 396134010983
50.00 percentile latency (ns)   : 422093178203
90.00 percentile latency (ns)   : 619288975526
95.00 percentile latency (ns)   : 636880414359
97.00 percentile latency (ns)   : 643005549978
99.00 percentile latency (ns)   : 647674403103
99.90 percentile latency (ns)   : 649472605475
================================================
Test Parameters Used
================================================
samples_per_query : 184800
target_qps : 280
target_latency (ns): 0
max_async_queries : 1
min_duration (ms): 600000
max_duration (ms): 0
min_query_count : 1
max_query_count : 0
qsl_rng_seed : 6655344265603136530
sample_index_rng_seed : 15863379492028895792
schedule_rng_seed : 12662793979680847247
accuracy_log_rng_seed : 0
accuracy_log_probability : 0
accuracy_log_sampling_target : 0
print_timestamps : 0
performance_issue_unique : 0
performance_issue_same : 0
performance_issue_same_index : 0
performance_sample_count : 10833

The submitted result is: 394.33 QPS @ 53.59 W (7.36 QPS/W).

The fact that NVIDIA used an internal version of the software does not seem to matter for BERT, so I'm wondering why I cannot reproduce the submitted MaxQ performance/power results.
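For comparison, here are the three data points side by side with their implied efficiency (plain arithmetic on the numbers quoted above):

```python
# Throughput and wall power for the three data points in this thread
runs = {
    "MaxP measured":  (488.866, 78.0),
    "MaxQ measured":  (284.539, 36.5),
    "MaxQ submitted": (394.33, 53.59),
}
efficiency = {name: qps / watts for name, (qps, watts) in runs.items()}
for name, eff in efficiency.items():
    qps, watts = runs[name]
    print(f"{name}: {qps:.1f} QPS @ {watts:.2f} W -> {eff:.2f} QPS/W")
```

Interestingly, the measured MaxQ run is already *more* efficient (QPS/W) than the submitted result; what's missing is absolute throughput.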

psyhtest commented 2 years ago
$ cat build/nvpmodel.temp.conf
< PARAM TYPE=FILE NAME=CPU_ONLINE >
CORE_0 /sys/devices/system/cpu/cpu0/online
CORE_1 /sys/devices/system/cpu/cpu1/online
CORE_2 /sys/devices/system/cpu/cpu2/online
CORE_3 /sys/devices/system/cpu/cpu3/online
CORE_4 /sys/devices/system/cpu/cpu4/online
CORE_5 /sys/devices/system/cpu/cpu5/online
CORE_6 /sys/devices/system/cpu/cpu6/online
CORE_7 /sys/devices/system/cpu/cpu7/online
CORE_8 /sys/devices/system/cpu/cpu8/online
CORE_9 /sys/devices/system/cpu/cpu9/online
CORE_10 /sys/devices/system/cpu/cpu10/online
CORE_11 /sys/devices/system/cpu/cpu11/online

< PARAM TYPE=FILE NAME=TPC_POWER_GATING >
TPC_PG_MASK /sys/devices/gpu.0/tpc_pg_mask

< PARAM TYPE=FILE NAME=GPU_POWER_CONTROL_ENABLE >
GPU_PWR_CNTL_EN /sys/devices/gpu.0/power/control

< PARAM TYPE=FILE NAME=GPU_POWER_CONTROL_DISABLE >
GPU_PWR_CNTL_DIS /sys/devices/gpu.0/power/control

< PARAM TYPE=CLOCK NAME=CPU_A78_0 >
FREQ_TABLE /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
MAX_FREQ /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
MIN_FREQ /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
FREQ_TABLE_KNEXT /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
MAX_FREQ_KNEXT /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
MIN_FREQ_KNEXT /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq

< PARAM TYPE=CLOCK NAME=CPU_A78_1 >
FREQ_TABLE /sys/devices/system/cpu/cpu4/cpufreq/scaling_available_frequencies
MAX_FREQ /sys/devices/system/cpu/cpu4/cpufreq/scaling_max_freq
MIN_FREQ /sys/devices/system/cpu/cpu4/cpufreq/scaling_min_freq
FREQ_TABLE_KNEXT /sys/devices/system/cpu/cpu4/cpufreq/scaling_available_frequencies
MAX_FREQ_KNEXT /sys/devices/system/cpu/cpu4/cpufreq/scaling_max_freq
MIN_FREQ_KNEXT /sys/devices/system/cpu/cpu4/cpufreq/scaling_min_freq

< PARAM TYPE=CLOCK NAME=CPU_A78_2 >
FREQ_TABLE /sys/devices/system/cpu/cpu8/cpufreq/scaling_available_frequencies
MAX_FREQ /sys/devices/system/cpu/cpu8/cpufreq/scaling_max_freq
MIN_FREQ /sys/devices/system/cpu/cpu8/cpufreq/scaling_min_freq
FREQ_TABLE_KNEXT /sys/devices/system/cpu/cpu8/cpufreq/scaling_available_frequencies
MAX_FREQ_KNEXT /sys/devices/system/cpu/cpu8/cpufreq/scaling_max_freq
MIN_FREQ_KNEXT /sys/devices/system/cpu/cpu8/cpufreq/scaling_min_freq

< PARAM TYPE=CLOCK NAME=GPU >
FREQ_TABLE /sys/devices/17000000.ga10b/devfreq/17000000.ga10b/available_frequencies
MAX_FREQ /sys/devices/17000000.ga10b/devfreq/17000000.ga10b/max_freq
MIN_FREQ /sys/devices/17000000.ga10b/devfreq/17000000.ga10b/min_freq
FREQ_TABLE_KNEXT /sys/devices/17000000.ga10b/devfreq_dev/available_frequencies
MAX_FREQ_KNEXT /sys/devices/17000000.ga10b/devfreq_dev/max_freq
MIN_FREQ_KNEXT /sys/devices/17000000.ga10b/devfreq_dev/min_freq

< PARAM TYPE=CLOCK NAME=DLA0_CORE >
MAX_FREQ /sys/devices/platform/13e40000.host1x/15880000.nvdla0/acm/clk_cap/dla0_core
MAX_FREQ_KNEXT /sys/devices/platform/13e40000.host1x/15880000.nvdla0/acm/clk_cap/dla0_core

< PARAM TYPE=CLOCK NAME=DLA0_FALCON >
MAX_FREQ /sys/devices/platform/13e40000.host1x/15880000.nvdla0/acm/clk_cap/dla0_falcon
MAX_FREQ_KNEXT /sys/devices/platform/13e40000.host1x/15880000.nvdla0/acm/clk_cap/dla0_falcon

< PARAM TYPE=CLOCK NAME=DLA1_CORE >
MAX_FREQ /sys/devices/platform/13e40000.host1x/158c0000.nvdla1/acm/clk_cap/dla1_core
MAX_FREQ_KNEXT /sys/devices/platform/13e40000.host1x/158c0000.nvdla1/acm/clk_cap/dla1_core

< PARAM TYPE=CLOCK NAME=DLA1_FALCON >
MAX_FREQ /sys/devices/platform/13e40000.host1x/158c0000.nvdla1/acm/clk_cap/dla1_falcon
MAX_FREQ_KNEXT /sys/devices/platform/13e40000.host1x/158c0000.nvdla1/acm/clk_cap/dla1_falcon

< PARAM TYPE=CLOCK NAME=PVA0_VPS >
MAX_FREQ /sys/devices/platform/13e40000.host1x/16000000.pva0/acm/clk_cap/pva0_vps
MAX_FREQ_KNEXT /sys/devices/platform/13e40000.host1x/16000000.pva0/acm/clk_cap/pva0_vps

< PARAM TYPE=CLOCK NAME=PVA0_AXI >
MAX_FREQ /sys/devices/platform/13e40000.host1x/16000000.pva0/acm/clk_cap/pva0_cpu_axi
MAX_FREQ_KNEXT /sys/devices/platform/13e40000.host1x/16000000.pva0/acm/clk_cap/pva0_cpu_axi

< PARAM TYPE=FILE NAME=EMC >
MAX_FREQ /sys/kernel/nvpmodel_emc_cap/emc_iso_cap

< POWER_MODEL ID=0 NAME=MAX_Q >
CPU_ONLINE CORE_0 1
CPU_ONLINE CORE_1 1
CPU_ONLINE CORE_2 1
CPU_ONLINE CORE_3 1
CPU_ONLINE CORE_4 0
CPU_ONLINE CORE_5 0
CPU_ONLINE CORE_6 0
CPU_ONLINE CORE_7 0
CPU_ONLINE CORE_8 0
CPU_ONLINE CORE_9 0
CPU_ONLINE CORE_10 0
CPU_ONLINE CORE_11 0
TPC_POWER_GATING TPC_PG_MASK 0
GPU_POWER_CONTROL_ENABLE GPU_PWR_CNTL_EN on
CPU_A78_0 MIN_FREQ 1036800
CPU_A78_0 MAX_FREQ 1036800
CPU_A78_1 MIN_FREQ 1036800
CPU_A78_1 MAX_FREQ 1036800
CPU_A78_2 MIN_FREQ 1036800
CPU_A78_2 MAX_FREQ 1036800
GPU MIN_FREQ 726750000
GPU MAX_FREQ 726750000
GPU_POWER_CONTROL_DISABLE GPU_PWR_CNTL_DIS auto
DLA0_CORE MAX_FREQ 0
DLA1_CORE MAX_FREQ 0
DLA0_FALCON MAX_FREQ -1
DLA1_FALCON MAX_FREQ -1
PVA0_VPS MAX_FREQ -1
PVA0_AXI MAX_FREQ -1
EMC MAX_FREQ 2133000000
# mandatory section to configure the default power mode
< PM_CONFIG DEFAULT=0 >
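For anyone who wants to inspect these caps programmatically, here is a rough sketch of pulling the MAX_Q settings out of the conf text. The parser is illustrative only (it is not how nvpmodel itself reads the file), and the embedded excerpt is abbreviated:

```python
def maxq_settings(conf_text: str) -> dict:
    """Collect `PARAM NAME VALUE` lines under the MAX_Q power-model section."""
    settings, in_maxq = {}, False
    for line in conf_text.splitlines():
        line = line.strip()
        if line.startswith("<"):                      # section header line
            in_maxq = "NAME=MAX_Q" in line
            continue
        if in_maxq and line and not line.startswith("#"):
            param, name, value = line.split()
            settings[(param, name)] = value
    return settings

# Abbreviated excerpt of build/nvpmodel.temp.conf shown above
conf = """\
< POWER_MODEL ID=0 NAME=MAX_Q >
CPU_A78_0 MAX_FREQ 1036800
GPU MAX_FREQ 726750000
EMC MAX_FREQ 2133000000
# mandatory section to configure the default power mode
< PM_CONFIG DEFAULT=0 >
"""
print(maxq_settings(conf)[("GPU", "MAX_FREQ")])
```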
psyhtest commented 2 years ago

@nv-ananjappa

psyhtest commented 2 years ago

In summary, it looks like we have reproduced the expected performance according to the submitted configuration (285 QPS):

@ConfigRegistry.register(HarnessType.Custom, AccuracyTarget.k_99, PowerSetting.MaxQ)
class Orin_MaxQ(Orin):
    soc_cpu_freq = 1036800
    soc_gpu_freq = 726750000
    soc_dla_freq = 0
    soc_emc_freq = 2133000000
    orin_num_cores = 4
    offline_expected_qps = 280

but it's NOT the same configuration that NVIDIA used for the submitted experiment (394 QPS).
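A quick numeric check of that conclusion (all figures copied from the logs and config above):

```python
measured_qps = 284.539   # my MaxQ run
expected_qps = 280.0     # offline_expected_qps in the published Orin_MaxQ config
submitted_qps = 394.33   # NVIDIA's submitted MaxQ result

# Within ~2% of what the published config targets...
assert abs(measured_qps - expected_qps) / expected_qps < 0.02
# ...but more than 25% below the submitted figure.
assert (submitted_qps - measured_qps) / submitted_qps > 0.25
print("measurement matches the published config, not the submission")
```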