psyhtest opened 2 years ago
However, when I launch the MaxQ configuration:
$ time MLPERF_SCRATCH_PATH=/datasets/nvidia_scratch make run_harness \
RUN_ARGS="--benchmarks=bert --scenarios=offline --test_mode=PerformanceOnly --config_ver=maxq"
the performance is only up to 285 QPS and the power drawn is around 36.5 Watts.
================================================
MLPerf Results Summary
================================================
SUT name : BERT SERVER
Scenario : Offline
Mode     : PerformanceOnly
Samples per second: 284.539
Result is : VALID
  Min duration satisfied : Yes
  Min queries satisfied : Yes
  Early stopping satisfied: Yes

================================================
Additional Stats
================================================
Min latency (ns)                : 2144816516
Max latency (ns)                : 649472731078
Mean latency (ns)               : 396134010983
50.00 percentile latency (ns)   : 422093178203
90.00 percentile latency (ns)   : 619288975526
95.00 percentile latency (ns)   : 636880414359
97.00 percentile latency (ns)   : 643005549978
99.00 percentile latency (ns)   : 647674403103
99.90 percentile latency (ns)   : 649472605475

================================================
Test Parameters Used
================================================
samples_per_query : 184800
target_qps : 280
target_latency (ns): 0
max_async_queries : 1
min_duration (ms): 600000
max_duration (ms): 0
min_query_count : 1
max_query_count : 0
qsl_rng_seed : 6655344265603136530
sample_index_rng_seed : 15863379492028895792
schedule_rng_seed : 12662793979680847247
accuracy_log_rng_seed : 0
accuracy_log_probability : 0
accuracy_log_sampling_target : 0
print_timestamps : 0
performance_issue_unique : 0
performance_issue_same : 0
performance_issue_same_index : 0
performance_sample_count : 10833
The submitted result is: 394.33 QPS @ 53.59 W (7.36 QPS/W).
The fact that NVIDIA used an internal version of the software does not seem to matter for BERT, so I'm wondering why I cannot reproduce the submitted MaxQ performance/power results.
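For what it's worth, a quick back-of-the-envelope comparison of the two results (numbers taken directly from the measurements quoted above) shows the gap is in throughput, not in efficiency:

```python
# Compare throughput-per-watt of my reproduced run vs. the submitted result.
# All four numbers are quoted from the measurements above.
reproduced_qps, reproduced_watts = 284.539, 36.5
submitted_qps, submitted_watts = 394.33, 53.59

reproduced_eff = reproduced_qps / reproduced_watts
submitted_eff = submitted_qps / submitted_watts

print(f"reproduced: {reproduced_eff:.2f} QPS/W")  # 7.80 QPS/W
print(f"submitted:  {submitted_eff:.2f} QPS/W")   # 7.36 QPS/W
```

So my run is actually slightly *more* efficient, which suggests the submitted run used a higher-power operating point rather than a more efficient one.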
$ cat build/nvpmodel.temp.conf
< PARAM TYPE=FILE NAME=CPU_ONLINE >
CORE_0 /sys/devices/system/cpu/cpu0/online
CORE_1 /sys/devices/system/cpu/cpu1/online
CORE_2 /sys/devices/system/cpu/cpu2/online
CORE_3 /sys/devices/system/cpu/cpu3/online
CORE_4 /sys/devices/system/cpu/cpu4/online
CORE_5 /sys/devices/system/cpu/cpu5/online
CORE_6 /sys/devices/system/cpu/cpu6/online
CORE_7 /sys/devices/system/cpu/cpu7/online
CORE_8 /sys/devices/system/cpu/cpu8/online
CORE_9 /sys/devices/system/cpu/cpu9/online
CORE_10 /sys/devices/system/cpu/cpu10/online
CORE_11 /sys/devices/system/cpu/cpu11/online
< PARAM TYPE=FILE NAME=TPC_POWER_GATING >
TPC_PG_MASK /sys/devices/gpu.0/tpc_pg_mask
< PARAM TYPE=FILE NAME=GPU_POWER_CONTROL_ENABLE >
GPU_PWR_CNTL_EN /sys/devices/gpu.0/power/control
< PARAM TYPE=FILE NAME=GPU_POWER_CONTROL_DISABLE >
GPU_PWR_CNTL_DIS /sys/devices/gpu.0/power/control
< PARAM TYPE=CLOCK NAME=CPU_A78_0 >
FREQ_TABLE /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
MAX_FREQ /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
MIN_FREQ /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
FREQ_TABLE_KNEXT /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
MAX_FREQ_KNEXT /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
MIN_FREQ_KNEXT /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
< PARAM TYPE=CLOCK NAME=CPU_A78_1 >
FREQ_TABLE /sys/devices/system/cpu/cpu4/cpufreq/scaling_available_frequencies
MAX_FREQ /sys/devices/system/cpu/cpu4/cpufreq/scaling_max_freq
MIN_FREQ /sys/devices/system/cpu/cpu4/cpufreq/scaling_min_freq
FREQ_TABLE_KNEXT /sys/devices/system/cpu/cpu4/cpufreq/scaling_available_frequencies
MAX_FREQ_KNEXT /sys/devices/system/cpu/cpu4/cpufreq/scaling_max_freq
MIN_FREQ_KNEXT /sys/devices/system/cpu/cpu4/cpufreq/scaling_min_freq
< PARAM TYPE=CLOCK NAME=CPU_A78_2 >
FREQ_TABLE /sys/devices/system/cpu/cpu8/cpufreq/scaling_available_frequencies
MAX_FREQ /sys/devices/system/cpu/cpu8/cpufreq/scaling_max_freq
MIN_FREQ /sys/devices/system/cpu/cpu8/cpufreq/scaling_min_freq
FREQ_TABLE_KNEXT /sys/devices/system/cpu/cpu8/cpufreq/scaling_available_frequencies
MAX_FREQ_KNEXT /sys/devices/system/cpu/cpu8/cpufreq/scaling_max_freq
MIN_FREQ_KNEXT /sys/devices/system/cpu/cpu8/cpufreq/scaling_min_freq
< PARAM TYPE=CLOCK NAME=GPU >
FREQ_TABLE /sys/devices/17000000.ga10b/devfreq/17000000.ga10b/available_frequencies
MAX_FREQ /sys/devices/17000000.ga10b/devfreq/17000000.ga10b/max_freq
MIN_FREQ /sys/devices/17000000.ga10b/devfreq/17000000.ga10b/min_freq
FREQ_TABLE_KNEXT /sys/devices/17000000.ga10b/devfreq_dev/available_frequencies
MAX_FREQ_KNEXT /sys/devices/17000000.ga10b/devfreq_dev/max_freq
MIN_FREQ_KNEXT /sys/devices/17000000.ga10b/devfreq_dev/min_freq
< PARAM TYPE=CLOCK NAME=DLA0_CORE >
MAX_FREQ /sys/devices/platform/13e40000.host1x/15880000.nvdla0/acm/clk_cap/dla0_core
MAX_FREQ_KNEXT /sys/devices/platform/13e40000.host1x/15880000.nvdla0/acm/clk_cap/dla0_core
< PARAM TYPE=CLOCK NAME=DLA0_FALCON >
MAX_FREQ /sys/devices/platform/13e40000.host1x/15880000.nvdla0/acm/clk_cap/dla0_falcon
MAX_FREQ_KNEXT /sys/devices/platform/13e40000.host1x/15880000.nvdla0/acm/clk_cap/dla0_falcon
< PARAM TYPE=CLOCK NAME=DLA1_CORE >
MAX_FREQ /sys/devices/platform/13e40000.host1x/158c0000.nvdla1/acm/clk_cap/dla1_core
MAX_FREQ_KNEXT /sys/devices/platform/13e40000.host1x/158c0000.nvdla1/acm/clk_cap/dla1_core
< PARAM TYPE=CLOCK NAME=DLA1_FALCON >
MAX_FREQ /sys/devices/platform/13e40000.host1x/158c0000.nvdla1/acm/clk_cap/dla1_falcon
MAX_FREQ_KNEXT /sys/devices/platform/13e40000.host1x/158c0000.nvdla1/acm/clk_cap/dla1_falcon
< PARAM TYPE=CLOCK NAME=PVA0_VPS >
MAX_FREQ /sys/devices/platform/13e40000.host1x/16000000.pva0/acm/clk_cap/pva0_vps
MAX_FREQ_KNEXT /sys/devices/platform/13e40000.host1x/16000000.pva0/acm/clk_cap/pva0_vps
< PARAM TYPE=CLOCK NAME=PVA0_AXI >
MAX_FREQ /sys/devices/platform/13e40000.host1x/16000000.pva0/acm/clk_cap/pva0_cpu_axi
MAX_FREQ_KNEXT /sys/devices/platform/13e40000.host1x/16000000.pva0/acm/clk_cap/pva0_cpu_axi
< PARAM TYPE=FILE NAME=EMC >
MAX_FREQ /sys/kernel/nvpmodel_emc_cap/emc_iso_cap
< POWER_MODEL ID=0 NAME=MAX_Q >
CPU_ONLINE CORE_0 1
CPU_ONLINE CORE_1 1
CPU_ONLINE CORE_2 1
CPU_ONLINE CORE_3 1
CPU_ONLINE CORE_4 0
CPU_ONLINE CORE_5 0
CPU_ONLINE CORE_6 0
CPU_ONLINE CORE_7 0
CPU_ONLINE CORE_8 0
CPU_ONLINE CORE_9 0
CPU_ONLINE CORE_10 0
CPU_ONLINE CORE_11 0
TPC_POWER_GATING TPC_PG_MASK 0
GPU_POWER_CONTROL_ENABLE GPU_PWR_CNTL_EN on
CPU_A78_0 MIN_FREQ 1036800
CPU_A78_0 MAX_FREQ 1036800
CPU_A78_1 MIN_FREQ 1036800
CPU_A78_1 MAX_FREQ 1036800
CPU_A78_2 MIN_FREQ 1036800
CPU_A78_2 MAX_FREQ 1036800
GPU MIN_FREQ 726750000
GPU MAX_FREQ 726750000
GPU_POWER_CONTROL_DISABLE GPU_PWR_CNTL_DIS auto
DLA0_CORE MAX_FREQ 0
DLA1_CORE MAX_FREQ 0
DLA0_FALCON MAX_FREQ -1
DLA1_FALCON MAX_FREQ -1
PVA0_VPS MAX_FREQ -1
PVA0_AXI MAX_FREQ -1
EMC MAX_FREQ 2133000000
# mandatory section to configure the default power mode
< PM_CONFIG DEFAULT=0 >
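To compare this generated MAX_Q section against other nvpmodel configurations without eyeballing the file, the `POWER_MODEL` settings can be extracted programmatically. The parser below is my own illustrative sketch (it is not part of the nvpmodel tooling) and assumes the `<PARAM> <HANDLE> <VALUE>` line format seen above:

```python
# Minimal parser for POWER_MODEL sections of an nvpmodel .conf file.
# Illustrative sketch only -- not part of the nvpmodel tooling.
import re

def parse_power_model(conf_text, name):
    """Return {(PARAM, HANDLE): value} for the named POWER_MODEL section."""
    settings = {}
    in_section = False
    for line in conf_text.splitlines():
        line = line.strip()
        header = re.match(r'<\s*POWER_MODEL\s+ID=\d+\s+NAME=(\S+)\s*>', line)
        if header:
            in_section = (header.group(1) == name)
            continue
        if line.startswith('<'):   # any other section ends the POWER_MODEL
            in_section = False
            continue
        if in_section and line and not line.startswith('#'):
            param, handle, value = line.split()
            settings[(param, handle)] = value
    return settings

# Abbreviated excerpt of the MAX_Q section shown above:
conf = """
< POWER_MODEL ID=0 NAME=MAX_Q >
CPU_ONLINE CORE_0 1
CPU_ONLINE CORE_4 0
GPU MAX_FREQ 726750000
EMC MAX_FREQ 2133000000
< PM_CONFIG DEFAULT=0 >
"""
mq = parse_power_model(conf, "MAX_Q")
print(mq[("GPU", "MAX_FREQ")])  # 726750000
```

Diffing two such dictionaries makes it easy to spot which clocks or core counts differ between a reproduced and a submitted configuration.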
@nv-ananjappa
In summary, it looks like we have reproduced the performance expected under the submitted configuration (285 QPS):
@ConfigRegistry.register(HarnessType.Custom, AccuracyTarget.k_99, PowerSetting.MaxQ)
class Orin_MaxQ(Orin):
soc_cpu_freq = 1036800
soc_gpu_freq = 726750000
soc_dla_freq = 0
soc_emc_freq = 2133000000
orin_num_cores = 4
offline_expected_qps = 280
but it's NOT the same configuration that NVIDIA used for the submitted experiment (394 QPS).
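One easily missed detail when reading the `Orin_MaxQ` values: they inherit the kernel sysfs unit conventions, where cpufreq values are in kHz and devfreq (GPU) values are in Hz. A quick sanity check:

```python
# sysfs unit conventions: cpufreq exposes kHz, devfreq exposes Hz,
# so these two values from the Orin_MaxQ config are on very different scales.
soc_cpu_freq_khz = 1036800    # value of soc_cpu_freq above
soc_gpu_freq_hz = 726750000   # value of soc_gpu_freq above

print(soc_cpu_freq_khz * 1e3 / 1e9)  # 1.0368  -> CPU pinned at ~1.04 GHz
print(soc_gpu_freq_hz / 1e9)         # 0.72675 -> GPU pinned at ~727 MHz
```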
I've built two TensorRT engine plans for BERT Offline as follows:
(The files are practically the same size but differ slightly, perhaps due to embedded timestamps.)
I've validated that if I run either plan with:
the performance is up to 489 samples/second and the power drawn is around 78 W (as observed on the front panel of the attached WT310E power meter). (To validate this for the `*.MaxQ.plan` file in the same setting, I copied it over the `*.MaxP.plan` file and re-ran the experiment.)