mlcommons / inference_results_v0.5

This repository contains the results and code for the MLPerf™ Inference v0.5 benchmark.
https://mlcommons.org/en/inference-datacenter-05/
Apache License 2.0

T4 setting to achieve maximum performance #20

Open · vilmara opened this issue 4 years ago

vilmara commented 4 years ago

Hi @nvpohanh, could you please confirm whether any extra settings, beyond those shown below, are needed to get maximum performance on the T4 GPU with the MLPerf inference benchmarks?

nvpohanh commented 4 years ago

Yes, please let me know if you still don't get the reported performance numbers.

vilmara commented 4 years ago

Hi @nvpohanh, I am still getting performance issues, please see below:

on RTX6000: Mobilenet | Server scenario: ~43,594 img/sec (Nvidia has reported ~47,775-49,775 img/sec on RTX8000)

on T4: Resnet-50 | Server scenario: ~4,782 img/sec (Nvidia has reported ~5,193 img/sec)

nvpohanh commented 4 years ago

It is expected that the RTX6000/8000 is slightly slower than the Titan RTX. To track this down, could you share which clock frequency and power level the GPU stabilizes at during inference? You can monitor that by running nvidia-smi dmon -s pc concurrently with the harness.

As for the T4, it would be super helpful to know the GPU temperature. You can use the same nvidia-smi dmon -s pc command to monitor it.
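For reference, here is a minimal monitoring sketch as an alternative to dmon. It polls power, temperature, and clocks through nvidia-smi's --query-gpu interface; the chosen fields and the one-second interval are my own choices, not part of the NVIDIA harness:

```python
# monitor_gpu.py -- poll GPU power, temperature, and clocks while the harness runs.
# Minimal sketch; assumes nvidia-smi is on PATH and supports --query-gpu.
import subprocess
import time

FIELDS = "index,power.draw,temperature.gpu,clocks.sm,clocks.mem"

def sample():
    out = subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader,nounits"],
        text=True,
    )
    return out.strip().splitlines()

if __name__ == "__main__":
    # Print one CSV line per GPU every second; stop with Ctrl+C.
    print(FIELDS)
    try:
        while True:
            for line in sample():
                print(line)
            time.sleep(1)
    except KeyboardInterrupt:
        pass
```

Running this alongside the harness makes it easy to see whether the clocks or power level drop mid-run.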

vilmara commented 4 years ago

@nvpohanh, I will run the tests again; meanwhile, could you please provide the reference MLPerf inference performance numbers for the RTX6000?

nvpohanh commented 4 years ago

Unfortunately we didn't submit RTX6000 numbers for MLPerf Inference v0.5.

vilmara commented 4 years ago

Hi @nvpohanh, does the Nvidia MLPerf inference code scale automatically with the RTX8000/6000? I am running inference with 3 GPUs; however, it does not seem to be scaling linearly, see below:

on 1x RTX6000: Mobilenet | Server scenario: ~43,594 img/sec

on 3x RTX6000: Mobilenet | Server scenario: ~93,724.99 img/sec (expected: 43,594 x 3 ≈ 130,781)
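As a quick check of the scaling quoted above (values copied from this comment), the measured 3-GPU throughput works out to roughly 72% of linear scaling:

```python
# Scaling check for the numbers quoted in this thread.
single_gpu = 43_594          # img/sec, 1x RTX6000, Mobilenet Server
three_gpu  = 93_724.99       # img/sec, 3x RTX6000, Mobilenet Server

ideal = 3 * single_gpu                    # ~130,782 img/sec if scaling were linear
efficiency = three_gpu / ideal            # fraction of linear scaling achieved

print(f"ideal 3-GPU throughput : {ideal:,.0f} img/sec")
print(f"measured               : {three_gpu:,.2f} img/sec")
print(f"scaling efficiency     : {efficiency:.1%}")   # roughly 72%
```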

vilmara commented 4 years ago

Here is the requested information on the clock frequency and power level during inference, monitored with nvidia-smi dmon -s pc on the 3x RTX6000 run:

[Screenshot Capture_3xRTX6000: nvidia-smi dmon -s pc output during the 3x RTX6000 run]

The clocks were set up as shown below:

sudo nvidia-smi -ac 6501,1620
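For what it's worth, a small sketch (assuming nvidia-smi exposes the clocks.applications.* query fields, as recent drivers do) to confirm that the -ac setting actually took effect on all three GPUs:

```python
# check_clocks.py -- confirm the application clocks set with `nvidia-smi -ac` took effect.
# Sketch only; field names assume a recent nvidia-smi with --query-gpu support.
import subprocess

out = subprocess.check_output(
    ["nvidia-smi",
     "--query-gpu=index,clocks.applications.memory,clocks.applications.graphics",
     "--format=csv,noheader,nounits"],
    text=True,
)
for line in out.strip().splitlines():
    idx, mem_clk, gr_clk = (s.strip() for s in line.split(","))
    # Expect 6501 MHz memory / 1620 MHz graphics per the command above.
    print(f"GPU {idx}: application clocks = {mem_clk} MHz mem, {gr_clk} MHz graphics")
```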

vilmara commented 4 years ago

Hi @nvpohanh, following up on my multi-GPU scaling question above: does the Nvidia MLPerf inference code work with multiple GPUs on any system, or does it support multi-GPU only on the systems used for the v0.5 submission?

psyhtest commented 4 years ago

I'm running BERT experiments on an AWS G4 g4dn.4xlarge instance using a single T4. The supported clocks are a bit lower than in @vilmara's case:

$ sudo nvidia-smi -q -d SUPPORTED_CLOCKS                                     

==============NVSMI LOG==============

Timestamp                           : Tue Jun 16 16:19:30 2020
Driver Version                      : 440.33.01
CUDA Version                        : 10.2

Attached GPUs                       : 1
GPU 00000000:00:1E.0
    Supported Clocks
        Memory                      : 5001 MHz
            Graphics                : 1590 MHz
            Graphics                : 1575 MHz
...
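A small sketch (assuming nvidia-smi supports --query-supported-clocks, which it does on recent drivers) that picks the highest supported memory/graphics pair programmatically rather than reading the -q -d SUPPORTED_CLOCKS listing by hand:

```python
# supported_clocks.py -- report the highest supported memory/graphics clock pair.
# Sketch only; assumes nvidia-smi supports --query-supported-clocks.
import subprocess
from collections import defaultdict

out = subprocess.check_output(
    ["nvidia-smi", "--query-supported-clocks=memory,graphics",
     "--format=csv,noheader,nounits"],
    text=True,
)
pairs = defaultdict(list)
for line in out.strip().splitlines():
    mem, gr = (int(x) for x in line.split(","))
    pairs[mem].append(gr)

top_mem = max(pairs)
top_gr = max(pairs[top_mem])
# On this g4dn.4xlarge T4 the result should be 5001 MHz memory / 1590 MHz graphics.
print(f"max supported: nvidia-smi -ac {top_mem},{top_gr}")
```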

But at the maximum frequency the GPU is not really stable, dropping down to ~900 MHz in a SingleStream run:

$ sudo nvidia-smi -ac 5001,1590
...
================================================
MLPerf Results Summary
================================================
SUT name : PySUT
Scenario : Single Stream
Mode     : Performance
90th percentile latency (ns) : 98016803
Result is : VALID
  Min duration satisfied : Yes
  Min queries satisfied : Yes

================================================
Additional Stats
================================================
QPS w/ loadgen overhead         : 11.27
QPS w/o loadgen overhead        : 11.27

Min latency (ns)                : 81513330
Max latency (ns)                : 101580923
Mean latency (ns)               : 88718991
50.00 percentile latency (ns)   : 87841138
90.00 percentile latency (ns)   : 98016803
95.00 percentile latency (ns)   : 98051350
97.00 percentile latency (ns)   : 98064832
99.00 percentile latency (ns)   : 98093310
99.90 percentile latency (ns)   : 101015009

It's quite stable at ~800 MHz:

$ sudo nvidia-smi -ac 5001,795
$ python3 run.py --backend=pytorch --scenario=SingleStream
================================================
MLPerf Results Summary
================================================
SUT name : PySUT
Scenario : Single Stream
Mode     : Performance
90th percentile latency (ns) : 100668289
Result is : VALID
  Min duration satisfied : Yes
  Min queries satisfied : Yes

================================================
Additional Stats
================================================
QPS w/ loadgen overhead         : 10.13
QPS w/o loadgen overhead        : 10.13

Min latency (ns)                : 97998490
Max latency (ns)                : 103667665
Mean latency (ns)               : 98672758
50.00 percentile latency (ns)   : 98098138
90.00 percentile latency (ns)   : 100668289
95.00 percentile latency (ns)   : 101665011
97.00 percentile latency (ns)   : 102236103
99.00 percentile latency (ns)   : 102918025
99.90 percentile latency (ns)   : 103634703

but a bit faster at 900 MHz:

$ sudo nvidia-smi -ac 5001,900
$ python3 run.py --backend=pytorch --scenario=SingleStream
================================================
MLPerf Results Summary
================================================
SUT name : PySUT
Scenario : Single Stream
Mode     : Performance
90th percentile latency (ns) : 99973436
Result is : VALID
  Min duration satisfied : Yes
  Min queries satisfied : Yes

================================================
Additional Stats
================================================
QPS w/ loadgen overhead         : 10.57
QPS w/o loadgen overhead        : 10.57

Min latency (ns)                : 88573483
Max latency (ns)                : 109587347
Mean latency (ns)               : 94590988
50.00 percentile latency (ns)   : 94583175
90.00 percentile latency (ns)   : 99973436
95.00 percentile latency (ns)   : 101330092
97.00 percentile latency (ns)   : 102890999
99.00 percentile latency (ns)   : 104760879
99.90 percentile latency (ns)   : 107273294

(900 MHz is "a bit faster" than 800 MHz at the 90th percentile. At the 99th percentile, 900 MHz is actually a bit slower than 800 MHz.)
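As a sanity check on the three runs above: in the SingleStream scenario, LoadGen issues one query at a time, so the reported QPS should be roughly the reciprocal of the mean latency. The values below are copied from the summaries above, and the derived figures match the reported ones:

```python
# QPS vs. mean latency for the three runs above.
# SingleStream issues one query at a time, so QPS ~= 1e9 / mean latency in ns.
runs = {
    "set 1590 MHz": {"mean_latency_ns": 88_718_991, "reported_qps": 11.27},
    "set 795 MHz":  {"mean_latency_ns": 98_672_758, "reported_qps": 10.13},
    "set 900 MHz":  {"mean_latency_ns": 94_590_988, "reported_qps": 10.57},
}

for clock, r in runs.items():
    derived = 1e9 / r["mean_latency_ns"]
    print(f"{clock:>13}: reported {r['reported_qps']:.2f} QPS, "
          f"1e9 / mean latency = {derived:.2f} QPS")
```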

nvpohanh commented 4 years ago

I have the impression that the T4s on AWS have not-so-good cooling... Could you try nvidia-smi dmon -s pc to see what the GPU temperature is? If it reaches above 75°C, something needs to be improved.

psyhtest commented 4 years ago

@nvpohanh Yes, that's the case. I suspect this VM instance takes a slice of a larger machine. Perhaps the neighbours are maxing out their GPUs :).