The target QPS is 110, but when I used this parameter, most of the queries timed out, and the actual latency was four orders of magnitude higher than the target latency. Why is this?
Here is my result:
SUT name : LWIS_Server
Scenario : Server
Mode : PerformanceOnly
Scheduled samples per second : 110.25
Result is : INVALID
Performance constraints satisfied : NO
Min duration satisfied : Yes
Min queries satisfied : Yes
Recommendations:
* Reduce target QPS to improve latency.
================================================
Additional Stats
================================================
Completed samples per second : 23.95
Min latency (ns) : 11086820
Max latency (ns) : 8836518915455
Mean latency (ns) : 4472554615854
50.00 percentile latency (ns) : 4491175020756
90.00 percentile latency (ns) : 7986897806512
95.00 percentile latency (ns) : 8395938850230
97.00 percentile latency (ns) : 8573065539070
99.00 percentile latency (ns) : 8748750463855
99.90 percentile latency (ns) : 8828108821166
================================================
Test Parameters Used
================================================
samples_per_query : 1
target_qps : 110
target_latency (ns): 100000000
max_async_queries : 0
min_duration (ms): 600000
max_duration (ms): 0
min_query_count : 270336
max_query_count : 0
qsl_rng_seed : 7322528924094909334
sample_index_rng_seed : 1570999273408051088
schedule_rng_seed : 3507442325620259414
accuracy_log_rng_seed : 0
accuracy_log_probability : 0
accuracy_log_sampling_target : 0
print_timestamps : 0
performance_issue_unique : 0
performance_issue_same : 0
performance_issue_same_index : 0
performance_sample_count : 64
No warnings encountered during test.
No errors encountered during test.
Finished running actual test.
Device Device:0 processed:
10 batches of size 1
135163 batches of size 2
Memcpy Calls: 0
PerSampleCudaMemcpy Calls: 266236
BatchedCudaMemcpy Calls: 2055
&&&& PASSED Default_Harness # ./build/bin/harness_default
[2021-07-07 13:40:27,307 main.py:280 INFO] Result: result_scheduled_samples_per_sec: 110.252, Result is INVALID
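For reference, the "four orders of magnitude" claim can be checked directly from the numbers in the log above. This is just a minimal sanity-check sketch; the values are copied from the "Additional Stats" and "Test Parameters Used" sections:

```python
# Values copied from the LoadGen output above (nanoseconds).
target_latency_ns = 100_000_000          # target_latency: 100 ms Server constraint
mean_latency_ns = 4_472_554_615_854      # Mean latency from Additional Stats
p99_latency_ns = 8_748_750_463_855       # 99.00 percentile latency

# Convert to seconds for readability and compute the overshoot ratio.
print(f"target latency : {target_latency_ns / 1e9:.1f} s")
print(f"mean latency   : {mean_latency_ns / 1e9:.1f} s")
print(f"p99 latency    : {p99_latency_ns / 1e9:.1f} s")
print(f"mean / target  : {mean_latency_ns / target_latency_ns:.0f}x")
```

The mean latency works out to roughly 4473 s against a 0.1 s target, i.e. an overshoot in the 10^4–10^5 range, which is consistent with queries queuing up faster than the device can drain them rather than each inference individually being slow.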
Does hardware other than the GPU have a significant impact on the test results? If so, please let me know. Thanks.
I have run the NVIDIA MLPerf inference results code successfully, but I found that the performance of my server was much worse than the config parameters suggest. My GPU is a single Tesla T4. The config parameters in ./configs/ssd-resnet34/Server/config.json are as follows: https://github.com/mlcommons/inference_results_v1.0/blob/master/closed/NVIDIA/configs/ssd-resnet34/Server/config.json#L267