triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Uneven QPS leads to low throughput and high latency as well as low GPU utilization #7318

Open · SunnyGhj opened 3 months ago

SunnyGhj commented 3 months ago

Description: We found that the performance of Triton + TensorRT is very different under stable QPS versus uneven QPS (see the attached latency screenshots).

As shown in the screenshots, there is a significant difference in the 99th-percentile latency between the uneven-QPS and stable-QPS scenarios. I want to know what causes this.

Triton Information: nvcr.io/nvidia/tritonserver:23.06-py3

Are you using the Triton container or did you build it yourself? We are using the Triton container nvcr.io/nvidia/tritonserver:23.06-py3.
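
For reference, the uneven-QPS pattern can be reproduced with a pacing loop like the sketch below (a simplified illustration, not the exact load generator used here; send_request is a hypothetical stand-in for issuing one inference call, e.g. an async request through tritonclient):

# Sketch only: contrasts stable vs. uneven request pacing at the same mean rate.
# Stable QPS uses fixed inter-arrival gaps; uneven QPS draws exponential
# (Poisson-process) gaps, producing bursts followed by idle periods.
import random
import time

def drive_load(send_request, qps=30, seconds=60, uneven=False):
    deadline = time.time() + seconds
    while time.time() < deadline:
        send_request()  # hypothetical: fire one (async) inference request
        gap = random.expovariate(qps) if uneven else 1.0 / qps
        time.sleep(gap)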

SunnyGhj commented 3 months ago

@tanmayv25 @Tabrizian

tanmayv25 commented 3 months ago

@SunnyGhj Are you using dynamic batching?

From what you have shown, the main component of the latency under uneven QPS seems to be the queue time. It may be that requests are waiting longer in the queue to be batched before being executed on the model.

The uneven QPS seems to be much lower in magnitude than the stable QPS, which makes me believe that this is not an issue with the instance count.

SunnyGhj commented 3 months ago

> @SunnyGhj Are you using dynamic batching? […]

Yes, we are using dynamic batching. The config.pbtxt is as follows:

platform: "tensorrt_plan"
version_policy: { latest: { num_versions: 2}}
parameters: { key: "execution_mode" value: { string_value: "0" } }
parameters: { key: "intra_op_thread_count" value: { string_value: "4" } }
parameters: { key: "inter_op_thread_count" value: { string_value: "4" } }
max_batch_size: 100
instance_group [ { count: 4 }]
input [
  {
      name: "token_type_ids"
      data_type: TYPE_INT32
      dims: [128]
  },
  {
      name: "attention_mask"
      data_type: TYPE_INT32
      dims: [128]
  },
  {
      name: "input_ids"
      data_type: TYPE_INT32
      dims: [128]
  }
]

output [
  {
      name: "logits"
      data_type: TYPE_FP32
      dims: [5]
  }
]

optimization {
  graph: {level: 1}
  cuda: {
     graphs: false
     busy_wait_events: true
     output_copy_stream: false
     graph_spec: [
      {
        batch_size: 25,
        input: { key: "token_type_ids", value: { dim: [128] } },
        input: { key: "attention_mask", value: { dim: [128] } },
        input: { key: "input_ids", value: { dim: [128] } }
      },
      {
        batch_size: 50,
        input: { key: "token_type_ids", value: { dim: [128] } },
        input: { key: "attention_mask", value: { dim: [128] } },
        input: { key: "input_ids", value: { dim: [128] } }
      },
      {
        batch_size: 100,
        input: { key: "token_type_ids", value: { dim: [128] } },
        input: { key: "attention_mask", value: { dim: [128] } },
        input: { key: "input_ids", value: { dim: [128] } }
      }
    ]
  }
  eager_batching: true
}
dynamic_batching {
  preferred_batch_size: [ 25, 50, 100 ]
  max_queue_delay_microseconds: 10000
}

I don't think it's a problem with dynamic batching, because we configured multiple preferred_batch_size values and max_queue_delay_microseconds. It's also true that it's not a matter of instance count: we increased the instance count, but it didn't help.

tanmayv25 commented 3 months ago

> dynamic_batching { preferred_batch_size: [ 25, 50, 100 ] max_queue_delay_microseconds: 10000 }

The above specification might be exactly the culprit and why we see large queue times. Given that the QPS varies erratically between 25 and 40, there is a high chance that requests are blocked in the queue for up to 10 ms until they are scheduled for execution. These fields should only be specified when the inference runtime at batch sizes 25, 50 and 100 is significantly smaller than at the other batch sizes.

The below setting could be best for your case:

dynamic_batching { }

statiraju commented 3 months ago

@SunnyGhj did the suggestion help?

SunnyGhj commented 3 months ago

> The below setting could be best for your case: dynamic_batching { }

I modified config.pbtxt as below:

platform: "tensorrt_plan"
version_policy: { latest: { num_versions: 2}}
parameters: { key: "execution_mode" value: { string_value: "0" } }
parameters: { key: "intra_op_thread_count" value: { string_value: "4" } }
parameters: { key: "inter_op_thread_count" value: { string_value: "4" } }
max_batch_size: 25
instance_group [ { count: 3 }]
input [
  {
      name: "token_type_ids"
      data_type: TYPE_INT32
      dims: [128]
  },
  {
      name: "attention_mask"
      data_type: TYPE_INT32
      dims: [128]
  },
  {
      name: "input_ids"
      data_type: TYPE_INT32
      dims: [128]
  }
]

output [
  {
      name: "logits"
      data_type: TYPE_FP32
      dims: [5]
  }
]

optimization {
  graph: {level: 1}
  cuda: {
     graphs: true
     busy_wait_events: true
     output_copy_stream: false
     graph_spec: [
      {
        batch_size: 25,
        input: { key: "token_type_ids", value: { dim: [128] } },
        input: { key: "attention_mask", value: { dim: [128] } },
        input: { key: "input_ids", value: { dim: [128] } }
      }
    ]
  }
  eager_batching: true
}
dynamic_batching {
}

Doesn't seem to work.

SunnyGhj commented 3 months ago

> @SunnyGhj did the suggestion help?

No, it didn't solve the problem.

SunnyGhj commented 3 months ago

Are there any other suggestions? @tanmayv25 @statiraju

tanmayv25 commented 3 months ago

@SunnyGhj Can you try the following config and share your findings?

platform: "tensorrt_plan"
version_policy: { latest: { num_versions: 2}}
max_batch_size: 100
instance_group [ { count: 100 }]
input [
  {
      name: "token_type_ids"
      data_type: TYPE_INT32
      dims: [128]
  },
  {
      name: "attention_mask"
      data_type: TYPE_INT32
      dims: [128]
  },
  {
      name: "input_ids"
      data_type: TYPE_INT32
      dims: [128]
  }
]

output [
  {
      name: "logits"
      data_type: TYPE_FP32
      dims: [5]
  }
]

optimization {
  graph: {level: 1}
  cuda: {
     graphs: true
     busy_wait_events: true
     output_copy_stream: false
     graph_spec: [
      {
        batch_size: 25,
        input: { key: "token_type_ids", value: { dim: [128] } },
        input: { key: "attention_mask", value: { dim: [128] } },
        input: { key: "input_ids", value: { dim: [128] } }
      }
    ]
  }
}
dynamic_batching {
}

SunnyGhj commented 3 months ago

> @SunnyGhj Can you try the following config and share your findings? […]

Sure. However, due to GPU memory limitations, we can support a maximum of 40 instances. The config is as follows:

platform: "tensorrt_plan"
version_policy: { latest: { num_versions: 2}}
max_batch_size: 100
instance_group [ { count: 40 }]
input [
  {
      name: "token_type_ids"
      data_type: TYPE_INT32
      dims: [128]
  },
  {
      name: "attention_mask"
      data_type: TYPE_INT32
      dims: [128]
  },
  {
      name: "input_ids"
      data_type: TYPE_INT32
      dims: [128]
  }
]

output [
  {
      name: "logits"
      data_type: TYPE_FP32
      dims: [5]
  }
]

optimization {
  graph: {level: 1}
  cuda: {
     graphs: true
     busy_wait_events: true
     output_copy_stream: false
     graph_spec: [
      {
        batch_size: 25,
        input: { key: "token_type_ids", value: { dim: [128] } },
        input: { key: "attention_mask", value: { dim: [128] } },
        input: { key: "input_ids", value: { dim: [128] } }
      }
    ]
  }
}
dynamic_batching {
}

The stress test results are as follows:

  1. QPS (screenshot)
  2. nv_inference_request_summary_us, nv_inference_queue_summary_us, nv_inference_compute_infer_summary_us (screenshot)
  3. nv_gpu_utilization (screenshot)

I noticed that (1) when QPS is less than the instance count, the queue time is short, and (2) the output time becomes very long.
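
For reference, these per-stage averages can also be approximated from Triton's Prometheus endpoint. The sketch below (assuming the default metrics port 8002 and a placeholder model name) divides the cumulative nv_inference_*_duration_us counters by nv_inference_count:

# Sketch: estimate average queue / compute time per inference from Triton's
# metrics endpoint (default http://<host>:8002/metrics). The counters are
# cumulative microseconds, so dividing by nv_inference_count gives a rough
# per-inference average since server start.
import re
import urllib.request

METRICS_URL = "http://localhost:8002/metrics"  # assumed default metrics port
MODEL = "my_model"                             # placeholder model name

totals = {}
body = urllib.request.urlopen(METRICS_URL).read().decode("utf-8")
for line in body.splitlines():
    if line.startswith("#") or f'model="{MODEL}"' not in line:
        continue
    m = re.match(r"^(nv_inference_\w+)\{[^}]*\}\s+([0-9.eE+-]+)$", line)
    if m:
        # sum across model versions / GPUs reporting the same metric
        totals[m.group(1)] = totals.get(m.group(1), 0.0) + float(m.group(2))

count = totals.get("nv_inference_count", 0.0) or 1.0
for stage in ("queue", "compute_input", "compute_infer", "compute_output"):
    total_us = totals.get(f"nv_inference_{stage}_duration_us", 0.0)
    print(f"avg {stage}: {total_us / count:.1f} us/inference")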

SunnyGhj commented 3 months ago

In the NVTX Summary (screenshot), I noticed that ProcessResponse takes a long time, which seems unreasonable.

SunnyGhj commented 3 months ago

@tanmayv25 Sorry to bother you, but are there any further suggestions?

tanmayv25 commented 3 months ago

ProcessResponse waits for the TRT inference run to complete and produce output results. See here. So basically, for an instance count of 40 we will have a single issue thread that picks requests from the queue and sends them to the TRT engine for inference. It will make sure that at any given time there are at most 40 concurrent requests on the TRT engine.

We will also have 40 ProcessResponse threads that are responsible for retrieving the results of the in-flight inference requests and sending them to the client. Everything builds on each other. Assuming low QPS (less than the instance count), there will be a number of ProcessResponse threads waiting for the request they should snoop on, which can appear as a large ProcessResponse time; they are basically just waiting at this point. Hence, I don't find it alarming.
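
As a rough illustration (a toy model only, not Triton's actual implementation), the scheduling described above could be pictured as follows: one issue loop caps in-flight executions at instance_count, and a per-request response worker just waits for its execution to finish, which shows up as ProcessResponse time under light load.

# Toy model of the described scheduling, not real Triton code.
import queue
import threading
import time

INSTANCE_COUNT = 40
in_flight = threading.Semaphore(INSTANCE_COUNT)   # at most 40 concurrent engine runs
pending = queue.Queue()

def issue_thread():
    while True:
        req = pending.get()
        in_flight.acquire()        # queue time stays ~0 until all 40 slots are busy
        done = threading.Event()
        threading.Thread(target=trt_execute, args=(done,)).start()
        threading.Thread(target=process_response, args=(req, done)).start()

def trt_execute(done):
    time.sleep(0.02)               # stand-in for the engine's inference latency
    done.set()

def process_response(req, done):
    done.wait()                    # this wait appears as "ProcessResponse" time
    in_flight.release()            # free the slot for the next request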

> (1) when QPS is less than the instance count, the queue time is short

This conforms with my expectations. There is no reason for a request to be blocked in the queue, because the issue thread will pick the request from the queue right away and send it to the TRT execution engine.

When the QPS is high, that is when the issue thread's logic of restricting execution concurrency to at most instance_count kicks in.

However, even with this restriction in place, I would still expect consistent GPU utilization for an uneven QPS that keeps a consistent load and requests available for execution.

Can you share the inference_statistics output after you run your tests? I am mostly interested in the batch_stats, which would show the diversity in batch sizes and the execution time for each.
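
For reference, one way to retrieve those statistics is via the tritonclient Python package (a minimal sketch assuming the default HTTP port and a placeholder model name):

# Sketch: dump per-model inference statistics, including batch_stats, after a run.
import json

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")   # assumed default HTTP port
stats = client.get_inference_statistics(model_name="my_model")    # placeholder model name
# model_stats[..]["batch_stats"] lists, per batch size, the execution count and compute times
print(json.dumps(stats, indent=2))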

> (2) the output time becomes very long

Can you elaborate on this? Additionally, do you still observe low throughput and high latency when specifying the dynamic batching as:

dynamic_batching {
}

and a higher instance count? Did it help in any way, by bringing down the average latency or improving the throughput?

I would suggest trying model_analyzer to find the most appropriate value of instance_count.