triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Out of order on processing sequence #3406

Closed: warlock135 closed this issue 3 years ago

warlock135 commented 3 years ago

Description
I am trying to develop a Kaldi ASR backend/client for the new Triton version (21.07), based on the old one (here).

Sometimes, requests fail with the following error:

error: Inference failed: inference request for sequence 1983 to model 'kaldi_online' must specify the START flag on the first request of the sequence

Turning on the server's verbose log, I found that the request with the START flag was processed AFTER the one with the END flag (in the excerpt below, the END request with flags: 0x2 is prepared first, and the START request with flags: 0x1 only arrives about 1.5 seconds later):

I0927 02:36:20.422575 24 infer_request.cc:524] prepared: [0x0x7f66cf3fda90] request id: , model: kaldi_online, requested version: -1, actual version: 1, flags: 0x2, correlation id: 1983, batch size: 1, priority: 0, timeout (us): 0

I0927 02:36:20.422601 24 grpc_server.cc:3257] Infer failed: inference request for sequence 1983 to model 'kaldi_online' must specify the START flag on the first request of the sequence

I0927 02:36:21.946447 24 infer_request.cc:524] prepared: [0x0x7f661bdfc710] request id: , model: kaldi_online, requested version: -1, actual version: 1, flags: 0x1, correlation id: 1983, batch size: 1, priority: 0, timeout (us): 0

Client requests are sent in the right order:

  // Tag the chunk with its correlation ID and sequence boundary flags.
  options.sequence_id_ = corr_id;
  if (start_of_sequence)
    options.sequence_start_ = true;
  if (end_of_sequence) {
    options.sequence_end_ = true;
  }

  double start = gettime_monotonic();
  if (start_of_sequence) {
    std::cout << std::setprecision(15) << "At: " << start / 1000
              << " ** Context: " << context.get()
              << " ## Send first chunk for corr_id: " << corr_id << std::endl;
  }
  if (end_of_sequence) {
    std::cout << std::setprecision(15) << "At: " << start / 1000
              << " ** Context: " << context.get()
              << " ## Send last chunk for corr_id: " << corr_id << std::endl;
  }

  // Send the chunk asynchronously; the result is handled in the callback.
  FAIL_IF_ERR(
      context->AsyncInfer(
          [corr_id, end_of_sequence, start, file_name,
           this](tc::InferResult* result) {
            // Callback
          },
          options, inputs, outputs),
      "unable to run model");

Log printed from code above:

At: 5700.15383176291 ** Context: 0x565363d29810 ## Send first chunk for corr_id: 1983
At: 5700.15392759459 ** Context: 0x565363d29810 ## Send last chunk for corr_id: 1983

Triton Information
What version of Triton are you using? r21.07

Are you using the Triton container or did you build it yourself? container

To Reproduce
Steps to reproduce the behavior.

Triton server launch command:

nvidia-docker run --rm -it \
   --shm-size=1g \
   --ulimit memlock=-1 \
   --ulimit stack=67108864 \
   -p8000:8000 \
   -p8001:8001 \
   -p8002:8002 \
   --name trt_server_asr \
   -e NVIDIA_VISIBLE_DEVICES=$NV_VISIBLE_DEVICES \
   -v $PWD/data:/data \
   -v $PWD/model-repo:/mnt/model-repo \
   triton_kaldi_server:21.07 tritonserver --model-repo=/workspace/model-repo/ --model-control-mode=poll --repository-poll-secs=10  --log-verbose 1

Expected behavior
Requests are processed in the order they were sent.

GuanLuo commented 3 years ago

Can you try using the gRPC client's streaming API to send the requests? Otherwise you need to make sure the requests arrive at the server in order.
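
For reference, here is a minimal sketch of what that could look like with the C++ gRPC client already used in this thread. It reuses the reporter's names (context, options, inputs, outputs, FAIL_IF_ERR) as assumptions, and the callback body is a placeholder rather than the actual client logic:

  // Open the bidirectional stream once per client, e.g. right after
  // InferenceServerGrpcClient::Create(). All results for requests sent on
  // this client are delivered to this single callback.
  FAIL_IF_ERR(
      context->StartStream(
          [](tc::InferResult* result) {
            // Handle the result (what previously lived in the AsyncInfer callback).
          }),
      "unable to start stream");

  // Per chunk: keep setting sequence_id_ / sequence_start_ / sequence_end_ as
  // before, but send with AsyncStreamInfer instead of AsyncInfer. Requests
  // issued on one stream reach the server in the order they are sent.
  FAIL_IF_ERR(
      context->AsyncStreamInfer(options, inputs, outputs),
      "unable to run model");

  // When the client shuts down:
  FAIL_IF_ERR(context->StopStream(), "unable to stop stream");

With plain AsyncInfer, each request is an independent gRPC call, so its arrival order at the server is not guaranteed; the stream preserves the send order, which is why it is the recommended way to drive sequence models.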

warlock135 commented 3 years ago

Sorry for the late reply; I was out of the office last week. My client uses the gRPC API. Here is the code fragment used to create/get the gRPC client:

std::vector<std::unique_ptr<tc::InferenceServerGrpcClient>> contextes_;

  contextes_.emplace_back();
  std::unique_ptr<tc::InferenceServerGrpcClient>& client = contextes_.back();
  FAIL_IF_ERR(
      tc::InferenceServerGrpcClient::Create(&(client), url_, false),
      "unable to create grpc client");
  std::unique_ptr<tc::InferenceServerGrpcClient>& context =
      contextes_.at(corr_id % ncontextes_);

tanmayv25 commented 3 years ago

@warlock135 How are you sending the inference request? Are you using AsyncStreamInfer as Guan suggested? The example is here.

warlock135 commented 3 years ago

@tanmayv25 After moving from AsyncInfer to AsyncStreamInfer, the code works well. Thank you for the support. I will close the issue.