triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

about ensemble model with multi GPUs #6981

Open lzcchl opened 4 months ago

lzcchl commented 4 months ago

Triton container: nvcr.io/nvidia/tritonserver:23.12-py3

I have 4 NVIDIA RTX 4090 graphics cards, and my model is an ensemble that can be understood as preprocessing + inference. The gpus parameter is not set in the instance_group in config.pbtxt, so by default each model gets an instance on every GPU. When I start the service and check with nvidia-smi, every GPU is active.

However, a problem occurs at inference time: GPU0 runs the preprocessing part of the ensemble, and its output is then consumed by the inference part running on GPU1, which causes an error. The backend cannot know which GPU the data comes from. How should I solve this?

The preprocessing step is a C++ backend I wrote myself; the inference step uses https://github.com/triton-inference-server/pytorch_backend.

If I set gpus: [0] in config.pbtxt, the code runs very well, but with multiple GPUs it fails.

lzcchl commented 4 months ago

After some testing, I found that my code only executes successfully on GPU0 and fails on GPU1/2/3, even for a single model (such as rgb2bgr) rather than an ensemble. What is the reason for the failure? Here are my backend code, config.pbtxt, and client.py. Can you tell me how to modify the code?

C++ backend code

uint64_t exec_start_ns = 0;
SET_TIMESTAMP(exec_start_ns);

ModelInstanceState* instance_state;
RETURN_IF_ERROR(TRITONBACKEND_ModelInstanceState(
    instance, reinterpret_cast<void**>(&instance_state)));
ModelState* model_state = instance_state->StateForModel();

LOG_MESSAGE(TRITONSERVER_LOG_INFO, std::string("model " + 
  model_state->Name() + ": start").c_str());

// int kind = instance_state->Kind();
// auto name = instance_state->Name();
int deviceId = instance_state->DeviceId();
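// deviceId is the GPU on which Triton placed this model instance (from instance_group);
// any CUDA work done below (allocations, kernel launches) should explicitly target this device.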

std::vector<TRITONBACKEND_Response*> responses;
responses.reserve(request_count);
for (uint32_t r = 0; r < request_count; ++r) {
  TRITONBACKEND_Request* request = requests[r];
  TRITONBACKEND_Response* response;
  RETURN_IF_ERROR(TRITONBACKEND_ResponseNew(&response, request));
  responses.push_back(response);
}

BackendInputCollector collector(
    requests, request_count, &responses, model_state->TritonMemoryManager(),
    model_state->EnablePinnedInput() /* pinned_enabled */, instance_state->CudaStream() /* stream*/);

std::vector<std::pair<TRITONSERVER_MemoryType, int64_t>> allowed_input_types =
    {{TRITONSERVER_MEMORY_GPU, deviceId}};

const char* input_buffer;
size_t input_buffer_byte_size;
TRITONSERVER_MemoryType input_buffer_memory_type;
int64_t input_buffer_memory_type_id;  // device id

RESPOND_ALL_AND_SET_NULL_IF_ERROR(
    responses, request_count,
    collector.ProcessTensor(
        model_state->InputTensorName().c_str(), nullptr /* existing_buffer */,
        0 /* existing_buffer_byte_size */, allowed_input_types, &input_buffer,
        &input_buffer_byte_size, &input_buffer_memory_type,
        &input_buffer_memory_type_id));

const bool need_cuda_input_sync = collector.Finalize();
if (need_cuda_input_sync) {
  cudaStreamSynchronize(instance_state->CudaStream());
  // LOG_MESSAGE(
  //     TRITONSERVER_LOG_ERROR,
  //     "'recommended' backend: unexpected CUDA sync required by collector");
}

uint64_t compute_start_ns = 0;
SET_TIMESTAMP(compute_start_ns);

LOG_MESSAGE(
    TRITONSERVER_LOG_INFO,
    (std::string("model ") + model_state->Name() + ": requests in batch " +
    std::to_string(request_count))
        .c_str());

bool supports_first_dim_batching;
RESPOND_ALL_AND_SET_NULL_IF_ERROR(
    responses, request_count,
    model_state->SupportsFirstDimBatching(&supports_first_dim_batching));

size_t total_batch_size = 0;
if (!supports_first_dim_batching) {
  total_batch_size = request_count;
} 
else {
  for (uint32_t r = 0; r < request_count; ++r) {
    auto& request = requests[r];
    TRITONBACKEND_Input* input = nullptr;
    LOG_IF_ERROR(
        TRITONBACKEND_RequestInputByIndex(request, 0 /* index */, &input),
        "failed getting request input");
    if (input != nullptr) {
      const int64_t* shape = nullptr;
      LOG_IF_ERROR(
          TRITONBACKEND_InputProperties(
              input, nullptr, nullptr, &shape, nullptr, nullptr, nullptr),
          "failed getting input properties");
      if (shape != nullptr) {
        total_batch_size += shape[0];
      }
    }
  }
}
// std::cout << "total_batch_size: " << total_batch_size << std::endl;

// do preprocessing here; gpu_ptr_after_preprocess is the device pointer produced by that step
// (the actual kernel launch is elided in this post; see the sketch after this code block)

const char* output_buffer = (const char*)gpu_ptr_after_preprocess;
TRITONSERVER_MemoryType output_buffer_memory_type = input_buffer_memory_type;
int64_t output_buffer_memory_type_id = input_buffer_memory_type_id;

uint64_t compute_end_ns = 0;
SET_TIMESTAMP(compute_end_ns);

std::vector<int64_t> tensor_shape;
RESPOND_ALL_AND_SET_NULL_IF_ERROR(
    responses, request_count, model_state->TensorShape(tensor_shape));

BackendOutputResponder responder(
    requests, request_count, &responses, model_state->TritonMemoryManager(),
    supports_first_dim_batching, model_state->EnablePinnedOutput() /* pinned_enabled */,
    instance_state->CudaStream() /* stream*/);

responder.ProcessTensor(
    model_state->OutputTensorName().c_str(), model_state->TensorDataType(),
    tensor_shape, output_buffer, output_buffer_memory_type,
    output_buffer_memory_type_id);

const bool need_cuda_output_sync = responder.Finalize();
if (need_cuda_output_sync) {
  cudaStreamSynchronize(instance_state->CudaStream());
  // LOG_MESSAGE(
  //     TRITONSERVER_LOG_ERROR,
  //     "'recommended' backend: unexpected CUDA sync required by responder");
}

for (auto& response : responses) {
  if (response != nullptr) {
    LOG_IF_ERROR(
        TRITONBACKEND_ResponseSend(
            response, TRITONSERVER_RESPONSE_COMPLETE_FINAL, nullptr),
        "failed to send response");
  }
}

uint64_t exec_end_ns = 0;
SET_TIMESTAMP(exec_end_ns);
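
For illustration only, the preprocessing step elided above might look roughly like the sketch below. The kernel name rgb_to_bgr, the scratch allocation, and the launch parameters are assumptions rather than part of the original backend; the important detail is selecting the instance's device before doing any CUDA work, which is where the bug discussed later in this thread turned out to be.

__global__ void rgb_to_bgr(const uint8_t* in, uint8_t* out, size_t pixel_count)
{
  size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  if (i < pixel_count) {
    // swap the R and B channels of one HWC pixel
    out[i * 3 + 0] = in[i * 3 + 2];
    out[i * 3 + 1] = in[i * 3 + 1];
    out[i * 3 + 2] = in[i * 3 + 0];
  }
}

// ... inside the execute function, after the collector has produced input_buffer ...
cudaSetDevice(deviceId);  // run on the GPU assigned to this instance, not the thread's default device
uint8_t* gpu_ptr_after_preprocess = nullptr;
cudaMalloc(&gpu_ptr_after_preprocess, input_buffer_byte_size);  // freeing/reuse omitted for brevity
const size_t pixel_count = input_buffer_byte_size / 3;
const unsigned int threads = 256;
const unsigned int blocks = (unsigned int)((pixel_count + threads - 1) / threads);
rgb_to_bgr<<<blocks, threads, 0, instance_state->CudaStream()>>>(
    reinterpret_cast<const uint8_t*>(input_buffer), gpu_ptr_after_preprocess, pixel_count);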

========================================================================

and config.pbtxt

  backend: "cudargb2bgr"
  max_batch_size: 32
  input [
    {
      name: "input_tensors"
      data_type: TYPE_UINT8
      dims: [640, 640, 3]
    }
  ]

  output [
    {
      name: "output_tensors"
      data_type: TYPE_UINT8
      dims: [640, 640, 3]
    }
  ]

  instance_group [
    {
      count: 2
      kind: KIND_GPU
      gpus: [1]
    }
  ]

  dynamic_batching {
    preferred_batch_size: [2, 4, 8, 16, 32]
    max_queue_delay_microseconds: 100000
  }

  model_warmup [
    {
      batch_size: 1
      name: "warmup_requests"
      inputs {
        key: "input_tensors"
        value: {
          random_data: true
          dims: [640, 640, 3]
          data_type: TYPE_UINT8
        }
      }
    }
  ]

========================================================================

and client.py

import argparse
import sys, os

import numpy as np
import tritonclient.http as httpclient
import cv2

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-u",
        "--url",
        type=str,
        required=False,
        default="localhost:8000",
        help="Inference server URL. Default is localhost:8000.",
    )
    FLAGS = parser.parse_args()

    mat_comp_list = []
    mat_input_list = []
    imgs_dir = r'/home/lzc/work/Data/imgs/dogCat'
    for file_name in os.listdir(imgs_dir):
        if file_name.endswith('.jpg'):
            file_path = os.path.join(imgs_dir, file_name)
            mat = cv2.imread(file_path)
            mat_resz = cv2.resize(mat, [640, 640])
            mat_input_list.append(mat_resz)

            #cv2.imshow(file_name, mat_resz)
            mat_resz_rgb = cv2.cvtColor(mat_resz, cv2.COLOR_BGR2RGB)
            mat_comp_list.append(mat_resz_rgb)
            # cv2.imshow(file_name + '0', mat_resz_rgb)

    try:
        concurrent_request_count = 8
        triton_client = httpclient.InferenceServerClient(
            url=FLAGS.url, concurrency=concurrent_request_count
        )
    except Exception as e:
        print("channel creation failed: " + str(e))
        sys.exit(1)

    print("\n=========")
    async_requests = []

    # '''
    for mat_input in mat_input_list:
        mat_input_expand = np.expand_dims(mat_input, axis=0)
        inputs = [httpclient.InferInput("input_tensors", [1, 640, 640, 3], "UINT8")]
        inputs[0].set_data_from_numpy(mat_input_expand)
        async_requests.append(triton_client.async_infer("rgb2bgr_640", inputs))
    # '''

    idx = 0
    for async_request in async_requests:
        # Get the result from the initiated asynchronous inference
        # request. This call will block till the server responds.
        result = async_request.get_result()
        print("Response: {}".format(result.get_response()))
        # print("OUTPUT = {}".format(result.as_numpy("output_tensors")))
        batch_tensors = result.as_numpy("output_tensors")
        for i in range(batch_tensors.shape[0]):
            tensors_one = batch_tensors[i]

            equal = np.array_equal(tensors_one, mat_comp_list[idx])
            print('equal: {0}'.format(equal))

            #cv2.imshow(str(idx), tensors_one)
            idx = idx + 1
    #cv2.waitKey(0)
    print('pause')

========================================================================

lkomali commented 4 months ago

cc: @GuanLuo Any thoughts?

lzcchl commented 4 months ago

After some debugging, I found that the input_buffer obtained from collector.ProcessTensor no longer contains my input data. I am puzzled as to why input_buffer is correct on GPU0 while GPU1/2/3 get incorrect data.

Furthermore, how can I obtain the correct input_buffer when using GPU1/2/3?

lzcchl commented 4 months ago

I'm really sorry, I found one of my bugs. In my preprocessing there is code like cudaGetDevice(&dev_); and cudaSetDevice(dev_);, and the value of dev_ is always 0, which makes my CUDA kernel always run on GPU0. After changing that code, my pipeline works well on GPU0/1/2/3. (A minimal sketch of the change is below.)
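
A minimal sketch of the change described above, reusing the instance_state->DeviceId() call already shown in the backend code; this illustrates the fix as described, not the exact code:

// Before: cudaGetDevice() returns the calling thread's current device, which stays at 0
// unless cudaSetDevice() has already been called, so every kernel landed on GPU0.
int dev_ = 0;
cudaGetDevice(&dev_);
cudaSetDevice(dev_);

// After: pin CUDA work to the GPU that Triton assigned to this model instance.
cudaSetDevice(instance_state->DeviceId());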

lzcchl commented 4 months ago

However, the problem of using multiple GPUs together still exists. For example, my preprocessing is rgb2bgr + nhwc2nchw. When I set the same instance_group gpus for both in config.pbtxt, rgb2bgr + nhwc2nchw runs well, but when I set instance_group gpus: [0] for rgb2bgr and gpus: [1] for nhwc2nchw, it works badly. It seems that moving data between different devices is the problem; I will debug more...

GuanLuo commented 4 months ago

Moving data across devices does introduce larger overhead, so I would suggest keeping your pipeline on the same GPU to avoid the cost of moving data across devices.

lzcchl commented 4 months ago

@GuanLuo , thanks for reply!

I know it adds overhead, but I have 4 GPUs; what should I do to maximize the use of my hardware?

In my opinion, an ensemble model pipeline should run on the same device, but in practice it seems Triton Server does not guarantee that?

Is there any parameter in config.pbtxt that controls an ensemble model so that it runs on the same device?

I think this should be supported by Triton. For example, with instance_group gpus: [0,1,2,3] set in config.pbtxt, the ensemble pipeline should run as preprocessing(GPU0)--->inference(GPU0), preprocessing(GPU1)--->inference(GPU1), preprocessing(GPU2)--->inference(GPU2), preprocessing(GPU3)--->inference(GPU3). In reality, combinations such as preprocessing(GPU0)--->inference(GPU1), 0 to 2, 0 to 3, 1 to 0, 1 to 2, and so on also occur (a sketch of the intended layout is at the end of this comment).

Can you give me some advice for my future work?
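
For illustration, the layout described above corresponds roughly to the following instance_group in each composing model's config.pbtxt. This is a sketch only; whether the ensemble scheduler then keeps each request's preprocessing and inference on the same GPU is exactly the open question in this issue.

  # in both the preprocessing and the inference model's config.pbtxt (sketch)
  instance_group [
    {
      count: 1
      kind: KIND_GPU
      gpus: [0, 1, 2, 3]
    }
  ]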