triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Question about custom batching library #6458

Open tsinggggg opened 11 months ago

tsinggggg commented 11 months ago

Hi Maintainers,

Thanks for the great work. This is probably not a bug report, but more of a question.

I am following https://github.com/triton-inference-server/backend/tree/main/examples#custom-batching to build the single-batching shared object file (https://github.com/triton-inference-server/backend/tree/main/examples/batching_strategies/single_batching) on macOS.

Can I run cmake -DTRITON_ENABLE_GPU=OFF -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install .. to turn off GPU support? In my use case we don't need a GPU.

Then make install gives me the error below

[  2%] Building CXX object _deps/repo-common-build/src/CMakeFiles/triton-common-async-work-queue.dir/async_work_queue.cc.o
[  5%] Building CXX object _deps/repo-common-build/src/CMakeFiles/triton-common-async-work-queue.dir/error.cc.o
[  8%] Building CXX object _deps/repo-common-build/src/CMakeFiles/triton-common-async-work-queue.dir/thread_pool.cc.o
[ 11%] Linking CXX static library libtritonasyncworkqueue.a
[ 11%] Built target triton-common-async-work-queue
[ 14%] Building CXX object _deps/repo-backend-build/CMakeFiles/triton-backend-utils.dir/src/backend_common.cc.o
[ 17%] Building CXX object _deps/repo-backend-build/CMakeFiles/triton-backend-utils.dir/src/backend_input_collector.cc.o
[ 20%] Building CXX object _deps/repo-backend-build/CMakeFiles/triton-backend-utils.dir/src/backend_memory.cc.o
In file included from /Users/chengqian/work/custom_batching/build/_deps/repo-backend-src/src/backend_memory.cc:27:
In file included from /Users/chengqian/work/custom_batching/build/_deps/repo-backend-src/include/triton/backend/backend_memory.h:28:
In file included from /Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/string:504:
In file included from /Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/string_view:175:
In file included from /Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/__string:57:
In file included from /Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/algorithm:640:
/Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/type_traits:1547:38: error: implicit instantiation of
      undefined template 'std::__1::hash<triton::backend::BackendMemory::AllocationType>'
    : public integral_constant<bool, __is_empty(_Tp)> {};
                                     ^
/Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/unordered_map:425:18: note: in instantiation of template
      class 'std::__1::is_empty<std::__1::hash<triton::backend::BackendMemory::AllocationType> >' requested here
          bool = is_empty<_Hash>::value && !__libcpp_is_final<_Hash>::value>
                 ^
/Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/unordered_map:860:13: note: in instantiation of default
      argument for '__unordered_map_hasher<triton::backend::BackendMemory::AllocationType,
      std::__1::__hash_value_type<triton::backend::BackendMemory::AllocationType, TRITONSERVER_Error *>,
      std::__1::hash<triton::backend::BackendMemory::AllocationType> >' required here
    typedef __unordered_map_hasher<key_type, __value_type, hasher>   __hasher;
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Users/chengqian/work/custom_batching/build/_deps/repo-backend-src/src/backend_memory.cc:118:59: note: in instantiation
      of template class 'std::__1::unordered_map<triton::backend::BackendMemory::AllocationType, TRITONSERVER_Error *,
      std::__1::hash<triton::backend::BackendMemory::AllocationType>,
      std::__1::equal_to<triton::backend::BackendMemory::AllocationType>, std::__1::allocator<std::__1::pair<const
      triton::backend::BackendMemory::AllocationType, TRITONSERVER_Error *> > >' requested here
  std::unordered_map<AllocationType, TRITONSERVER_Error*> errors;
                                                          ^
/Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/type_traits:428:50: note: template is declared here
template <class _Tp> struct _LIBCPP_TEMPLATE_VIS hash;
                                                 ^
1 error generated.
make[2]: *** [_deps/repo-backend-build/CMakeFiles/triton-backend-utils.dir/src/backend_memory.cc.o] Error 1
make[1]: *** [_deps/repo-backend-build/CMakeFiles/triton-backend-utils.dir/all] Error 2
make: *** [all] Error 2

Regards

tsinggggg commented 11 months ago

A follow-up question: is the single_batching example thread safe? Could different threads check whether the batch is empty at the same time and both try to add requests to it, so that the batch ends up with more than one request?
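
For context, the decision the single_batching example makes can be sketched as below. This is only a Python-flavored illustration with hypothetical names; the real example is a C++ shared library built against Triton's batching API, not this code.

# Python-flavored sketch of the single-batching decision referred to above.
# The real example is a C++ shared library; these names are hypothetical.
class SingleBatchingSketch:
    def __init__(self) -> None:
        self.batch_size = 0  # requests already admitted to the current batch

    def include_request(self) -> bool:
        # The race being asked about: if two schedulers ran this check at the
        # same time, both could see batch_size == 0 and both admit a request.
        # Per the maintainer reply below, Triton uses a single scheduling
        # thread per model instance, so this check is not run concurrently.
        if self.batch_size == 0:
            self.batch_size += 1
            return True
        return False

    def finalize_batch(self) -> None:
        # Called once the batch is dispatched; the next batch starts empty.
        self.batch_size = 0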

dyastremsky commented 11 months ago

In theory, you should be able to. I do not know if you need that flag, but it looks like the error is actually coming from the common repo used by the backend repo.

If you use build.py and don't enable GPU, it does something similar. It's worth looking at that file, since that's the generally recommended and tested route. Note that there is no official Mac support, so we do not build on or test for it.

There is only one scheduling thread per model instance (with its own custom batcher) to avoid any race conditions, as far as I know.

tsinggggg commented 11 months ago

Thanks @dyastremsky, it seems that the official example https://github.com/triton-inference-server/backend/tree/main/examples/batching_strategies/single_batching sometimes does not work as expected.

Description: With the official single_batching custom plugin, I sometimes see batches with more than one request.

Triton Information

23.09 (using the container)

To Reproduce

  1. Prepare libtriton_singlebatching.so from https://github.com/triton-inference-server/backend/tree/main/examples/batching_strategies/single_batching by following the steps in https://github.com/triton-inference-server/backend/tree/main/examples#custom-batching

  2. Prepare a dummy Python backend model:

config.pbtxt

name: "model"
backend: "python"
max_batch_size: 32
dynamic_batching { }
parameters: { key: "TRITON_BATCH_STRATEGY_PATH", value: {string_value: "/custom_batching/libtriton_singlebatching.so"}}

input [{
        name: "TEXT"
        data_type: TYPE_STRING
        dims: [-1]
    }
]

output [{
        name: "INTENT_NAMES"
        data_type: TYPE_STRING
        dims: [-1]
    }
]

instance_group [
    {
      count: 1
      kind: KIND_CPU
    }
]

model.py

from typing import Dict, List

import numpy as np
import time
import triton_python_backend_utils as pb_utils

class TritonPythonModel:

    def initialize(self, args: Dict[str, str]) -> None:
        self.logger = pb_utils.Logger

    def execute(self, requests) -> "List[List[pb_utils.Tensor]]":
        responses = []
        if len(requests) > 1:
            # A batch with more than one request means the single-batching
            # strategy did not cap the batch as expected.
            self.logger.log_info(f"Number of requests: {len(requests)}")
        for _ in requests:
            # Simulate per-request work so that concurrent requests queue up.
            time.sleep(0.1)
            outputs = []
            tensor = pb_utils.Tensor("INTENT_NAMES", np.array(['hello', 'bye'], dtype=object))
            outputs.append(tensor)
            inference_response = pb_utils.InferenceResponse(output_tensors=outputs)
            responses.append(inference_response)

        return responses
  3. Start Triton:
docker run --rm -p8000:8000 -p8001:8001 -p8002:8002 -v /path/to/libtriton_singlebatching.so:/custom_batching -v /path/to/dummy/model:/models nvcr.io/nvidia/tritonserver:23.09-py3 tritonserver --model-repository=/models --log-verbose=1
  4. Send concurrent requests using https://github.com/bojand/ghz (a Python alternative is sketched after the JSON payload below):
    ghz --insecure --proto grpc_service.proto --call inference.GRPCInferenceService.ModelInfer -D grpc-triton-batch.json localhost:8001 -c 50 -n 1000 -t 10s

grpc_service.proto is from the link below

https://github.com/triton-inference-server/common/blob/cf617c93c75d8e03da962115a2471e4b89062aae/protobuf/grpc_service.proto

The grpc-triton-batch.json file contains:

{
    "model_name": "model",
    "model_version": "1",
    "inputs": [{
        "name": "TEXT",
        "shape": [1, 1],
        "datatype": "BYTES",
        "contents": {
            "bytes_contents": ""
        }
    }]
}
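
For reference, an equivalent load can also be generated from Python. The sketch below is a hypothetical alternative to the ghz command above, not part of the original report; it assumes tritonclient[grpc] and numpy are installed and Triton is listening on localhost:8001.

# Hypothetical Python alternative to the ghz load generator above.
# Assumes: pip install tritonclient[grpc] numpy, Triton gRPC on localhost:8001.
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import tritonclient.grpc as grpcclient


def one_request(_: int):
    # One client (and gRPC channel) per call to avoid sharing across threads.
    client = grpcclient.InferenceServerClient("localhost:8001")
    try:
        text = grpcclient.InferInput("TEXT", [1, 1], "BYTES")
        text.set_data_from_numpy(np.array([[b"hello"]], dtype=object))
        result = client.infer(model_name="model", inputs=[text])
        return result.as_numpy("INTENT_NAMES")
    finally:
        client.close()


if __name__ == "__main__":
    # Roughly mirrors ghz -c 50 -n 1000: 50 concurrent workers, 1000 requests.
    with ThreadPoolExecutor(max_workers=50) as pool:
        results = list(pool.map(one_request, range(1000)))
    print(f"completed {len(results)} requests")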
  5. Results

Even though the custom plugin is identified:

I1026 02:06:17.178926 1 model_config_utils.cc:680] Server side auto-completed config: name: "model"
max_batch_size: 32
input {
  name: "TEXT"
  data_type: TYPE_STRING
  dims: -1
}
output {
  name: "INTENT_NAMES"
  data_type: TYPE_STRING
  dims: -1
}
instance_group {
  count: 1
  kind: KIND_CPU
}
default_model_filename: "model.py"
dynamic_batching {
}
parameters {
  key: "TRITON_BATCH_STRATEGY_PATH"
  value {
    string_value: "/custom_batching/libtriton_singlebatching.so"
  }
}
backend: "python"

I saw log lines such as the ones below, indicating a batch with more than one request:

I1026 02:06:22.401634 1 python_be.cc:1273] model model, instance model_0_0, executing 2 requests
I1026 02:06:22.404210 1 model.py:21] Number of requests: 2

Expected behavior: all batches should have exactly one request.
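
As a quick way to confirm this from the verbose log, the log can be scanned for the "executing N requests" lines shown above. This helper is hypothetical and not part of the original report.

# Hypothetical helper: scan a verbose Triton log for "executing N requests"
# lines (as shown above) and report any batch holding more than one request.
import re
import sys

EXECUTING = re.compile(r"executing (\d+) requests")


def oversized_batches(log_path: str):
    with open(log_path) as log:
        for line_no, line in enumerate(log, start=1):
            match = EXECUTING.search(line)
            if match and int(match.group(1)) > 1:
                yield line_no, int(match.group(1))


if __name__ == "__main__":
    # Usage: python check_batches.py <verbose_triton_log.txt>
    for line_no, size in oversized_batches(sys.argv[1]):
        print(f"line {line_no}: batch of {size} requests")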

dyastremsky commented 11 months ago

Thank you for sharing how to reproduce your results. Can you please upload your full logs? Verbose, if possible.

tsinggggg commented 11 months ago

single_batch_log.txt

@dyastremsky thanks for your response, please see the attached file for the verbose logs.

dyastremsky commented 11 months ago

Thank you! Looks like it was loaded correctly. I created a ticket for us to investigate.

Ticket reference number: DLIS-5718.