triton-inference-server / fastertransformer_backend


No response is received during inference in decoupled mode. #169

Open amazingkmy opened 11 months ago

amazingkmy commented 11 months ago

Description

Branch: main
GPU: V100

My model type is GPTNeoX.

Reproduced Steps

https://github.com/triton-inference-server/fastertransformer_backend/blob/main/docs/gptneox_guide.md#decoupled-mode

Following the guide above, I ran tritonserver with my model. I changed request_output_len to 512 and sent a request.
After 130-140 tokens were generated, no further response arrived, but the GPU was still holding memory.
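For reference, the request is sent roughly like this. This is a minimal sketch in the spirit of tools/issue_request.py, not my exact script: the server URL and prompt ids are placeholders, and the tensor names/dtypes follow the GPT-NeoX model config from the guide.

import numpy as np
import tritonclient.grpc as grpcclient

def callback(result, error):
    # In decoupled mode, each generated chunk arrives through this callback.
    if error is not None:
        print("error:", error)
    else:
        print("partial output_ids:", result.as_numpy("output_ids"))

with grpcclient.InferenceServerClient("localhost:8001") as client:
    input_ids = np.array([[5, 16162, 1079]], dtype=np.uint32)   # placeholder prompt
    input_lengths = np.array([[input_ids.shape[1]]], dtype=np.uint32)
    request_output_len = np.array([[512]], dtype=np.uint32)     # the value I changed

    inputs = []
    for name, data in [("input_ids", input_ids),
                       ("input_lengths", input_lengths),
                       ("request_output_len", request_output_len)]:
        tensor = grpcclient.InferInput(name, list(data.shape), "UINT32")
        tensor.set_data_from_numpy(data)
        inputs.append(tensor)

    client.start_stream(callback=callback)
    client.async_stream_infer("fastertransformer", inputs)
    client.stop_stream()   # waits for the stream to drain before closing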

I set FT_LOG_LEVEL to DEBUG and ran the test again, launching the server roughly as in the sketch below.
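(A minimal sketch; FasterTransformer reads FT_LOG_LEVEL from the environment, and the model repository path here is a placeholder.)

import os
import subprocess

# Pass FT_LOG_LEVEL=DEBUG through to tritonserver's environment.
env = dict(os.environ, FT_LOG_LEVEL="DEBUG")
subprocess.run(["tritonserver", "--model-repository=/workspace/models"], env=env)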
No error logs appeared while tritonserver was starting.

Error logs were found while the response was being generated:

[FT][DEBUG] bool fastertransformer::TensorMap::isExist(const string&) const for key: ia3_tasks
[FT][DEBUG] T* fastertransformer::Tensor::getPtr() const [with T = const int] start
[FT][DEBUG] void fastertransformer::cublasMMWrapper::Gemm(cublasOperation_t, cublasOperation_t, int, int, int, const void*, int, const void*, int, void*, int)
[FT][DEBUG] void fastertransformer::cublasMMWrapper::Gemm(cublasOperation_t, cublasOperation_t, int, int, int, const void*, int, const void*, int, void*, int, float, float)
[FT][DEBUG] T* fastertransformer::Tensor::getPtr() const [with T = __half] start
[FT][DEBUG] void fastertransformer::cublasMMWrapper::Gemm(cublasOperation_t, cublasOperation_t, int, int, int, const void*, const void*, cudaDataType_t, int, const void*, cudaDataType_t, int, const void*, void*, cudaDataType_t, int, cudaDataType_t, cublasGemmAlgo_t)
[FT][DEBUG] static std::shared_ptr<std::unordered_map<std::__cxx11::basic_string<char>, triton::Tensor> > GptNeoXTritonModelInstance<T>::convert_outputs(const std::unordered_map<std::__cxx11::basic_string<char>, fastertransformer::Tensor>&) [with T = __half]
W0926 04:57:16.322694 353779 libfastertransformer.cc:1182] response is nullptr

This is the result of running issue_request.py:
[After 8.39s] Partial result (probability 1.00e+00):
[     5  16162   1079     28    201  89910  41589   3222  33368   2884
   4599      0    201      5  10933  16350     28    201  89910    769
     10  28080  18414    253  72990     14    223    379    409    813
    978   1000    223    392    662   1752    979 100805  93716    475
   1779     11    529   6767    862   4640  67458   2814  79524   3671
   6767    635   1397   7175  22735  16192   2923  36295   1276     16
    565    720   1582   3613  28922  16192   2923  12317    223    379
    455   5052  22109   4683  21622    487  45312    505  65355  29250
   2047   7175    885  48088  17994   2450    713    637  26029  26914
    476  31803  25949     16  33012   6902  16867  10587   2217   1301
  26977    654  32299    873     16   3561   6093    529   1997  14498
  94855   6814   1071  26776   4173   3191     16    223    386  12586
   1031  36099   3009    639    554    885   1183  12495   1694  11948
     14    223    391    815    931    588    223    379    456   6750
    223    379    461    813    792    813    990    813    600    366
     26   5052   1445    936  17791   2490  15822   7055   1789  87914
   1545      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0]
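The trailing zeros appear to be padding written after generation stalled. A small helper like this (my own sketch; it assumes id 0 is the pad value, as in this dump) counts how many ids were actually produced:

import numpy as np

def generated_length(output_ids, pad_id=0):
    # Return the index just past the last non-padding token; an isolated
    # zero inside the real output (there is one near the top of the dump)
    # is kept, since only the trailing run counts as padding.
    nonpad = np.nonzero(np.asarray(output_ids).ravel() != pad_id)[0]
    return 0 if nonpad.size == 0 else int(nonpad[-1]) + 1

By that count, everything after the final 1545 in the dump above is padding.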

Process Process-1:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace/fastertransformer_backend/tools/issue_request.py", line 112, in stream_consumer
    result = queue.get()
  File "/opt/conda/lib/python3.10/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
TypeError: InferenceServerException.__init__() missing 1 required positional argument: 'msg'
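The TypeError itself looks like a separate client-side problem: tritonclient's InferenceServerException apparently cannot round-trip through the multiprocessing queue, because unpickling re-calls __init__ without the required msg argument. A workaround sketch (the stream_callback/queue names are my guesses, not necessarily what issue_request.py uses) is to enqueue plain values instead of the exception object:

def stream_callback(queue, result, error):
    # Enqueue plain, picklable values; the InferenceServerException object
    # itself fails to unpickle on the consumer side because its __init__
    # requires a msg argument.
    if error is not None:
        queue.put(("error", str(error)))
    else:
        queue.put(("result", result.as_numpy("output_ids")))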

Please review this issue, @byshiue.