System Info
GPU: H100
Using the latest v0.9.0 code and image from NGC
Who can help?
No response
Information
[ ] The official example scripts
[ ] My own modified scripts
Tasks
[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)
Reproduction
Build the Llama 2 13B engine.
Launch the Triton server in decoupled mode.
Use the inflight_batcher_llm/client/inflight_batcher_llm_client.py script to test streaming mode.
Expected behavior
Normal streaming output.
Actual behavior
Exception as follows:
0603 11:27:18.631800 29219 pb_stub.cc:751] "Failed to process the request(s) for model 'tensorrt_llm_0_0', message: Python model 'tensorrt_llm_0_0' is using the decoupled mode and the execute function must return None."
Received an error from server:
Python model 'tensorrt_llm_0_0' is using the decoupled mode and the execute function must return None.
Encountered error: Python model 'tensorrt_llm_0_0' is using the decoupled mode and the execute function must return None.
Encountered error: Python model 'tensorrt_llm_0_0' is using the decoupled mode and the execute function must return None.
Exception ignored in: <function InferenceServerClient.__del__ at 0x7f6e5c42f910>
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/tritonclient/grpc/_client.py", line 257, in __del__
File "/usr/local/lib/python3.10/dist-packages/tritonclient/grpc/_client.py", line 265, in close
File "/usr/local/lib/python3.10/dist-packages/grpc/_channel.py", line 2250, in close
File "/usr/local/lib/python3.10/dist-packages/grpc/_channel.py", line 2231, in _close
AttributeError: 'NoneType' object has no attribute 'StatusCode'
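The error message reflects Triton's contract for Python-backend models in decoupled mode: execute() must stream results back through each request's response sender and return None, rather than returning a list of responses. A non-runnable pseudocode sketch of that contract (triton_python_backend_utils is only available inside the Triton server, and generate_tokens / make_inference_response are hypothetical helpers):

```
# Pseudocode: decoupled-mode execute() in Triton's Python backend.
# Responses go through the request's response sender, not the return value.
class TritonPythonModel:
    def execute(self, requests):
        for request in requests:
            sender = request.get_response_sender()
            for partial in generate_tokens(request):      # hypothetical helper
                sender.send(make_inference_response(partial))
            # Signal end-of-stream for this request.
            sender.send(flags=TRITONSERVER_RESPONSE_COMPLETE_FINAL)
        return None  # returning anything else triggers the error above
```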
It turns out the backend setting in the tensorrt_llm model's config.pbtxt should be set to tensorrtllm instead of the default python. Problem solved; closing this issue.
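For reference, the fix is a one-line change in the tensorrt_llm model's config.pbtxt (Triton model configuration; the rest of the file is unchanged):

```
backend: "tensorrtllm"
```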
Additional notes
In previous versions, streaming worked normally.