ZihanLiao opened this issue 8 months ago
Hi @ZihanLiao, I haven't been able to reproduce the issue you are reporting. I used a slightly simplified version for debugging where I send a cancellation request after 10 tokens:

```python
# Loop over the trtllm responses
count = 0
for trtllm_response in trtllm_responses:
    count = count + 1
    ...
    # Send a cancellation request
    if count == 10:
        stop_trtllm_tensor = [pb_utils.Tensor("stop", np.array([[True]], dtype=bool))]
        request_output_names = ["stop"]
        trtllm_request = pb_utils.InferenceRequest(
            model_name="tensorrt_llm",
            inputs=stop_trtllm_tensor,
            requested_output_names=request_output_names,
        )
        _ = trtllm_request.exec()
        bls_response_sender.send(
            flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
        return None
```
and don't run into any errors. Here's the client output when I print the `text_output` tensor:

```
root@aiap-dt1:/app# python3 inflight_batcher_llm/client/end_to_end_grpc_client.py -p "This is a test" --model-name "tensorrt_llm_bls" --streaming
[b' of']
[b' the']
[b' power']
[b' of']
[b' the']
[b' Internet']
[b'.']
[b' It']
[b"'s"]
```
What version of Triton are you using? Could you try with 23.12 or 24.01?
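For reference, the early-stop control flow in the snippet above can be exercised outside of Triton by stubbing out `pb_utils`. `FakeTensor` and `FakeRequest` below are hypothetical stand-ins, not real Triton classes; the real `pb_utils` objects behave differently, so this sketch only mirrors the shape of the logic:

```python
import numpy as np

class FakeTensor:
    """Stand-in for pb_utils.Tensor: just records a name and a value."""
    def __init__(self, name, value):
        self.name, self.value = name, value

class FakeRequest:
    """Stand-in for pb_utils.InferenceRequest with a no-op exec()."""
    def __init__(self, model_name, inputs, requested_output_names):
        self.model_name = model_name
        self.inputs = inputs
        self.requested_output_names = requested_output_names
    def exec(self):
        return None  # the real call would deliver the cancellation

def consume(trtllm_responses, send_stop_after=10):
    """Count responses and issue a stop request after `send_stop_after`."""
    count = 0
    for _ in trtllm_responses:
        count += 1
        if count == send_stop_after:
            # Shape [1, 1] boolean "stop" tensor, as in the snippet above.
            stop = [FakeTensor("stop", np.array([[True]], dtype=bool))]
            FakeRequest("tensorrt_llm", stop, ["stop"]).exec()
            return count
    return count

print(consume(range(100)))  # -> 10: the loop exits after 10 responses
```

With fewer than 10 simulated responses the loop simply drains the iterator and returns the total, so the stop path is only taken on long generations.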
Thanks for your reply! Indeed, the problem is hard to reproduce and might be related to this, though I'm not sure. The GIL break occurred from time to time when I sent sequential requests to the server. I'm using Triton version 23.10; I will try the latest version.
Ok, I know that there were a few issues with the Triton python backend code fixed in Triton 23.12. See https://github.com/triton-inference-server/python_backend/commit/8b0fa4cc5daa4b1891cdc5b0b42079dbe2a60eae and https://github.com/triton-inference-server/python_backend/commit/c5f304decda609ab21a004c525436e58dd527190
Can you try with Triton 23.12 or Triton 24.01? If you can still reproduce it after upgrading the Triton version, I can spend more time on this and work with the Triton team to root-cause it.
Thank you.
Problem: I added some lines of code to make the server support early stopping. The following is my `model.py`.
Error:

```
Fatal Python error: PyEval_SaveThread: the function must be called with the GIL held, but the GIL is released (the current Python thread state is NULL)
```
Version: v0.6.1
There might be a place in `execute()` that doesn't handle the Python thread state correctly.