Closed aaditya-srivathsan closed 5 days ago
@aaditya-srivathsan We are reviewing this ticket and will get back to you with updates.
@ganeshku1 any update on this?
@aaditya-srivathsan We are working on resolving this issue. Will update this thread once this issue is resolved.
cc: @dyastremsky
Hi @aaditya-srivathsan, I've seen some similar issues reported that were solved by setting --use_custom_all_reduce disable.
Can you try this to see if it helps?
Sure, let me try this and I'll let you know whether it works.
This did help, thank you very much!
System Info
A100 160GB (2 x 80GB)
Who can help?
@byshiue @kaiyux
Reproduction
Build from source by cloning the main branch of tensorrtllm_backend
Download weights from HF
Set the directory and generate the engines
Then start your Triton server like so
Finally, in a separate terminal
Expected behavior
The expected behavior is to get throughput and latency numbers.
actual behavior
The command just hangs and doesn't return anything.
additional notes
I wrote a custom script which uses gRPC over tritonclient to send synchronous requests. Initially it completes each request in about 8 seconds, but after roughly 40 such requests it just hangs.
The tritonserver logs with verbose logging enabled look like this:
The server never returns a response and just hangs.
Quantizing to INT4 doesn't help either.
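For anyone trying to reproduce the hang described above, here is a minimal sketch of a sequential request loop that stops at the first request which does not come back in time. This is not the original script: `run_sequential`, `send`, and `timeout_s` are hypothetical names, and `send` stands in for whatever tritonclient gRPC call the real script makes.

```python
import threading
import time
from typing import Callable, List, Optional, Tuple


def run_sequential(
    send: Callable[[int], object],
    num_requests: int,
    timeout_s: float,
) -> Tuple[List[float], Optional[int]]:
    """Send requests one at a time.

    Returns the latencies of the completed requests and the index of the
    first request that did not finish within timeout_s (None if all did).
    `send` is a placeholder for the real client call (assumption).
    """
    latencies: List[float] = []
    for i in range(num_requests):
        done = threading.Event()

        def worker(idx: int = i, done: threading.Event = done) -> None:
            try:
                send(idx)
            finally:
                # Signal completion even if the call raised.
                done.set()

        start = time.perf_counter()
        # Daemon thread so a stuck request cannot block interpreter exit.
        threading.Thread(target=worker, daemon=True).start()
        if not done.wait(timeout_s):
            return latencies, i  # request i looks hung
        latencies.append(time.perf_counter() - start)
    return latencies, None


if __name__ == "__main__":
    # Stand-in for the server: request 3 never returns in time.
    def fake_send(i: int) -> None:
        if i == 3:
            time.sleep(2)

    lat, stalled = run_sequential(fake_send, num_requests=10, timeout_s=0.2)
    print(f"completed={len(lat)} stalled_at={stalled}")
```

In a real run, `send` would wrap the synchronous inference call. As far as I know, the tritonclient gRPC `infer` call also accepts a `client_timeout` argument, which may be a simpler way to surface the stall as an error instead of an indefinite hang.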