Open Missmiaom opened 10 months ago
Maybe you set end_id in the request? I came across the same issue as yours in #100. In my case, if I set end_id to 2, I get the same problem as yours; if I do not pass end_id, everything works well except that inference won't stop at end_id.
@CaesarWWK Hi, I have set end_id to 0
Yes, you can entirely remove end_id and test again.
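A minimal sketch of what "remove end_id" means on the client side: just leave the field out of the request payload. The input names here (`text_input`, `max_tokens`) are assumptions based on the usual tensorrtllm_backend ensemble inputs, not confirmed from this thread.

```python
# Build a generation request payload without end_id.
# Omitting end_id works around the hang, at the cost of generation
# not stopping early at the end token (as noted above).
payload = {
    "text_input": "Hello, world",
    "max_tokens": 64,
    # "end_id": 2,   # deliberately omitted
}

print("end_id" in payload)  # → False
```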
Hi @Missmiaom @CaesarWWK @byshiue, I got the same issue, any update? Will just removing end_id work?
It did not work for me @calico-niko. I gave up on in-flight batching and switched to dynamic batching.
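For anyone trying the same workaround: switching to dynamic batching is done in the Triton model's `config.pbtxt`. A sketch using standard Triton fields (the specific batch sizes and delay are placeholder values, not taken from this thread):

```protobuf
# Triton model config.pbtxt fragment: enable server-side dynamic batching.
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```

With this, Triton groups individual requests into batches on the server instead of relying on the backend's in-flight batcher.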
> Hi @Missmiaom @CaesarWWK @byshiue I got same issue, any update? just remove end_id will work?
Yes it works for me.
Disabling enable_trt_overlap seems to work for me.
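For reference, in the tensorrtllm_backend model's `config.pbtxt` this flag is set as a string-valued parameter; a sketch, assuming the backend's usual parameter-block convention:

```protobuf
# tensorrtllm_backend config.pbtxt fragment: turn off TRT overlap.
parameters: {
  key: "enable_trt_overlap"
  value: { string_value: "False" }
}
```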
When using tensor parallelism, the compute utilization of one GPU drops to 0 while the other rises to 100%, the request never returns, and the service cannot handle new requests.
gpt_model_type is V1 (not in-flight batching).
@byshiue
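For context, the model type mentioned above is also selected in `config.pbtxt`; a sketch of the parameter as it is usually written for the tensorrtllm_backend (the exact accepted values should be checked against the backend's docs):

```protobuf
# tensorrtllm_backend config.pbtxt fragment: select the V1 (non in-flight
# batching) scheduler instead of inflight_fused_batching.
parameters: {
  key: "gpt_model_type"
  value: { string_value: "V1" }
}
```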