triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

When using tensor parallelism, the computing power usage of one of the GPUs drops to 0 #110

Open · Missmiaom opened this issue 10 months ago

Missmiaom commented 10 months ago

When using tensor parallelism, the computing power usage of one of the GPUs drops to 0 while the usage of the other GPU rises to 100%; the request never gets a response, and the service cannot handle new requests.

gpt_model_type is V1 (i.e., not in-flight batching).

@byshiue
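
For context, the batching mode is selected in the tensorrt_llm model's config.pbtxt. A minimal excerpt, assuming the standard template shipped with this repo:

```
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "V1"
  }
}
```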

CaesarWWK commented 10 months ago

Maybe you set end_id in the request? I came across the same issue as yours in #100. In my case, if I set end_id to 2, I got the same problem. If I do not pass end_id, everything works well, except that inference won't stop on end_id.
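
For reference, a minimal Python client sketch that omits end_id entirely. It assumes the default ensemble model and the text_input / max_tokens / text_output tensor names used in this repo's examples:

```python
import numpy as np
import tritonclient.http as httpclient

# Assumes Triton is serving the default "ensemble" model on localhost:8000.
client = httpclient.InferenceServerClient(url="localhost:8000")

text = np.array([["What is machine learning?"]], dtype=object)
max_tokens = np.array([[64]], dtype=np.int32)

inputs = [
    httpclient.InferInput("text_input", text.shape, "BYTES"),
    httpclient.InferInput("max_tokens", max_tokens.shape, "INT32"),
]
inputs[0].set_data_from_numpy(text)
inputs[1].set_data_from_numpy(max_tokens)

# Note: no "end_id" input is attached, so that optional tensor is simply
# not sent with the request.
result = client.infer(model_name="ensemble", inputs=inputs)
print(result.as_numpy("text_output"))
```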

Missmiaom commented 10 months ago

@CaesarWWK Hi, I have set end_id to 0.

CaesarWWK commented 10 months ago

> @CaesarWWK Hi, I have set end_id to 0.

Yes, you can entirely remove end_id and test again.

calico-niko commented 9 months ago

Hi @Missmiaom @CaesarWWK @byshiue, I got the same issue. Any update? Does just removing end_id work?

Missmiaom commented 9 months ago

That did not work for me, @calico-niko. I gave up on in-flight batching and switched to dynamic batching.
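
For anyone trying the same workaround, a sketch of the config.pbtxt changes, assuming the standard template: keep gpt_model_type at V1 and add Triton's dynamic_batching block (the delay value here is illustrative, not a recommendation):

```
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "V1"
  }
}
dynamic_batching {
  max_queue_delay_microseconds: 100  # illustrative value
}
```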

CaesarWWK commented 9 months ago

> Hi @Missmiaom @CaesarWWK @byshiue, I got the same issue. Any update? Does just removing end_id work?

Yes, it works for me.

calico-niko commented 9 months ago

Disabling enable_trt_overlap seems to work for me.
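
For anyone else hitting this, that flag also lives in the tensorrt_llm model's config.pbtxt. A sketch, assuming the standard template:

```
parameters: {
  key: "enable_trt_overlap"
  value: {
    string_value: "False"
  }
}
```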