triton-inference-server / fastertransformer_backend

BSD 3-Clause "New" or "Revised" License

Does it support having multiple instances of the same model on one GPU device? #72

Closed changleilei closed 1 year ago

changleilei commented 1 year ago

I have a 3090 GPU. I deployed the Triton service using FasterTransformer as the backend. The model is GPT2, and there is still a lot of GPU memory left, so I want to deploy more instances of the same model to improve throughput. However, when I set the instance group count to 2, throughput does not increase but decreases, and the request exception rate also increases.

My goal is to obtain greater throughput. What can I do in this case? Build more Triton services?
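(For reference, the "instance group count" being described here is the `instance_group` block in the model's `config.pbtxt`; a minimal sketch is below, where the model name, batch size, and GPU index are illustrative placeholders rather than values taken from this issue.)

```protobuf
# Sketch of a config.pbtxt serving two instances of the same model on GPU 0.
# The name, max_batch_size, and gpus values are placeholders.
name: "fastertransformer"
backend: "fastertransformer"
max_batch_size: 8

instance_group [
  {
    count: 2        # two copies of the model on the same device
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```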

byshiue commented 1 year ago

What's your model size and problem size? Can you provide the measurements for 1 instance and 2 instances?

From a computation point of view, 1 server with 2 instances is the same as 2 servers with 1 instance each.

changleilei commented 1 year ago

The model takes about 6 GB of GPU memory when deployed, and with 2 instances the GPU memory usage does not increase. I will try to deploy more servers to increase throughput. Thank you for your reply!

changleilei commented 1 year ago

Hi @byshiue, I have now built 2 servers on one GPU, but it was slower than a single server on a single card. A queue is scheduling the requests, and batch merging does not seem to work because GPU memory usage does not increase. So is it the case that a GPU can only host one service, no matter how much memory it has, and throughput just depends on the GPU TFLOPS?

The nvidia-smi output is as below:

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:65:00.0 Off |                  N/A |
|  0%   42C    P8    32W / 370W |  12352MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   4005709      C   ...onserver/bin/tritonserver     6175MiB |
|    0   N/A  N/A   4005737      C   ...onserver/bin/tritonserver     6175MiB |
+-----------------------------------------------------------------------------+
```

The logs:

```
2022-11-14 06:07:20,667 | WARNING | task.py | add_task | 114 | Task queue depth is 18
2022-11-14 06:07:20,764 | WARNING | task.py | add_task | 114 | Task queue depth is 19
2022-11-14 06:07:20,941 | WARNING | task.py | add_task | 114 | Task queue depth is 20
2022-11-14 06:07:21,103 | WARNING | task.py | add_task | 114 | Task queue depth is 21
2022-11-14 06:07:21,265 | WARNING | task.py | add_task | 114 | Task queue depth is 22
2022-11-14 06:07:21,431 | WARNING | task.py | add_task | 114 | Task queue depth is 23
2022-11-14 06:07:21,593 | WARNING | task.py | add_task | 114 | Task queue depth is 24
2022-11-14 06:07:21,777 | WARNING | task.py | add_task | 114 | Task queue depth is 25
2022-11-14 06:07:21,930 | WARNING | task.py | add_task | 114 | Task queue depth is 26
2022-11-14 06:07:22,097 | WARNING | task.py | add_task | 114 | Task queue depth is 27
2022-11-14 06:07:22,201 | WARNING | task.py | add_task | 114 | Task queue depth is 28
2022-11-14 06:07:22,433 | WARNING | task.py | add_task | 114 | Task queue depth is 29
2022-11-14 06:07:22,682 | WARNING | task.py | add_task | 114 | Task queue depth is 30
2022-11-14 06:07:22,839 | WARNING | task.py | add_task | 114 | Task queue depth is 31
2022-11-14 06:07:23,017 | WARNING | task.py | add_task | 114 | Task queue depth is 32
2022-11-14 06:07:23,175 | WARNING | task.py | add_task | 114 | Task queue depth is 33
2022-11-14 06:07:23,339 | WARNING | task.py | add_task | 114 | Task queue depth is 34
```
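(For clarity, the "2 servers on one GPU" setup above means two separate `tritonserver` processes sharing the same device; a hypothetical launch sketch is below, where the model repository paths and port numbers are placeholders, not values from this issue.)

```sh
# Hypothetical sketch: two independent tritonserver processes on the same GPU,
# each listening on its own ports. Repository paths and ports are placeholders.
CUDA_VISIBLE_DEVICES=0 tritonserver --model-repository=/models/repo_a \
    --http-port=8000 --grpc-port=8001 --metrics-port=8002 &
CUDA_VISIBLE_DEVICES=0 tritonserver --model-repository=/models/repo_b \
    --http-port=9000 --grpc-port=9001 --metrics-port=9002 &
```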

byshiue commented 1 year ago

Do you mean you send the requests to both servers randomly, but the final latency is worse than a single server? Can you post the performance numbers you observe? Note that when you have two processes, only one process can use the GPU at a time if you don't enable MPS. You can find more details at https://docs.nvidia.com/deploy/mps/index.html.
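(The MPS daemon mentioned above is managed with `nvidia-cuda-mps-control`; a brief sketch of turning it on for this single-GPU setup, based on the linked documentation, follows.)

```sh
# Start the MPS control daemon so the two tritonserver processes can share the
# GPU's compute resources instead of strictly time-slicing it.
export CUDA_VISIBLE_DEVICES=0
nvidia-cuda-mps-control -d            # launch the MPS control daemon

# ... start the server processes and run the benchmark ...

echo quit | nvidia-cuda-mps-control   # shut the daemon down afterwards
```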

And what do you mean by "batch merging does not seem to work because GPU memory usage does not increase"?

changleilei commented 1 year ago

Yes. I use Nginx to load-balance multiple simultaneous requests and JMeter to measure server QPS. My results:

- 1 GPU with 1 server: deployment consumes about 6 GB of GPU memory, rising to about 9 GB during inference; QPS is 6.6/sec with a 0.0% abnormal rate.
- 1 GPU with 2 servers (as above): deployment consumes about 12 GB of GPU memory, which stays at about 12 GB during inference; QPS is 5.3/sec with a 98.0% abnormal rate.

So I guess dynamic batching (batch merge) is not working, because a larger batch would use more memory. I will try MPS. Thank you!!
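(For reference, the dynamic batching being discussed is the `dynamic_batching` block in the model's `config.pbtxt`; a minimal sketch follows, with the batch sizes and queue delay as illustrative values only.)

```protobuf
# Sketch of enabling Triton dynamic batching; values are illustrative.
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```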