triton-inference-server / fastertransformer_backend

BSD 3-Clause "New" or "Revised" License

Questions about different intra-node settings for fastertransformer_backend and FasterTransformer #113

Open YJHMITWEB opened 1 year ago

YJHMITWEB commented 1 year ago

Hi, I am wondering why in FasterTransformer intra-node GPUs are bound at the process level, while in fastertransformer_backend they are bound at the thread level. Since the two share the same source code, why does the intra-node binding differ?

byshiue commented 1 year ago

Multi-process is more flexible and stable because it can be used for both multi-GPU and multi-node deployments. But in the Triton server we want multiple model instances to be able to share the same model, and hence we need to use multi-threading.
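
As a rough illustration of the two layouts (this is not the actual FasterTransformer/Triton code; `Model` and `rank_worker` are made-up names), thread-level binding keeps every intra-node GPU inside one process so a single loaded model can be shared, whereas process-level binding would launch one process per GPU:

```cpp
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

struct Model {
    // In the real backend this would own the per-GPU weight shards.
    int layers = 24;
};

// Thread-level binding: one thread per intra-node GPU, all in one process,
// so every thread can reference the same in-memory model.
void rank_worker(int device_id, const Model& model) {
    // cudaSetDevice(device_id);  // each thread would select its own GPU here
    std::printf("GPU %d driven by a thread, model at %p\n",
                device_id, static_cast<const void*>(&model));
}

int main() {
    Model model;  // loaded once in this process
    std::vector<std::thread> workers;
    for (int gpu = 0; gpu < 2; ++gpu)
        workers.emplace_back(rank_worker, gpu, std::cref(model));
    for (auto& w : workers) w.join();
    // Process-level binding (as in standalone FasterTransformer with MPI)
    // would start one process per GPU instead, which also works across nodes
    // but cannot share one in-memory model between Triton model instances.
    return 0;
}
```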

YJHMITWEB commented 1 year ago

Hi @byshiue ,

Thanks for the reply. I am a bit confused here. When tensor parallelism is enabled, FasterTransformer expects it to happen intra-node. For example, if each node has 2 GPUs and we set tensor_parallel=2, then when the model is loaded the weights are sliced into two parts and each GPU loads one part. In that case, what do you mean by "we hope multiple model instances can share same model, and hence we need to use multi-thread"? Here, each thread is responsible for different weights.
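
To make sure I have the slicing right, here is a toy sketch of what I mean (not FasterTransformer code; `slice_columns` is a made-up helper): with tensor_parallel=2, each GPU would hold a different half of the columns of each weight matrix.

```cpp
#include <cstdio>
#include <vector>

// Column-slice a row-major rows x cols matrix for one tensor-parallel rank.
std::vector<float> slice_columns(const std::vector<float>& w,
                                 int rows, int cols,
                                 int rank, int tensor_para_size) {
    const int cols_per_rank = cols / tensor_para_size;
    std::vector<float> shard(static_cast<size_t>(rows) * cols_per_rank);
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols_per_rank; ++c)
            shard[static_cast<size_t>(r) * cols_per_rank + c] =
                w[static_cast<size_t>(r) * cols + rank * cols_per_rank + c];
    return shard;
}

int main() {
    const int hidden = 4;
    std::vector<float> full(hidden * hidden, 1.0f);  // stand-in for one weight
    for (int rank = 0; rank < 2; ++rank) {
        auto shard = slice_columns(full, hidden, hidden, rank, /*tp=*/2);
        std::printf("rank %d holds %zu of %zu values\n",
                    rank, shard.size(), full.size());
    }
    return 0;
}
```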

Is my understanding correct?

byshiue commented 1 year ago

Multi-instance is independent of TP. It is simpler to demonstrate on a single GPU: assume we have one GPU, we create a GPT model on it, and then create 2 model instances based on that GPT model. These two instances can then handle different requests while sharing the same weights.
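
Roughly, the idea looks like this (a toy sketch with made-up names `GptWeights` and `ModelInstance`, not the actual backend code): the weights are created once, and both instances only hold a pointer to them while serving their own requests.

```cpp
#include <cstdio>
#include <string>
#include <thread>
#include <vector>

struct GptWeights {
    // In the real backend this would hold the device buffers for the model.
    std::string name = "gpt";
};

struct ModelInstance {
    int id;
    const GptWeights* weights;  // shared, not copied

    void serve(const std::vector<std::string>& requests) const {
        for (const auto& req : requests)
            std::printf("instance %d handled \"%s\" using weights %p\n",
                        id, req.c_str(), static_cast<const void*>(weights));
    }
};

int main() {
    GptWeights weights;  // the single GPT model on the single GPU
    ModelInstance a{0, &weights}, b{1, &weights};

    // Different requests go to different instances, but both read the same
    // weights, which is why the instances live in one process (threads).
    std::thread t1([&] { a.serve({"req-0", "req-2"}); });
    std::thread t2([&] { b.serve({"req-1", "req-3"}); });
    t1.join();
    t2.join();
    return 0;
}
```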

YJHMITWEB commented 1 year ago

Oh, I see. I get it now: multi-instance basically serves to handle different requests. Thanks for the explanation!