intel / intel-extension-for-transformers

⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Platforms⚡
Apache License 2.0

Is there a solution to accelerate the inference of large models through multi-core? #873

Closed Liu-xiandong closed 8 months ago

Liu-xiandong commented 8 months ago

Is there a solution to accelerate the inference of large models through multi-core? Is the current approach to assign an operator's work (e.g., GEMM and GEMV) to multiple cores, or to split the model?

airMeng commented 8 months ago

Hi xiandong, I don't quite understand the question. Currently for GEMM we use OpenMP to utilize multiple cores. Do you want more details of the multi-core path, or, by

> to split the model

do you mean you want to know more about multi-socket, a.k.a. tensor parallelism?
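For reference, a minimal sketch of what "OpenMP for GEMM" means here (illustrative only, not the project's actual kernel; layout and names are assumptions for this example):

```cpp
// Illustrative only: OpenMP spreads the GEMM output elements across cores;
// each thread computes a disjoint block of C.
#include <omp.h>

// C[M x N] = A[M x K] * B[K x N], row-major
void gemm_omp(const float* A, const float* B, float* C, int M, int N, int K) {
    #pragma omp parallel for collapse(2)
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            float acc = 0.f;
            for (int k = 0; k < K; ++k)
                acc += A[m * K + k] * B[k * N + n];
            C[m * N + n] = acc;
        }
    }
}
```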

Liu-xiandong commented 8 months ago

> Hi xiandong, I don't quite understand the question. Currently for GEMM we use OpenMP to utilize multiple cores. Do you want more details of the multi-core path, or, by
>
> > to split the model
>
> do you mean you want to know more about multi-socket, a.k.a. tensor parallelism?

Hi airMeng. Thanks for the reply; my question was not precise. I'm curious which parallel strategy is adopted on multi-core processors: data parallelism, model parallelism, pipeline parallelism, or another strategy?

airMeng commented 8 months ago

At the operator level, activation and weight tensors are both split across cores; you can turn to the Parallel chapter here. At the model level we leverage so-called tensor parallelism, where only weight tensors are dispatched to different sockets; you can refer to this doc.
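As an illustration of the model-level split (a hedged sketch, not the project's actual implementation; the `Shard` struct and function names are invented for this example), the weight is partitioned column-wise, each socket computes its own slice of the output, and the slices are concatenated along N:

```cpp
// Column-wise tensor-parallel sketch: each rank (socket) owns a slice of the
// weight's N dimension and produces the matching slice of the output.
#include <vector>

struct Shard {
    int n_begin, n_end;    // columns [n_begin, n_end) owned by this rank
    std::vector<float> W;  // K x (n_end - n_begin), row-major weight slice
};

// y[M x (n_end - n_begin)] = X[M x K] * W_shard
void matmul_shard(const std::vector<float>& X, const Shard& s,
                  std::vector<float>& y, int M, int K) {
    int n_cols = s.n_end - s.n_begin;
    y.assign(static_cast<std::size_t>(M) * n_cols, 0.f);
    for (int m = 0; m < M; ++m)
        for (int k = 0; k < K; ++k)
            for (int n = 0; n < n_cols; ++n)
                y[m * n_cols + n] += X[m * K + k] * s.W[k * n_cols + n];
}
// A final gather/concat along N (e.g. over shared memory or MPI) reassembles
// the full [M x N] output across sockets.
```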

Liu-xiandong commented 8 months ago

Thank you. I understand that tensor parallelism is used when batch_size=1. If batch_size > 1, shouldn't each task be bound to a different thread, e.g., one thread handling one task, so that multiple cores can be utilized?

Liu-xiandong commented 8 months ago

> Thank you. I understand that tensor parallelism is used when batch_size=1. If batch_size > 1, shouldn't each task be bound to a different thread, e.g., one thread handling one task, so that multiple cores can be utilized?

This is for inference.

airMeng commented 8 months ago

> Thank you. I understand that tensor parallelism is used when batch_size=1. If batch_size > 1, shouldn't each task be bound to a different thread, e.g., one thread handling one task, so that multiple cores can be utilized?

In most GEMMs, the batch size (the M dimension of the GEMM) is relatively small, often less than 4. To enhance GEMM efficiency, as you're aware, a larger M is required for effective cache utilization and pipeline execution. Consequently, parallel processing along the small M dimension is generally not implemented. You can refer to some related information: https://petewarden.com/2015/10/25/an-engineers-guide-to-gemm/. So whether or not the batch size is 1, we split along the N dimension, because the weight tensor is much larger and loading the weights is the bottleneck of GEMM.
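A minimal sketch of the N-dimension split with OpenMP (assumed layout and naming, not the real kernel): each thread streams a disjoint slice of the weight's columns, so the weight-loading bottleneck is spread across cores even when M is 1.

```cpp
// Illustrative N-split GEMM: C[M x N] = A[M x K] * B[K x N], row-major.
// Each thread owns a contiguous range of output columns (a slice of B).
#include <omp.h>

void gemm_split_n(const float* A, const float* B, float* C,
                  int M, int N, int K) {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        int chunk = (N + nthreads - 1) / nthreads;
        int n0 = tid * chunk;
        int n1 = (n0 + chunk < N) ? n0 + chunk : N;
        for (int n = n0; n < n1; ++n)       // this thread's slice of N
            for (int m = 0; m < M; ++m) {   // M (batch) is typically tiny
                float acc = 0.f;
                for (int k = 0; k < K; ++k)
                    acc += A[m * K + k] * B[k * N + n];
                C[m * N + n] = acc;
            }
    }
}
```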

Regarding other operators: since activation functions typically involve small tensors, splitting along the batch-size dimension might not always be advantageous due to the overhead of multi-core dispatching. We do still parallelize some of them, especially larger activations.
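For example, a size-gated element-wise kernel (illustrative only; the threshold value and function name are arbitrary assumptions, not the project's code):

```cpp
// Element-wise ops are cheap per element, so thread dispatch can cost more
// than it saves on small tensors; an "if" clause gates the OpenMP path.
#include <omp.h>
#include <cmath>
#include <cstddef>

void gelu_inplace(float* x, std::size_t n) {
    const std::size_t kParallelThreshold = 1 << 16;  // assumed cutoff
    #pragma omp parallel for if (n >= kParallelThreshold)
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(n); ++i)
        x[i] = 0.5f * x[i] * (1.f + std::erff(x[i] / std::sqrt(2.f)));
}
```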

If you observe any chance to optimize multi-thread performance, even just ideas, please don't hesitate to let us know. We welcome any help from the community.

Liu-xiandong commented 8 months ago

Thank you for your patient reply; I understand fully now.