kubeedge / ianvs

Distributed Synergy AI Benchmarking
https://ianvs.readthedocs.io
Apache License 2.0

Cloud-edge collaborative inference for LLM based on KubeEdge-Ianvs #96

Open hsj576 opened 1 month ago

hsj576 commented 1 month ago

What would you like to be added/modified: This issue aims to build a cloud-edge collaborative inference framework for LLMs on KubeEdge-Ianvs. Namely, it aims to help all cloud-edge LLM developers improve inference accuracy while preserving strong privacy and fast inference speed. This issue includes:

  1. Implement a benchmark of LLM tasks (e.g., basic tasks such as user question answering, code generation, or text translation) in KubeEdge-Ianvs.
  2. Provide an example of LLM cloud-edge collaborative inference implemented in KubeEdge-Ianvs.
  3. (Advanced) Implement cloud-edge collaborative algorithms for LLMs, such as speculative decoding (a minimal sketch follows this list).
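
For item 3, the following is a minimal, greedy-only sketch of speculative decoding, not the final design: a small edge model drafts k tokens and a larger model verifies them in a single forward pass, accepting the longest matching prefix. The model names, k, and the prompt are illustrative assumptions; in the real system the target model would be served in the cloud rather than loaded locally.

```python
# Hedged sketch of greedy speculative decoding with an edge draft model and a
# larger "cloud" target model. Model names and k are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

DRAFT_NAME = "Qwen/Qwen1.5-0.5B"   # assumed edge-side draft model
TARGET_NAME = "Qwen/Qwen1.5-7B"    # stand-in for the large cloud-side model

tokenizer = AutoTokenizer.from_pretrained(DRAFT_NAME)
draft = AutoModelForCausalLM.from_pretrained(DRAFT_NAME)
target = AutoModelForCausalLM.from_pretrained(TARGET_NAME)


@torch.no_grad()
def speculative_step(input_ids: torch.Tensor, k: int = 4) -> torch.Tensor:
    """Extend input_ids by up to k tokens via draft-then-verify (greedy only)."""
    n = input_ids.shape[1]
    # 1) The edge draft model proposes up to k tokens autoregressively.
    drafted = draft.generate(input_ids, max_new_tokens=k, do_sample=False)
    n_drafted = drafted.shape[1] - n
    # 2) The target model scores prefix + draft in a single forward pass.
    logits = target(drafted).logits
    accepted = input_ids
    for j in range(n_drafted):
        proposed = drafted[0, n + j]
        verified = logits[0, n + j - 1].argmax()   # target's greedy token at this position
        if proposed == verified:
            accepted = torch.cat([accepted, proposed.view(1, 1)], dim=1)
        else:
            # First mismatch: discard the rest of the draft, keep the
            # target's own token instead, and stop this step.
            accepted = torch.cat([accepted, verified.view(1, 1)], dim=1)
            break
    return accepted


prompt = tokenizer("Cloud-edge collaborative inference", return_tensors="pt").input_ids
print(tokenizer.decode(speculative_step(prompt)[0]))
```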

Why is this needed: At present, LLMs at the 10-billion to 100-billion parameter scale, led by Llama2-70b and Qwen-72b, can only be deployed in the cloud, where sufficient computing power is available to provide inference services. For users at edge terminals, however, cloud LLM services suffer from slow inference and long response latency, and uploading private edge data to the cloud for processing risks privacy disclosure. At the same time, the inference accuracy of LLMs that can be deployed in edge environments (such as TinyLlama-1.1b) is far lower than that of cloud LLMs. Using either the cloud LLM or the edge LLM alone therefore cannot simultaneously satisfy privacy protection, real-time inference, and inference accuracy. We need a cloud-edge collaboration strategy that combines the high accuracy of cloud LLMs with the strong privacy and fast inference of edge LLMs, so as to better meet the needs of edge users.
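
As one concrete illustration of such a collaboration strategy (related to the query-routing idea in the "Hybrid LLM" paper linked below), a query can be served by the edge LLM when it is confident and escalated to the cloud LLM otherwise. The sketch below is only a schematic: the two generate functions and the threshold are hypothetical placeholders, not the design this issue prescribes.

```python
# Minimal sketch of confidence-based cloud-edge routing; backends are stubs.
CONF_THRESHOLD = 0.8  # assumed tunable accuracy/latency trade-off knob


def edge_generate(query: str) -> tuple[str, float]:
    # Placeholder: run a small on-device model (e.g. TinyLlama-1.1b) and
    # report a confidence score such as the mean token probability.
    return "edge draft answer", 0.55


def cloud_generate(query: str) -> str:
    # Placeholder: call a large cloud-hosted model (e.g. Llama2-70b or a
    # commercial API); slower, and the query leaves the device.
    return "cloud answer"


def collaborative_inference(query: str) -> dict:
    answer, confidence = edge_generate(query)
    if confidence >= CONF_THRESHOLD:
        return {"answer": answer, "source": "edge", "confidence": confidence}
    # Only low-confidence queries pay the cloud's latency and privacy cost.
    return {"answer": cloud_generate(query), "source": "cloud", "confidence": confidence}


print(collaborative_inference("Translate 'hello' into French."))
```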

Recommended Skills: KubeEdge-Ianvs, Python, PyTorch, LLMs

Useful links:
Introduction to Ianvs
Unleashing the Power of Edge-Cloud Generative AI in Mobile Networks: A Survey of AIGC Services
Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing

MooreZheng commented 1 month ago

If anyone has questions regarding this issue, please feel free to leave a message here. We would also appreciate it if new members could introduce themselves to the community.

IcyFeather233 commented 1 month ago

Hi! To complete this issue, does it mean that I need to have the corresponding GPU resources to run large models for project debugging?

hsj576 commented 1 month ago

> Hi! To complete this issue, does it mean that I need to have the corresponding GPU resources to run large models for project debugging?

Yes, a student taking on this OSPP project needs access to at least one consumer-grade GPU (e.g., a 2080 or 3090). However, since this project mainly focuses on LLM inference, it does not require that much computing power. For the edge LLM, if your available computing resources are limited, you can choose a small-scale model such as TinyLlama-1.1b or Qwen1.5-0.5B; these models can run inference even on a personal laptop. For the cloud LLM, if your computing resources are not sufficient to deploy an LLM at the 10-billion or 100-billion parameter scale, you can use GPT-4, Claude3, Kimi, GLM4, or other commercial LLMs with open APIs.
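
For illustration, a minimal setup under these assumptions might look like the sketch below: the edge model runs locally via Hugging Face transformers, while the cloud side is any OpenAI-compatible commercial endpoint. The model names, the environment variable, and the choice of "gpt-4" are placeholders, not requirements of this project.

```python
# Hedged sketch: small edge LLM runs locally; cloud LLM is reached through an
# OpenAI-compatible API when no large local GPU is available.
import os

import torch
from openai import OpenAI
from transformers import AutoModelForCausalLM, AutoTokenizer

EDGE_MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # small enough for a laptop or 2080-class GPU

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(EDGE_MODEL)
edge_model = AutoModelForCausalLM.from_pretrained(EDGE_MODEL).to(device)


def edge_answer(prompt: str, max_new_tokens: int = 128) -> str:
    # Local inference on the edge device; no data leaves the machine.
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    output = edge_model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)


def cloud_answer(prompt: str) -> str:
    # Any OpenAI-compatible commercial service (GPT-4, GLM4, Kimi, ...) can be
    # used here; the API key variable and model name depend on your provider.
    client = OpenAI(api_key=os.environ["CLOUD_LLM_API_KEY"])
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


print(edge_answer("What is KubeEdge-Ianvs?"))
```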