kubeedge / ianvs

Distributed Synergy AI Benchmarking
https://ianvs.readthedocs.io
Apache License 2.0

add: Heterogeneous Multi-Edge Collaborative Neural Network Inference for High Mobility Scenarios: Base on KubeEdge-Ianvs proposal #115

Closed wyoung1 closed 1 month ago

wyoung1 commented 4 months ago

Heterogeneous Multi-Edge Collaborative Neural Network Inference for High Mobility Scenarios: Base on KubeEdge-Ianvs proposal

What type of PR is this? /kind design

What this PR does / why we need it: This PR is a proposal to enhance the existing multi-edge inference paradigm with automatic partitioning and scheduling functions, and to create a new benchmarking job tailored for high-mobility scenarios.

Which issue(s) this PR fixes: https://github.com/kubeedge/ianvs/issues/100

Fixes #

tangming1996 commented 3 months ago

I am mainly concerned about several points:

  1. How is automatic device discovery achieved? If discovery is not automatic, how do users add these devices?
  2. What is the automatic scheduling strategy? For example, some devices have a GPU and some do not, and GPU types may differ across devices; what is our scheduling strategy in that case?
  3. How do we support heterogeneous devices such as Android, Apple, Linux, and Windows devices: with the help of other open source implementations, or with our own implementation?
  4. I feel this is better suited to being a Sedna control-plane capability than a specific algorithm capability; it should be made universal.

These points may be preconditions that users of this project must understand. If open source capabilities are used, they should be explained in the proposal. If we implement it ourselves, there should be an architecture diagram explaining how it is achieved.

MooreZheng commented 3 months ago

@hsj576 might also need to take a look at this proposal

wyoung1 commented 3 months ago

Pros:

  1. On-edge scenarios have been provided in the background.
  2. The architecture has been added into the proposal.

Cons:

  1. The user layer is not clear. The object indicator (job.yaml) needs to change.
  2. It is not clear why this proposal uses a pipeline partition. The proposal should discuss the choice among pipeline, tensor, and data partitioning.

Thanks for the advice! I have revised the architecture diagram and modified the statements regarding pipeline parallelism. Our original intention was to implement model parallelism after automatic model partitioning. To address the data dependency issues that model parallelism has under multiple requests, users can customize a pipeline parallelism algorithm based on the partitioning capabilities we provide.
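To illustrate why pipeline parallelism helps once several requests are in flight, here is a minimal sketch that compares plain model parallelism (one request occupies the whole chain of partitions at a time) with pipelined execution (each partition starts the next request as soon as it is free). The stage times are made-up numbers, and the function names are hypothetical; nothing here is taken from the proposal's implementation.

```python
from typing import List


def sequential_makespan(stage_times: List[float], num_requests: int) -> float:
    """Each request runs through every partition before the next request starts."""
    return num_requests * sum(stage_times)


def pipelined_makespan(stage_times: List[float], num_requests: int) -> float:
    """Requests overlap: a partition takes the next request as soon as it is idle
    and the previous partition has produced that request's input."""
    # finish[s] is the time at which partition s finished its most recent request.
    finish = [0.0] * len(stage_times)
    for _ in range(num_requests):
        ready = 0.0  # time at which this request's input for the next partition exists
        for s, duration in enumerate(stage_times):
            start = max(ready, finish[s])
            finish[s] = start + duration
            ready = finish[s]
    return finish[-1]


# Assumed per-partition inference times in seconds (illustrative only).
stage_times = [2.0, 3.0, 1.0]
sequential = sequential_makespan(stage_times, 4)
pipelined = pipelined_makespan(stage_times, 4)
print(sequential, pipelined)
```

With identical requests, the pipelined total equals one full pass plus one slowest-stage interval per extra request, so balancing the partitions matters more as the number of requests grows.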

wyoung1 commented 3 months ago

Thanks for the advice! I apologize for some inaccuracies in the proposal; I have made the necessary revisions. Regarding the points you raised, my answers are as follows:

  1. Device discovery is crucial when offloading computing tasks to edge devices, and tools like EdgeMesh can be used to discover devices and request their participation. However, this is out of the scope of our project. In Ianvs, we only need to focus on automatic model partitioning. The devices are simulated by the user, and their number is declared in the devices.yaml file.

  2. Unlike scheduling in real scenarios, scheduling in our project means running each part of the partitioned model on an appropriate device, chosen according to the device's computing power and communication bandwidth. Users can simulate devices themselves through methods like Docker and then declare the characteristics of the devices in the devices.yaml file.

  3. For heterogeneous devices, Docker can be used to abstract these differences. However, as in the existing ReID benchmarking job, this is something users need to handle themselves. We only need to provide the partitioning algorithm.

  4. Yes, this is very suitable for the application scenario of Sedna. Our initial consideration was to integrate into Sedna, but we plan to first implement a basic partitioning algorithm on Ianvs to verify the effect; integrating into Sedna will be our future work.

In summary, our final benchmarking job should work like this: the user simulates the specific heterogeneous environment themselves (through Docker, by limiting GPU memory, etc.) and then declares a devices.yaml that specifies the number of heterogeneous devices in the simulated environment, the specific information of each device (GPU memory, etc.), and the communication bandwidth between devices. Our Ianvs module then parses this yaml, calculates the matching list of devices and computational subgraphs based on our own algorithm, and returns this matching list together with the partitioned subgraphs.
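The flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the dictionary stands in for an already parsed devices.yaml, every key name (`gpu_memory_mb`, `gflops`, `links`, `mbps`) and both function names are hypothetical, the layer figures are invented, and the greedy memory-fill heuristic is a placeholder rather than the partitioning algorithm of the proposal.

```python
from typing import Dict, List, Tuple

# Stand-in for a parsed devices.yaml; the schema is an assumption.
DEVICES_CONFIG = {
    "devices": [
        {"name": "edge-0", "gpu_memory_mb": 2048, "gflops": 500.0},
        {"name": "edge-1", "gpu_memory_mb": 4096, "gflops": 1000.0},
    ],
    "links": [{"src": "edge-0", "dst": "edge-1", "mbps": 100.0}],
}

# A model viewed as a chain of layers (illustrative numbers only).
LAYERS = [
    {"name": "conv1", "memory_mb": 800, "gflop": 2.0, "output_mb": 4.0},
    {"name": "conv2", "memory_mb": 1000, "gflop": 4.0, "output_mb": 2.0},
    {"name": "conv3", "memory_mb": 1500, "gflop": 6.0, "output_mb": 1.0},
    {"name": "fc", "memory_mb": 500, "gflop": 1.0, "output_mb": 0.01},
]


def greedy_partition(layers: List[Dict], devices: List[Dict]) -> List[Tuple[str, List[str]]]:
    """Fill each device with consecutive layers until its memory is exhausted."""
    matching = []
    idx = 0
    for device in devices:
        free = device["gpu_memory_mb"]
        assigned = []
        while idx < len(layers) and layers[idx]["memory_mb"] <= free:
            free -= layers[idx]["memory_mb"]
            assigned.append(layers[idx]["name"])
            idx += 1
        if assigned:
            matching.append((device["name"], assigned))
    if idx < len(layers):
        raise ValueError("model does not fit on the declared devices")
    return matching


def estimate_latency_s(matching, layers: List[Dict], config: Dict) -> float:
    """Sum compute time per device plus transfer time between consecutive devices."""
    by_name = {layer["name"]: layer for layer in layers}
    gflops = {d["name"]: d["gflops"] for d in config["devices"]}
    mbps = {(l["src"], l["dst"]): l["mbps"] for l in config["links"]}
    total = 0.0
    for i, (device, names) in enumerate(matching):
        total += sum(by_name[n]["gflop"] for n in names) / gflops[device]
        if i + 1 < len(matching):
            next_device = matching[i + 1][0]
            # Megabytes to megabits, then divide by the link rate in Mbit/s.
            total += by_name[names[-1]]["output_mb"] * 8 / mbps[(device, next_device)]
    return total


matching = greedy_partition(LAYERS, DEVICES_CONFIG["devices"])
latency = estimate_latency_s(matching, LAYERS, DEVICES_CONFIG)
print(matching)
print(latency)
```

In this toy setup the transfer between the two devices dominates the estimated latency, which is why the declared bandwidth belongs in the device description alongside memory and compute figures.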

hsj576 commented 3 months ago

The proposal looks fine to me.

tangming1996 commented 3 months ago

Good job! Thanks for your reply. Later, when your work is done, it will be necessary to create a tutorial that guides users on how to use this algorithm.

kubeedge-bot commented 2 months ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

To complete the pull request process, please assign moorezheng after the PR has been reviewed. You can assign the PR to them by writing /assign @moorezheng in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[OWNERS](https://github.com/kubeedge/ianvs/blob/main/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment
Approvers can cancel approval by writing `/approve cancel` in a comment

MooreZheng commented 1 month ago

/lgtm

kubeedge-bot commented 1 month ago

New changes are detected. LGTM label has been removed.