[Enhancement Request] Integrate Plato into Sedna as a backend for supporting federated learning

XinYao1994 commented 3 years ago

What would you like to be added/modified:

Plato is a new software framework to facilitate scalable federated learning. It currently supports PyTorch and MindSpore. Its main advantages are summarized as follows:

Simplicity: Plato provides user-friendly APIs.
Scalability: Plato supports running multiple (an unlimited number of) workers that take turns sharing one GPU.
Extensibility: Plato manages various machine learning models and aggregation algorithms.
Framework-agnostic: Most of the codebase in Plato can be used with various machine learning libraries.
Hierarchical Design: Plato supports multiple-level cells, including edge-cloud (2 levels) federated learning and device-edge-cloud (3 levels) federated learning.

This proposal discusses how to integrate Plato into Sedna as a backend for supporting federated learning. @li-ch @baochunli @jaypume

Why is this needed: The motivation for this proposal can be summarized as follows:

Algorithm: Sedna (Aggregator) currently supports FedAvg. With Plato, Sedna can choose among various aggregation algorithms, such as FedAvg, Adaptive Freezing, Mistnet, and Adaptive sync.
Dataset: In Sedna, user data must be prepared manually. Plato can provide a "datasources" module that includes various public datasets (e.g., cifar10, cinic10, and coco). Non-IID samplers could also be supported.
Model: Sedna specifies the model as a file in the image and uploads the whole model to the server. With Plato, all models can be specified through user configuration. The Report class can help a worker decide its strategy for uploading gradients for fast convergence, such as Adaptive Freezing, Nova, Sarah, Mistnet, and so on.
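
For readers unfamiliar with FedAvg, the baseline aggregation mentioned above, here is a minimal sketch in plain Python. It is not Sedna's or Plato's implementation; the `fedavg` function and the list-of-floats weight format are assumptions made for illustration. Each client's weights are averaged in proportion to the number of samples it trained on.

```python
from typing import Dict, List, Tuple

# Hypothetical weight format: layer name -> flat list of parameters.
Weights = Dict[str, List[float]]


def fedavg(updates: List[Tuple[Weights, int]]) -> Weights:
    """Average client weights, weighting each client by its sample count."""
    if not updates:
        raise ValueError("no client updates to aggregate")
    total = sum(num_samples for _, num_samples in updates)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    first_weights, _ = updates[0]
    aggregated = {name: [0.0] * len(values) for name, values in first_weights.items()}
    for weights, num_samples in updates:
        share = num_samples / total  # this client's contribution to the average
        for name, values in weights.items():
            for i, value in enumerate(values):
                aggregated[name][i] += share * value
    return aggregated


# Two clients: the second holds three times as much data as the first.
updates = [
    ({"layer": [1.0, 2.0]}, 1),
    ({"layer": [3.0, 4.0]}, 3),
]
result = fedavg(updates)
```

The other algorithms listed above would replace this averaging step with their own aggregation logic.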

Plans:

Overview

Sedna aims to provide the following federated learning features:

Users write short, simple configuration files in Sedna to support flexible federated learning setups.
It should handle real industrial datasets and simulate non-IID versions of standard public datasets used in academia.
It should consider how to configure a customized model.

Therefore, two resources are updated:

Dataset: The definition of Dataset
Model: The definition of model

Configuration updates in aggregationWorker and trainingWorkers:

apiVersion: sedna.io/v1alpha1
kind: FederatedLearningJob
metadata:
  name: surface-defect-detection
spec:
  aggregationWorker:
    # read and write
    model:
      name: "surface-defect-detection-model"
    platoConfig: 
      url: "sdd_rcnn.yml" # stored in S3 or github
    template:
      spec:
        nodeName: $CLOUD_NODE
        containers:
          - image: kubeedge/sedna-example-federated-learning-surface-defect-detection-aggregation:v0.1.0
            name:  agg-worker
            imagePullPolicy: IfNotPresent
            # env: # user defined environments
            resources:  # user defined resources
              limits:
                memory: 2Gi
    dataset:
      name: "cloud-surface-defect-detection-dataset"

  trainingWorkers:
    # read only
    - model:
        name: "surface-defect-detection-model"
      dataset:
        name: "edgeX-surface-defect-detection-dataset"
      template:
        spec:
          nodeName: $EDGE1_NODE
          containers:
            - image: kubeedge/sedna-example-federated-learning-surface-defect-detection-train:v0.1.0
              name:  train-worker
              imagePullPolicy: IfNotPresent
              env:  # user defined environments or given by the GlobalManager. 
                - name: "server_ip"
                  value: "localhost"
                - name: "server_port"
                  value: "8000"
              resources:  # user defined resources
                limits:
                  memory: 2Gi
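
As a rough illustration of how a training worker might consume the `server_ip` and `server_port` environment variables from the manifest above, here is a small sketch. The `aggregator_address` helper is hypothetical and not part of the Sedna library; the defaults mirror the values shown in the manifest.

```python
import os
from typing import Mapping


def aggregator_address(env: Mapping[str, str] = os.environ) -> str:
    """Build the aggregator address from the worker's environment."""
    host = env.get("server_ip", "localhost")
    port = int(env.get("server_port", "8000"))
    if not 0 < port < 65536:
        raise ValueError(f"invalid server_port: {port}")
    return f"{host}:{port}"


# Values as they might be injected by the user or the GlobalManager.
address = aggregator_address({"server_ip": "10.0.0.5", "server_port": "9000"})
default_address = aggregator_address({})
```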

How to write Plato code in Sedna

Users only need to prepare the configuration file in public storage. The Plato code resides in the Sedna libraries. An example configuration file, sdd_rcnn.yml:

clients:
  # Type
  type: mistnet
  # The total number of clients
  total_clients: 1
  # The number of clients selected in each round
  per_round: 1
  # Should the clients compute test accuracy locally?
  do_test: false

# this will be discarded in the future
# server:
#  address: localhost
#  port: 8000

data:
  datasource: sednaDataResource
  # Number of samples in each partition
  partition_size: 128
  # IID or non-IID?
  sampler: iid

trainer:
  # The type of the trainer
  type: yolov5
  # The maximum number of training rounds
  rounds: 1
  # Whether the training should use multiple GPUs if available
  parallelized: false
  # The maximum number of clients running concurrently
  max_concurrency: 3
  # The target accuracy
  target_accuracy: 0.99
  # Number of epochs for local training in each communication round
  epochs: 500
  batch_size: 16
  optimizer: SGD
  linear_lr: false
  # The machine learning model
  model_name: sednaModelResource

algorithm:
  # Aggregation algorithm
  type: mistnet
  cut_layer: 4
  epsilon: 100
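
To make the `partition_size` and `sampler` settings more concrete, the sketch below shows one way a per-client partition could be drawn. It is not Plato's sampler: the function names are hypothetical, and the non-IID variant uses a simple label-skew scheme (each client only sees a few classes) chosen for brevity.

```python
import random
from typing import List, Sequence


def iid_partition(num_samples: int, partition_size: int, client_id: int, seed: int = 0) -> List[int]:
    """Draw partition_size distinct sample indices uniformly at random."""
    rng = random.Random(seed + client_id)  # a different but reproducible draw per client
    return rng.sample(range(num_samples), partition_size)


def label_skew_partition(
    labels: Sequence[int],
    partition_size: int,
    client_id: int,
    classes_per_client: int = 2,
    seed: int = 0,
) -> List[int]:
    """Draw indices only from a small random subset of classes (non-IID)."""
    rng = random.Random(seed + client_id)
    classes = sorted(set(labels))
    chosen = set(rng.sample(classes, min(classes_per_client, len(classes))))
    candidates = [i for i, label in enumerate(labels) if label in chosen]
    return rng.sample(candidates, min(partition_size, len(candidates)))


# A toy dataset with 10 balanced classes and 1000 samples.
labels = [i % 10 for i in range(1000)]
iid_indices = iid_partition(len(labels), 128, client_id=0)
skewed_indices = label_skew_partition(labels, 128, client_id=0)
```

With the IID sampler every client sees roughly the full class distribution, while the skewed variant restricts each client to `classes_per_client` classes.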

How to integrate the Dataset in Plato

In this part, several functions are added to Dataset.

  apiVersion: sedna.io/v1alpha1
  kind: Dataset
  metadata:
    name: "edge1-surface-defect-detection-dataset"
  spec:
    name: COCO
    data_params: packages/coco128.yaml
    # if download_url is None, the data is expected to already be stored on disk
    download_url: https://github.com/ultralytics/yolov5/releases/download/v1.0/coco128.zip
    data_path: ./data/
    train_path: ./data/COCO/coco128/images/train2017/
    test_path: ./data/COCO/coco128/images/train2017/
    # number of classes
    num_classes: 80
    # image size
    image_size: 640
    # class names
    classes:
        [
            "person",
            "bicycle",
            ...
        ]
    # remark
    format: ""
    nodeName: $EDGE1_NODE

How to integrate the Models management tools in Plato

In this part, several functions are added to Model.

  apiVersion: sedna.io/v1alpha1
  kind: Model
  metadata:
    name: "surface-defect-detection-model"
  spec:
    model_name: vgg_16
    url: "/model"
    # ONNX (https://onnx.ai/) or specify a framework 
    format: "ckpt"
    framework: "PyTorch"
    model_config: packages/models/vgg_16.yaml
    # if true, the model needs to be loaded from url before training
    pretrained: True

To-Do Lists

[ ] Enhance aggregationWorker and trainingWorkers interfaces
[ ] Datasets interface
[ ] Models management
[ ] Examples and demo presentation
[ ] CV: yolo-v5 demo in KubeEdge
[ ] NLP: huggingface demo in KubeEdge

jaypume commented 3 years ago

Hi @XinYao1994, it's great to see Plato being integrated into the federated learning feature of Sedna.

But I don't think the CRD interface should expose "PlatoConfig" to Sedna users, for two reasons:

User configuration is not intuitive: To check deployment parameters, people deploying Sedna federated learning tasks have to open the sdd_rcnn.yml file manually, which is cumbersome and unintuitive.
Inelegant interface exposure: In Sedna, run-time parameters of the federated task are already configured in the CRD. Configuring parameters of the same type in another place makes the interface untidy.

So I think the Plato parameters should be expanded into the Sedna CRD interface.

The following table shows the proposed expansion:

Plato section	Parameter	Value	Integration approach	Description
clients	type	Mistnet	As a param of training script	-
clients	total_clients	2	Delete	In Sedna, the number of total_clients is auto-detected.
clients	per_round	1	As a param of aggregation script	Client selection should be a function of the aggregation algorithm.
clients	do_test	false	As a code snippet of training script	-
data	datasource	data_1	As a param of training script	Need to unify the data format.
data	partition_size	128	Delete	Simulation and production functions should be decoupled. This may need a dataset management component in Sedna.
data	sampler	iid	Delete	Same as above
trainer	type	yolov5	As a built-in training script of Sedna	-
trainer	rounds	1	As a param of aggregation script	Exit_check() should be a function of the aggregation algorithm.
trainer	parallelized	false	Add new type param to Sedna	May require adding a topology param type to Sedna.
trainer	max_concurrency	3	Add new type param to Sedna	May require adding a topology param type to Sedna.
trainer	target_accuracy	0.99	As a param of aggregation script	Target_accuracy should be a function of the aggregation algorithm.
trainer	epochs	500	As a param of training script	-
trainer	batch_size	16	As a param of training script	-
trainer	optimizer	SGD	As a param of training script	-
trainer	linear_lr	false	As a param of training script	-
trainer	model_name	model_1	As a param of training script	Need to identify the differences between the model in Sedna and the model in Plato.
algorithm	type	mistnet	As a param of aggregation script	-
algorithm	cut_layer	4	As a param of aggregation script	-
algorithm	epsilon	100	As a param of aggregation script	-
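
One way to read the table above is as a routing rule from each Plato parameter to either the training script, the aggregation script, or nowhere. The sketch below encodes that reading; the `ROUTING` map, the `split_plato_config` function, and the `section_key` naming of the resulting environment variables are all assumptions for illustration, and parameters the table handles by other means (`do_test`, `trainer.type`, `parallelized`, `max_concurrency`) are simply left out.

```python
from typing import Any, Dict, List, Mapping, Tuple

EnvList = List[Dict[str, str]]

# (section, parameter) -> destination, following the table above.
ROUTING: Dict[Tuple[str, str], str] = {
    ("clients", "type"): "training",
    ("clients", "total_clients"): "delete",
    ("clients", "per_round"): "aggregation",
    ("data", "datasource"): "training",
    ("data", "partition_size"): "delete",
    ("data", "sampler"): "delete",
    ("trainer", "rounds"): "aggregation",
    ("trainer", "target_accuracy"): "aggregation",
    ("trainer", "epochs"): "training",
    ("trainer", "batch_size"): "training",
    ("trainer", "optimizer"): "training",
    ("trainer", "linear_lr"): "training",
    ("trainer", "model_name"): "training",
    ("algorithm", "type"): "aggregation",
    ("algorithm", "cut_layer"): "aggregation",
    ("algorithm", "epsilon"): "aggregation",
}


def split_plato_config(config: Mapping[str, Mapping[str, Any]]) -> Tuple[EnvList, EnvList]:
    """Split a Plato-style config into env entries for each worker type."""
    training: EnvList = []
    aggregation: EnvList = []
    for section, params in config.items():
        for key, value in params.items():
            target = ROUTING.get((section, key))
            # Prefix with the section so clients.type and algorithm.type do not collide.
            entry = {"name": f"{section}_{key}", "value": str(value)}
            if target == "training":
                training.append(entry)
            elif target == "aggregation":
                aggregation.append(entry)
            # "delete" and unlisted parameters are dropped in this sketch.
    return training, aggregation


plato_config = {
    "clients": {"type": "mistnet", "total_clients": 1, "per_round": 1},
    "data": {"datasource": "sednaDataResource", "partition_size": 128, "sampler": "iid"},
    "trainer": {"rounds": 1, "epochs": 500, "batch_size": 16},
    "algorithm": {"type": "mistnet", "cut_layer": 4, "epsilon": 100},
}
training_env, aggregation_env = split_plato_config(plato_config)
```

The two returned lists have the same shape as the `env` entries of a container spec, so they could be dropped into the worker templates.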

XinYao1994 commented 3 years ago

Hello, thanks for your suggestions! @jaypume

1. Add Plato Config into the CRD interface.

Agree with @jaypume

2. Classification of Plato config parameters and suggestions for adding parameters in Sedna.

Category	Name	Meaning	Plato	Sedna
Cluster Specification	total_num_clients	The total number of clients	clients.total_clients	automatic
Cluster Specification	maximum_num_clients	The maximum number of clients running concurrently	trainer.max_concurrency	-
Cluster Specification	device	Whether the training should use multiple GPUs if available	trainer.parallelized	needed
Data Specification	-	The training and testing dataset	data.datasource	service
Data Specification	-	Where the dataset is located	data.data_path	service
Data Specification	optional	Number of samples in each partition	data.partition_size	-
Data Specification	optional	IID or non-IID	data.sampler	-
Model Specification	overwrite	The machine learning model	trainer.model	service
Stop Condition	rounds	The maximum number of training rounds	trainer.rounds	needed
Stop Condition	target_accuracy	The target accuracy	trainer.target_accuracy	needed
Stop Condition	delta_loss	The convergence condition of the loss function	-	needed
Learning Hyperparameters	epochs	Number of epochs for local training in each communication round	trainer.epochs	supported
Learning Hyperparameters	batch_size	Batch size for local training in each communication round	trainer.batch_size	supported
Learning Hyperparameters	optimizer	Optimizer for local training in each communication round	trainer.optimizer	supported
Learning Hyperparameters	learning_rate	Learning rate for local training in each communication round	trainer.learning_rate	supported
Learning Hyperparameters	momentum	Momentum for local training in each communication round	trainer.momentum	needed
Learning Hyperparameters	weight_decay	Weight decay for local training in each communication round	trainer.weight_decay	needed
Aggregation Protocol	client_type (overwrite)	The type of the client	clients.type	needed
Aggregation Protocol	algorithm_type (overwrite)	The type of aggregation algorithm = The type of server	algorithm.type	needed
Aggregation Protocol	select_num_clients	The number of clients selected in each round	clients.per_round	needed
Aggregation Protocol	optional	The layers before this layer are used for extracting features	algorithm.cut_layer	needed
Aggregation Protocol	optional	Whether to apply local differential privacy	algorithm.epsilon	needed
Others	overwrite/trainer_type	The type of the trainer, which defines the training task and training logic	trainer.type	needed
Others	optional	Should the clients compute test accuracy locally	clients.do_test	-

3. Planned support for early stop condition in Plato.

A predetermined accuracy is already supported (see Section 2).
An early stop condition is planned (see Section 2).

jaypume commented 3 years ago

Here is a CRD example based on the discussion above:

apiVersion: sedna.io/v1alpha1
kind: FederatedLearningJob
metadata:
  name: surface-defect-detection
spec:
  stopCondition:
    operator: "or" # and
    conditions:
      - operator: ">"
        threshold: 100
        metric: rounds
      - operator: ">"
        threshold: 0.95
        metric: targetAccuracy
      - operator: "<"
        threshold: 0.03
        metric: deltaLoss
  transmitter:
    transmitter_list:
      - name: "simple" # simple, adaptive_freezing, adaptive_sync ...
        parameters:
          - name: "sync_frequency"
            value: "10"
      - name: "adaptive_freezing" # simple, adaptive_freezing, adaptive_sync ...
        parameters:
          - name: "sync_frequency"
            value: "10"
  aggregationTrigger:
    condition:
      operator: ">"
      threshold: 5
      metric: num_of_ready_clients
  aggregationWorker:
    model:
      name: "surface-defect-detection-model"
    template:
      spec:
        nodeName: "cloud"
        containers:
          - image: kubeedge/sedna-example-federated-learning-surface-defect-detection-aggregation:v0.1.0
            name: agg-worker
            imagePullPolicy: IfNotPresent
            env: # user defined environments
              - name: "cut_layer"
                value: "4"
              - name: "epsilon"
                value: "100"
              - name: "aggregation_algorithm"
                value: "mistnet"
              - name: "batch_size"
            resources: # user defined resources
              limits:
                memory: 2Gi
  trainingWorkers:
    - dataset:
        name: "edge1-surface-defect-detection-dataset"
      template:
        spec:
          nodeName: "edge1"
          containers:
            - image: kubeedge/sedna-example-federated-learning-surface-defect-detection-train:v0.1.0
              name: train-worker
              imagePullPolicy: IfNotPresent
              env: # user defined environments
                - name: "batch_size"
                  value: "32"
                - name: "learning_rate"
                  value: "0.001"
                - name: "epochs"
                  value: "1"
              resources: # user defined resources
                limits:
                  memory: 2Gi
    - dataset:
        name: "edge2-surface-defect-detection-dataset"
      template:
        spec:
          nodeName: "edge2"
          containers:
            - image: kubeedge/sedna-example-federated-learning-surface-defect-detection-train:v0.1.0
              name: train-worker
              imagePullPolicy: IfNotPresent
              env: # user defined environments
                - name: "batch_size"
                  value: "32"
                - name: "learning_rate"
                  value: "0.001"
                - name: "epochs"
                  value: "1"
              resources: # user defined resources
                limits:
                  memory: 2Gi
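
The `stopCondition` block in the CRD above combines several metric comparisons with an "or" or "and" operator. The sketch below shows how an aggregator might evaluate such a block once it has been parsed into a dictionary. The `should_stop` function and the choice to treat a missing metric as "not met" are assumptions, not part of Sedna or Plato.

```python
import operator
from typing import Any, Dict, Mapping

# Only the two comparison operators that appear in the CRD example.
COMPARATORS = {">": operator.gt, "<": operator.lt}


def should_stop(stop_condition: Mapping[str, Any], metrics: Mapping[str, float]) -> bool:
    """Return True when the combined stop condition is satisfied."""
    results = []
    for cond in stop_condition.get("conditions", []):
        value = metrics.get(cond["metric"])
        if value is None:
            results.append(False)  # assumption: an unreported metric never triggers a stop
            continue
        compare = COMPARATORS[cond["operator"]]
        results.append(compare(value, cond["threshold"]))
    if not results:
        return False  # nothing to check, keep training
    combine = any if stop_condition.get("operator", "or") == "or" else all
    return combine(results)


stop_condition: Dict[str, Any] = {
    "operator": "or",
    "conditions": [
        {"operator": ">", "threshold": 100, "metric": "rounds"},
        {"operator": ">", "threshold": 0.95, "metric": "targetAccuracy"},
        {"operator": "<", "threshold": 0.03, "metric": "deltaLoss"},
    ],
}
metrics = {"rounds": 10, "targetAccuracy": 0.96, "deltaLoss": 0.5}

stop_with_or = should_stop(stop_condition, metrics)
stop_with_and = should_stop({**stop_condition, "operator": "and"}, metrics)
```

With "or", reaching the accuracy target alone ends the job; with "and", all three conditions would have to hold at once. The `aggregationTrigger` condition has the same shape as a single entry in `conditions` and could be checked with the same comparator table.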

[Enhancement Request] Integrate Plato into Sedna as a backend for supporting federated learning #50