kubeedge / sedna

AI toolkit over KubeEdge
https://sedna.readthedocs.io
Apache License 2.0

[Enhancement Request] Integrate Plato into Sedna as a backend for supporting federated learning #50

Open XinYao1994 opened 3 years ago

XinYao1994 commented 3 years ago

What would you like to be added/modified:

Plato is a new software framework that facilitates scalable federated learning. Plato currently supports PyTorch and MindSpore. Its advantages are summarized as follows:

  1. Simplicity: Plato provides user-friendly APIs.
  2. Scalability: Plato supports running multiple (unlimited) workers, which take turns sharing one GPU.
  3. Extensibility: Plato manages various machine learning models and aggregation algorithms.
  4. Framework-agnostic: Most of Plato's codebase can be used with various machine learning libraries.
  5. Hierarchical Design: Plato supports multiple-level cells, including edge-cloud (2 levels) federated learning and device-edge-cloud (3 levels) federated learning.

This proposal discusses how to integrate Plato into Sedna as a backend for supporting federated learning. @li-ch @baochunli @jaypume

Why is this needed: The motivation for this proposal can be summarized as follows:

  1. Algorithm: Sedna (Aggregator) currently supports FedAvg. With Plato, Sedna can choose among various aggregation algorithms, such as FedAvg, Adaptive Freezing, Mistnet, and Adaptive sync.
  2. Dataset: In Sedna, user data must be prepared manually. Plato provides a "datasources" module that includes various public datasets (e.g., cifar10, cinic10, and coco). Non-IID samplers could also be supported.
  3. Model: Sedna specifies the model as a file in the image and uploads the whole model to the server. With Plato, all models can be specified through user configuration. The Report class can help a worker decide how to upload gradients for fast convergence, using strategies such as Adaptive Freezing, Nova, Sarah, Mistnet, and so on.
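For readers unfamiliar with the FedAvg baseline mentioned in item 1, the following is a minimal, framework-free sketch of the weighted-averaging step. It is not Sedna's or Plato's implementation; the `fedavg` function, the `Weights` type, and the representation of a model as a dict of flat float lists are assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: layer name -> flat list of parameters.
Weights = Dict[str, List[float]]


def fedavg(updates: List[Tuple[Weights, int]]) -> Weights:
    """Average client weights, weighting each client by its sample count."""
    total_samples = sum(num_samples for _, num_samples in updates)
    if total_samples <= 0:
        raise ValueError("at least one client must report samples")

    averaged: Weights = {}
    for weights, num_samples in updates:
        # A client's influence is proportional to how much data it trained on.
        share = num_samples / total_samples
        for layer, values in weights.items():
            acc = averaged.setdefault(layer, [0.0] * len(values))
            for i, value in enumerate(values):
                acc[i] += share * value
    return averaged


if __name__ == "__main__":
    client_a = ({"w": [1.0, 2.0]}, 1)  # trained on 1 sample
    client_b = ({"w": [3.0, 4.0]}, 3)  # trained on 3 samples
    print(fedavg([client_a, client_b]))
```

The other algorithms listed above change what is uploaded or when aggregation happens, which is why the proposal wants the aggregation step to be pluggable.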

Plans:

  1. Overview

    Sedna aims to provide the following federated learning features:

    • Easy, short configuration files in Sedna that support flexible federated learning setups.
    • Handling of real industrial datasets, and simulation of non-IID versions of public standard datasets used in academia.
    • A way to configure a customized model.

    Therefore, two resources are updated:

    • Dataset: The definition of Dataset
    • Model: The definition of model

    Configuration updates in aggregationWorker and trainingWorkers:

    apiVersion: sedna.io/v1alpha1
    kind: FederatedLearningJob
    metadata:
      name: surface-defect-detection
    spec:
      aggregationWorker:
        # read and write
        model:
          name: "surface-defect-detection-model"
        platoConfig: 
          url: "sdd_rcnn.yml" # stored in S3 or github
        template:
          spec:
            nodeName: $CLOUD_NODE
            containers:
              - image: kubeedge/sedna-example-federated-learning-surface-defect-detection-aggregation:v0.1.0
                name:  agg-worker
                imagePullPolicy: IfNotPresent
                # env: # user defined environments
                resources:  # user defined resources
                  limits:
                    memory: 2Gi
        dataset:
          name: "cloud-surface-defect-detection-dataset"
    
      trainingWorkers:
        # read only
        - model:
            name: "surface-defect-detection-model"
          dataset:
            name: "edgeX-surface-defect-detection-dataset"
          template:
            spec:
              nodeName: $EDGE1_NODE
              containers:
                - image: kubeedge/sedna-example-federated-learning-surface-defect-detection-train:v0.1.0
                  name:  train-worker
                  imagePullPolicy: IfNotPresent
                  env:  # user defined environments or given by the GlobalManager. 
                    - name: "server_ip"
                      value: "localhost"
                    - name: "server_port"
                      value: "8000"
                  resources:  # user defined resources
                    limits:
                      memory: 2Gi
  2. How to write Plato code in Sedna

    Users only need to prepare the configuration file in public storage; the Plato code itself resides in the Sedna libraries. An example configuration file, sdd_rcnn.yml:

    clients:
      # Type
      type: mistnet
      # The total number of clients
      total_clients: 1
      # The number of clients selected in each round
      per_round: 1
      # Should the clients compute test accuracy locally?
      do_test: false

    # this will be discarded in the future
    # server:
    #  address: localhost
    #  port: 8000

    data:
      datasource: sednaDataResource
      # Number of samples in each partition
      partition_size: 128
      # IID or non-IID?
      sampler: iid

    trainer:
      # The type of the trainer
      type: yolov5
      # The maximum number of training rounds
      rounds: 1
      # Whether the training should use multiple GPUs if available
      parallelized: false
      # The maximum number of clients running concurrently
      max_concurrency: 3
      # The target accuracy
      target_accuracy: 0.99
      # Number of epochs for local training in each communication round
      epochs: 500
      batch_size: 16
      optimizer: SGD
      linear_lr: false
      # The machine learning model
      model_name: sednaModelResource

    algorithm:
      # Aggregation algorithm
      type: mistnet
      cut_layer: 4
      epsilon: 100
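To make the `partition_size` and `sampler: iid` settings in sdd_rcnn.yml concrete, here is a small sketch of what an IID sampler could do: each client draws a fixed number of distinct sample indices uniformly at random from the whole dataset. This is an illustration only and not Plato's actual sampler; the `iid_partition` function and its `seed` parameter are hypothetical.

```python
import random
from typing import List


def iid_partition(dataset_size: int, partition_size: int, client_id: int,
                  seed: int = 1) -> List[int]:
    """Return `partition_size` distinct indices drawn uniformly at random.

    Every client samples from the full index range, so each partition
    follows the same distribution as the dataset (the IID assumption).
    """
    if partition_size > dataset_size:
        raise ValueError("partition_size cannot exceed dataset_size")
    # Derive a per-client generator so runs are reproducible
    # (hypothetical seeding scheme, chosen only for this sketch).
    rng = random.Random(seed * 1_000_003 + client_id)
    return rng.sample(range(dataset_size), partition_size)


if __name__ == "__main__":
    indices = iid_partition(dataset_size=50_000, partition_size=128, client_id=1)
    print(len(indices), indices[:5])
```

A non-IID sampler would instead skew the draw, for example by giving each client a different class distribution, which is what the simulation use case in the overview calls for.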
  3. How to integrate the Dataset in Plato

    In this part, several functions are added to Dataset.

      apiVersion: sedna.io/v1alpha1
      kind: Dataset
      metadata:
        name: "edge1-surface-defect-detection-dataset"
      spec:
        name: COCO
        data_params: packages/coco128.yaml
        # if download_url is None, the data should be stored in disk by default
        download_url: https://github.com/ultralytics/yolov5/releases/download/v1.0/coco128.zip 
        data_path: ./data/
        train_path: ./data/COCO/coco128/images/train2017/
        test_path: ./data/COCO/coco128/images/train2017/
        # number of classes
        num_classes: 80
        # image size
        image_size: 640
        # class names
        classes:
            [
                "person",
                "bicycle",
                ...
            ]
        # remark
        format: ""
        nodeName: $EDGE1_NODE
  4. How to integrate the Models management tools in Plato

    In this part, several functions are added to Model.

      apiVersion: sedna.io/v1alpha1
      kind: Model
      metadata:
        name: "surface-defect-detection-model"
      spec:
        model_name: vgg_16
        url: "/model"
        # ONNX (https://onnx.ai/) or specify a framework 
        format: "ckpt"
        framework: "PyTorch"
        model_config: packages/models/vgg_16.yaml
        # if true, the model needs to be loaded from url before training
        pretrained: True

    To-Do Lists

    • [ ] Enhance aggregationWorker and trainingWorkers interfaces
    • [ ] Datasets interface
    • [ ] Models management
    • [ ] Examples and demo presentation
      • [ ] CV: yolo-v5 demo in KubeEdge
      • [ ] NLP: huggingface demo in KubeEdge
jaypume commented 3 years ago

Hi @XinYao1994, it's great to see Plato being integrated into the Federated Learning feature of Sedna.

But I don't think the CRD interface should expose "PlatoConfig" to sedna user, because of:

So I think the Plato parameters should be expanded into the Sedna CRD interface.

The following table shows the proposed expansion:

| Plato section | Parameter | Value | Integration approach | Description |
|---|---|---|---|---|
| clients | type | Mistnet | As param of training script | - |
| clients | total_clients | 2 | Delete | In Sedna, the number of total_clients is auto-detected. |
| clients | client_per_round | 1 | As a param of aggregation script | Client selection should be a function of the aggregation algorithm. |
| clients | do_test | false | As code snippet of training script | - |
| data | datasource | data_1 | As param of training script | Need to unify the data format. |
| data | partition_size | 128 | Delete | Simulation and production functions should be decoupled. This may need a dataset management component in Sedna. |
| data | sampler | iid | Delete | Same as above |
| trainer | type | yolov5 | As built-in training script of Sedna | - |
| trainer | rounds | 1 | As a param of aggregation script | Exit_check() should be a function of the aggregation algorithm. |
| trainer | parallelized | false | Add new type param to Sedna | May add a param type "topology" to Sedna. |
| trainer | max_concurrency | 3 | Add new type param to Sedna | May add a param type "topology" to Sedna. |
| trainer | target_accuracy | 0.99 | As a param of aggregation script | Target_accuracy should be a function of the aggregation algorithm. |
| trainer | epochs | 500 | As param of training script | - |
| trainer | batch_size | 16 | As param of training script | - |
| trainer | optimizer | SGD | As param of training script | - |
| trainer | linear_lr | false | As param of training script | - |
| trainer | model_name | model_1 | As param of training script | Need to find the difference between a model in Sedna and a model in Plato. |
| algorithm | type | mistnet | As a param of aggregation script | - |
| algorithm | cut_layer | 4 | As a param of aggregation script | - |
| algorithm | epsilon | 100 | As a param of aggregation script | - |
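One way a training script could receive the entries marked "As param of training script" is through container environment variables, which is how the FederatedLearningJob example earlier in this issue passes `server_ip` and `server_port`. The sketch below is an assumption about how such a script might parse them; `read_hyperparameters` and the `DEFAULTS` values are hypothetical and not part of the Sedna library.

```python
import os
from typing import Any, Dict, Mapping, Optional

# Hypothetical defaults; the type of each default decides how the
# string from the environment is converted.
DEFAULTS: Dict[str, Any] = {
    "batch_size": 16,
    "learning_rate": 0.01,
    "epochs": 1,
    "optimizer": "SGD",
}


def read_hyperparameters(env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Read training hyperparameters from environment variables."""
    source = os.environ if env is None else env
    params: Dict[str, Any] = {}
    for name, default in DEFAULTS.items():
        raw = source.get(name)
        # Environment values are always strings, so cast to the default's type.
        params[name] = default if raw is None else type(default)(raw)
    return params


if __name__ == "__main__":
    # Simulates the env entries a CRD could inject into the container.
    print(read_hyperparameters({"batch_size": "32", "learning_rate": "0.001"}))
```

Boolean parameters such as `linear_lr` are left out on purpose: `bool("false")` is `True` in Python, so they would need explicit string parsing.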
XinYao1994 commented 3 years ago

Hello, thanks for your suggestions! @jaypume

1. Add Plato Config into the CRD interface.

2. Classification of Plato config parameters and suggestions for adding parameters in Sedna.

| Category | Name | Meaning | Plato | Sedna |
|---|---|---|---|---|
| Cluster Specification | total_num_clients | The total number of clients | clients.total_clients | automatic |
| Cluster Specification | maximum_num_clients | The maximum number of clients running concurrently | trainer.max_concurrency | - |
| Cluster Specification | device | Whether the training should use multiple GPUs if available | trainer.parallelized | needed |
| Data Specification | - | The training and testing dataset | data.datasource | service |
| Data Specification | - | Where the dataset is located | data.data_path | service |
| Data Specification | optimal | Number of samples in each partition | data.partition_size | - |
| Data Specification | optimal | IID or non-IID | data.sampler | - |
| Model Specification | overwrite | The machine learning model | trainer.model | service |
| Stop Condition | rounds | The maximum number of training rounds | trainer.rounds | needed |
| Stop Condition | target_accuracy | The target accuracy | trainer.target_accuracy | needed |
| Stop Condition | delta_loss | The convergence condition of the loss function | - | needed |
| Learning Hyperparameters | epochs | Number of epochs for local training in each communication round | trainer.epochs | supported |
| Learning Hyperparameters | batch_size | Batch size for local training in each communication round | trainer.batch_size | supported |
| Learning Hyperparameters | optimizer | Optimizer for local training in each communication round | trainer.optimizer | supported |
| Learning Hyperparameters | learning_rate | Learning rate for local training in each communication round | trainer.learning_rate | supported |
| Learning Hyperparameters | momentum | Momentum for local training in each communication round | trainer.momentum | needed |
| Learning Hyperparameters | weight_decay | Weight decay for local training in each communication round | trainer.weight_decay | needed |
| Aggregation Protocol | client_type (overwrite) | The type of the client | clients.type | needed |
| Aggregation Protocol | algorithm_type (overwrite) | The type of aggregation algorithm = The type of server | algorithm.type | needed |
| Aggregation Protocol | select_num_clients | The number of clients selected in each round | clients.per_round | needed |
| Aggregation Protocol | optimal | The layers before this layer are used for extracting features | algorithm.cut_layer | needed |
| Aggregation Protocol | optimal | Whether to apply local differential privacy | algorithm.epsilon | needed |
| Others | overwrite/trainer_type | The type of the trainer; defines the training task and training logic | trainer.type | needed |
| Others | optimal | Should the clients compute test accuracy locally | clients.do_test | - |

3. Planned support for early stop condition in Plato.

jaypume commented 3 years ago

Here is a CRD example based on the discussion above:

apiVersion: sedna.io/v1alpha1
kind: FederatedLearningJob
metadata:
  name: surface-defect-detection
spec:
  stopCondition:
    operator: "or" # and
    conditions:
      - operator: ">"
        threshold: 100
        metric: rounds
      - operator: ">"
        threshold: 0.95
        metric: targetAccuracy
      - operator: "<"
        threshold: 0.03
        metric: deltaLoss
  transmitter:
    transmitter_list:
      - name: "simple" # simple, adaptive_freezing, adaptive_sync ...
        parameters:
          - name: "sync_frequency"
            value: "10"
      - name: "adaptive_freezing" # simple, adaptive_freezing, adaptive_sync ...
        parameters:
          - name: "sync_frequency"
            value: "10"
  aggregationTrigger:
    condition:
      operator: ">"
      threshold: 5
      metric: num_of_ready_clients
  aggregationWorker:
    model:
      name: "surface-defect-detection-model"
    template:
      spec:
        nodeName: "cloud"
        containers:
          - image: kubeedge/sedna-example-federated-learning-surface-defect-detection-aggregation:v0.1.0
            name: agg-worker
            imagePullPolicy: IfNotPresent
            env: # user defined environments
              - name: "cut_layer"
                value: "4"
              - name: "epsilon"
                value: "100"
              - name: "aggregation_algorithm"
                value: "mistnet"
              - name: "batch_size"
            resources: # user defined resources
              limits:
                memory: 2Gi
  trainingWorkers:
    - dataset:
        name: "edge1-surface-defect-detection-dataset"
      template:
        spec:
          nodeName: "edge1"
          containers:
            - image: kubeedge/sedna-example-federated-learning-surface-defect-detection-train:v0.1.0
              name: train-worker
              imagePullPolicy: IfNotPresent
              env: # user defined environments
                - name: "batch_size"
                  value: "32"
                - name: "learning_rate"
                  value: "0.001"
                - name: "epochs"
                  value: "1"
              resources: # user defined resources
                limits:
                  memory: 2Gi
    - dataset:
        name: "edge2-surface-defect-detection-dataset"
      template:
        spec:
          nodeName: "edge2"
          containers:
            - image: kubeedge/sedna-example-federated-learning-surface-defect-detection-train:v0.1.0
              name: train-worker
              imagePullPolicy: IfNotPresent
              env: # user defined environments
                - name: "batch_size"
                  value: "32"
                - name: "learning_rate"
                  value: "0.001"
                - name: "epochs"
                  value: "1"
              resources: # user defined resources
                limits:
                  memory: 2Gi
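The `stopCondition` and `aggregationTrigger` blocks in the CRD example share the same shape (an operator, a threshold, and a metric), so an aggregator could evaluate both with one small helper. The sketch below assumes the CRD has already been parsed into plain dictionaries; the function names, the set of supported operators, and the metric names passed in are illustrative assumptions rather than an existing Sedna API.

```python
import operator
from typing import Any, Callable, Dict, Mapping

# Comparison operators a condition may name (assumed set).
COMPARATORS: Dict[str, Callable[[Any, Any], bool]] = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}


def check_condition(condition: Mapping[str, Any],
                    metrics: Mapping[str, float]) -> bool:
    """Evaluate one {operator, threshold, metric} entry against live metrics."""
    symbol = condition["operator"]
    if symbol not in COMPARATORS:
        raise ValueError(f"unsupported operator: {symbol!r}")
    return COMPARATORS[symbol](metrics[condition["metric"]], condition["threshold"])


def should_stop(stop_condition: Mapping[str, Any],
                metrics: Mapping[str, float]) -> bool:
    """Combine all conditions with the top-level "or" / "and" operator."""
    combiner = stop_condition["operator"]
    if combiner not in ("or", "and"):
        raise ValueError(f"unsupported combiner: {combiner!r}")
    results = [check_condition(c, metrics) for c in stop_condition["conditions"]]
    return any(results) if combiner == "or" else all(results)


if __name__ == "__main__":
    stop_condition = {
        "operator": "or",
        "conditions": [
            {"operator": ">", "threshold": 100, "metric": "rounds"},
            {"operator": ">", "threshold": 0.95, "metric": "targetAccuracy"},
            {"operator": "<", "threshold": 0.03, "metric": "deltaLoss"},
        ],
    }
    metrics = {"rounds": 10, "targetAccuracy": 0.96, "deltaLoss": 0.5}
    print(should_stop(stop_condition, metrics))
```

With `operator: "or"` the job stops as soon as any single condition holds; switching to `"and"` would require the round limit, the accuracy target, and the loss delta to be satisfied together.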