XinYao1994 opened 3 years ago
Hi @XinYao1994, it's very nice of you to integrate Plato into the Federated Learning feature of Sedna.
However, I don't think the CRD interface should expose a raw "PlatoConfig" to Sedna users. Instead, the Plato parameters should be expanded into the Sedna CRD interface.
The following table describes the proposed expansion:
Plato | Parameters | Value | Integration Approach | Description
---|---|---|---|---
clients | type | mistnet | As a param of the training script | -
 | total_clients | 2 | Delete | In Sedna, the total number of clients is detected automatically.
 | per_round | 1 | As a param of the aggregation script | Client selection should be a function of the aggregation algorithm.
 | do_test | false | As a code snippet of the training script | -
data | datasource | data_1 | As a param of the training script | Need to unify the data format.
 | partition_size | 128 | Delete | The simulation and production functions should be decoupled; this may require a dataset management component in Sedna.
 | sampler | iid | Delete | Same as above.
trainer | type | yolov5 | As a built-in training script of Sedna | -
 | rounds | 1 | As a param of the aggregation script | exit_check() should be a function of the aggregation algorithm.
 | parallelized | false | Add a new param type to Sedna | May add a "topology" param type to Sedna.
 | max_concurrency | 3 | Add a new param type to Sedna | May add a "topology" param type to Sedna.
 | target_accuracy | 0.99 | As a param of the aggregation script | target_accuracy should be a function of the aggregation algorithm.
 | epochs | 500 | As a param of the training script | -
 | batch_size | 16 | As a param of the training script | -
 | optimizer | SGD | As a param of the training script | -
 | linear_lr | false | As a param of the training script | -
 | model_name | model_1 | As a param of the training script | Need to clarify the difference between a model in Sedna and a model in Plato.
algorithm | type | mistnet | As a param of the aggregation script | -
 | cut_layer | 4 | As a param of the aggregation script | -
 | epsilon | 100 | As a param of the aggregation script | -
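To make the expansion concrete, here is a minimal sketch of how the trainer parameters would move from a Plato configuration into the Sedna CRD, assuming the env-based parameter passing used in the CRD example later in this thread (values taken from the table above):

```yaml
# Plato configuration fragment (before expansion):
trainer:
  epochs: 500
  batch_size: 16
  optimizer: SGD

# The same values expanded into the Sedna CRD as plain parameters of
# the training worker (sketch only; see the full CRD example below):
env:
  - name: "epochs"
    value: "500"
  - name: "batch_size"
    value: "16"
  - name: "optimizer"
    value: "SGD"
```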
Hello, thanks for your suggestions! @jaypume
1. Add the Plato config into the CRD interface.
2. Classification of the Plato config parameters, with suggestions for parameters to add in Sedna:
Catalog | Name | Meaning | Plato | Sedna
---|---|---|---|---
Cluster Specification | total_num_clients | The total number of clients | clients.total_clients | automatic
Cluster Specification | maximum_num_clients | The maximum number of clients running concurrently | trainer.max_concurrency | -
Cluster Specification | device | Whether training should use multiple GPUs, if available | trainer.parallelized | needed
Data Specification | - | The training and testing dataset | data.datasource | service
Data Specification | - | Where the dataset is located | data.data_path | service
Data Specification | optional | Number of samples in each partition | data.partition_size | -
Data Specification | optional | IID or non-IID | data.sampler | -
Model Specification | overwrite | The machine learning model | trainer.model | service
Stop Condition | rounds | The maximum number of training rounds | trainer.rounds | needed
Stop Condition | target_accuracy | The target accuracy | trainer.target_accuracy | needed
Stop Condition | delta_loss | The convergence condition of the loss function | - | needed
Learning Hyperparameters | epochs | Number of epochs for local training in each communication round | trainer.epochs | supported
Learning Hyperparameters | batch_size | Batch size for local training in each communication round | trainer.batch_size | supported
Learning Hyperparameters | optimizer | Optimizer for local training in each communication round | trainer.optimizer | supported
Learning Hyperparameters | learning_rate | Learning rate for local training in each communication round | trainer.learning_rate | supported
Learning Hyperparameters | momentum | Momentum for local training in each communication round | trainer.momentum | needed
Learning Hyperparameters | weight_decay | Weight decay for local training in each communication round | trainer.weight_decay | needed
Aggregation Protocol | client_type (overwrite) | The type of the client | clients.type | needed
Aggregation Protocol | algorithm_type (overwrite) | The type of aggregation algorithm (= the type of server) | algorithm.type | needed
Aggregation Protocol | select_num_clients | The number of clients selected in each round | clients.per_round | needed
Aggregation Protocol | optional | The layers before this layer are used for extracting features | algorithm.cut_layer | needed
Aggregation Protocol | optional | Whether to apply local differential privacy | algorithm.epsilon | needed
Others | trainer_type (overwrite) | The type of the trainer, defining the training task and training logic | trainer.type | needed
Others | optional | Whether the clients should compute test accuracy locally | clients.do_test | -
3. Planned support for an early-stop condition in Plato.

Here is a CRD example based on the discussion above:
```yaml
apiVersion: sedna.io/v1alpha1
kind: FederatedLearningJob
metadata:
  name: surface-defect-detection
spec:
  stopCondition:
    operator: "or" # "or" | "and"
    conditions:
      - operator: ">"
        threshold: 100
        metric: rounds
      - operator: ">"
        threshold: 0.95
        metric: targetAccuracy
      - operator: "<"
        threshold: 0.03
        metric: deltaLoss
  transmitter:
    transmitter_list:
      - name: "simple" # simple, adaptive_freezing, adaptive_sync ...
        parameters:
          - name: "sync_frequency"
            value: "10"
      - name: "adaptive_freezing" # simple, adaptive_freezing, adaptive_sync ...
        parameters:
          - name: "sync_frequency"
            value: "10"
  aggregationTrigger:
    condition:
      operator: ">"
      threshold: 5
      metric: num_of_ready_clients
  aggregationWorker:
    model:
      name: "surface-defect-detection-model"
    template:
      spec:
        nodeName: "cloud"
        containers:
          - image: kubeedge/sedna-example-federated-learning-surface-defect-detection-aggregation:v0.1.0
            name: agg-worker
            imagePullPolicy: IfNotPresent
            env: # user-defined environment variables
              - name: "cut_layer"
                value: "4"
              - name: "epsilon"
                value: "100"
              - name: "aggregation_algorithm"
                value: "mistnet"
            resources: # user-defined resources
              limits:
                memory: 2Gi
  trainingWorkers:
    - dataset:
        name: "edge1-surface-defect-detection-dataset"
      template:
        spec:
          nodeName: "edge1"
          containers:
            - image: kubeedge/sedna-example-federated-learning-surface-defect-detection-train:v0.1.0
              name: train-worker
              imagePullPolicy: IfNotPresent
              env: # user-defined environment variables
                - name: "batch_size"
                  value: "32"
                - name: "learning_rate"
                  value: "0.001"
                - name: "epochs"
                  value: "1"
              resources: # user-defined resources
                limits:
                  memory: 2Gi
    - dataset:
        name: "edge2-surface-defect-detection-dataset"
      template:
        spec:
          nodeName: "edge2"
          containers:
            - image: kubeedge/sedna-example-federated-learning-surface-defect-detection-train:v0.1.0
              name: train-worker
              imagePullPolicy: IfNotPresent
              env: # user-defined environment variables
                - name: "batch_size"
                  value: "32"
                - name: "learning_rate"
                  value: "0.001"
                - name: "epochs"
                  value: "1"
              resources: # user-defined resources
                limits:
                  memory: 2Gi
```
What would you like to be added/modified:
Plato is a new software framework that facilitates scalable federated learning. So far, Plato already supports PyTorch and MindSpore. Its advantages are summarized as follows:
This proposal discusses how to integrate Plato into Sedna as a backend for supporting federated learning. @li-ch @baochunli @jaypume
Why is this needed:
The motivation for this proposal can be summarized as follows:
Plans:
Overview
Sedna aims to provide the following federated learning features:
Therefore, two resources are updated:
Configuration updates in aggregationWorker and trainingWorkers:
How to write Plato code in Sedna
Users only need to prepare the configuration file in public storage; the Plato code itself lives in the Sedna libraries. An example of the configuration file sdd_rcnn.yml:
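A minimal sketch of what such a configuration might contain, using only the sections and parameters discussed in the tables above (all values are illustrative placeholders, not the actual example file):

```yaml
# sdd_rcnn.yml -- illustrative sketch only, not the actual example file.
# Sections and parameter names follow the Plato configuration discussed
# above; every value here is a placeholder.
clients:
  type: mistnet
  total_clients: 2
  per_round: 1
  do_test: false

data:
  datasource: data_1   # placeholder dataset name
  partition_size: 128
  sampler: iid

trainer:
  rounds: 1
  epochs: 500
  batch_size: 16
  optimizer: SGD

algorithm:
  type: mistnet
  cut_layer: 4
  epsilon: 100
```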
How to integrate the Dataset in Plato
In this part, several functions are added to Dataset.
How to integrate the model management tools in Plato
In this part, several functions are added to Model.
To-Do Lists