The deploy problem of Example Four, Federal learning surface detaction: edge keep restart

JasonNing96 commented 3 years ago

1) I have a question about the dataset deploye, It's run commend on Cloud? 2) My surface-defect-detection-train- is keeping restart and error between edge1 and edge 2. When logs the pod it shown : And docker logs shown: Other pod was working, but the tarin-work down. And the server seen running：

JasonNing96 commented 3 years ago

by the way I'm change the version of V 0.3.0 because my docker images build V0.4.0:

JasonNing96 commented 3 years ago

Here is yml I used.

kubectl create -f - <<EOF apiVersion: sedna.io/v1alpha1 kind: FederatedLearningJob metadata: name: surface-defect-detection spec: aggregationWorker: model: name: "surface-defect-detection-model" template: spec: nodeName: $CLOUD_NODE containers:

image: kubeedge/sedna-example-federated-learning-surface-defect-detection-aggregation:v0.4.0 name: agg-worker imagePullPolicy: IfNotPresent env: # user defined environments
- name: "exit_round" value: "3" resources: # user defined resources limits: memory: 2Gi trainingWorkers:
  - dataset: name: "edge1-surface-defect-detection-dataset" template: spec: nodeName: $EDGE1_NODE containers:
    - image: kubeedge/sedna-example-federated-learning-surface-defect-detection-train:v0.4.0 name: train-worker imagePullPolicy: IfNotPresent env: # user defined environments
      - name: "batch_size" value: "32"
      - name: "learning_rate" value: "0.001"
      - name: "epochs" value: "2" resources: # user defined resources limits: memory: 2Gi
  - dataset: name: "edge2-surface-defect-detection-dataset" template: spec: nodeName: $EDGE2_NODE containers:
    - image: kubeedge/sedna-example-federated-learning-surface-defect-detection-train:v0.4.0 name: train-worker imagePullPolicy: IfNotPresent env: # user defined environments
      - name: "batch_size" value: "32"
      - name: "learning_rate" value: "0.001"
      - name: "epochs" value: "2" resources: # user defined resources limits: memory: 2Gi EOF