kubeedge / sedna

AI toolkit over KubeEdge
https://sedna.readthedocs.io
Apache License 2.0
508 stars 166 forks

Deployment problem with Example Four, federated learning surface defect detection: edge pod keeps restarting #225

Open JasonNing96 opened 3 years ago

JasonNing96 commented 3 years ago

1) I have a question about deploying the dataset: should this command be run on the cloud node? image

2) My surface-defect-detection-train- pod keeps restarting with errors, alternating between edge1 and edge2. image

The pod logs show: image

And the docker logs show: image

The other pods are working, but the train worker is down. The server seems to be running: image

JasonNing96 commented 3 years ago

By the way, I changed the v0.3.0 version because my Docker images were built as v0.4.0:

image

JasonNing96 commented 3 years ago

Here is the YAML I used.

kubectl create -f - <<EOF
apiVersion: sedna.io/v1alpha1
kind: FederatedLearningJob
metadata:
  name: surface-defect-detection
spec:
  aggregationWorker:
    model:
      name: "surface-defect-detection-model"
    template:
      spec:
        nodeName: $CLOUD_NODE
        containers:

llhuii commented 3 years ago

@JoeyHwong-gk

llhuii commented 3 years ago

@JasonNing96 try newer version: v0.4.2

JasonNing96 commented 3 years ago

I followed the online installation page, so it should be the latest version, right? Otherwise I will try the local install. image

llhuii commented 3 years ago

I mean try the example version v0.4.2.

I just tried kubeedge/sedna-example-federated-learning-surface-defect-detection-train:v0.4.0 and it is OK, but the image kubeedge/sedna-example-federated-learning-surface-defect-detection-train:v0.4.2@sha256:47fd842ce9947 reported the following error:

[INFO][08:27:05]: Client: simple
[INFO][08:27:05]: Trainer: basic
[INFO][08:27:05]: Algorithm: fedavg
Traceback (most recent call last):
  File "train.py", line 60, in <module>
    main()
  File "train.py", line 57, in main
    fl_model.run()
AttributeError: 'FederatedLearningV2' object has no attribute 'run'

llhuii commented 3 years ago

By the way, I changed the v0.3.0 version because my Docker images were built as v0.4.0:

image

I don't think you need to build the example image yourself.

llhuii commented 3 years ago

@jaypume @XinYao1994 please take a look

jaypume commented 3 years ago

Maybe fl_model.train() should be used here instead of fl_model.run(); we will fix it ASAP.
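
To illustrate the suggested change, here is a minimal sketch of a version-tolerant entry point. It is not Sedna's actual code: FederatedLearningV2Stub and start_training are hypothetical names, and the stub only assumes that the newer job object exposes train() rather than run(), as the traceback suggests.

```python
class FederatedLearningV2Stub:
    """Hypothetical stand-in for the job object; it defines train() but no run()."""

    def train(self):
        return "training started"


def start_training(fl_model):
    """Call train() if the object has it, otherwise fall back to run()."""
    for method_name in ("train", "run"):
        method = getattr(fl_model, method_name, None)
        if callable(method):
            return method()
    # Neither entry point exists: fail with an explicit message.
    raise AttributeError(
        f"{type(fl_model).__name__!r} object has neither train() nor run()"
    )


if __name__ == "__main__":
    print(start_training(FederatedLearningV2Stub()))
```

In train.py the simpler fix would be to replace the fl_model.run() call on line 57 with fl_model.train(); the fallback above only matters if the same script must work against both the older and newer library versions.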