kubeedge / sedna

AI toolkit over KubeEdge
https://sedna.readthedocs.io
Apache License 2.0

Error in federated learning example #241

Open skrlin opened 2 years ago

skrlin commented 2 years ago

What happened: When I followed https://github.com/kubeedge/sedna/tree/main/examples/federated_learning/yolov5_coco128_mistnet, an error occurred while deploying the MistNet federated learning sample.

# kubectl logs yolo-v5-train-cnq8h
[2021-11-18 06:26:58,662] aggregation.py(294) [INFO] - /home/data/pretrained/
[2021-11-18 06:26:58,666] aggregation.py(314) [INFO] - address 0.0.0.0, port 7363
[INFO][06:26:58]: Server: mistnet
[INFO][06:26:58]: [Server #7] Started training on 1 clients with 1 per round.
[INFO][06:26:58]: [Server #7] Configuring the server...
[INFO][06:26:58]: Training: 1 rounds or 99.0% accuracy

[INFO][06:26:58]: Trainer: yolov5
[INFO][06:26:59]: Generating new fontManager, this may take some time...
[INFO][06:27:02]: 
                 from  n    params  module                                  arguments                     
[INFO][06:27:02]:   0                -1  1      3520  yolov5.models.common.Focus              [3, 32, 3]                    
[INFO][06:27:02]:   1                -1  1     18560  yolov5.models.common.Conv               [32, 64, 3, 2]                
[INFO][06:27:02]:   2                -1  1     18816  yolov5.models.common.C3                 [64, 64, 1]                   
[INFO][06:27:02]:   3                -1  1     73984  yolov5.models.common.Conv               [64, 128, 3, 2]               
[INFO][06:27:02]:   4                -1  1    156928  yolov5.models.common.C3                 [128, 128, 3]                 
[INFO][06:27:02]:   5                -1  1    295424  yolov5.models.common.Conv               [128, 256, 3, 2]              
[INFO][06:27:02]:   6                -1  1    625152  yolov5.models.common.C3                 [256, 256, 3]                 
[INFO][06:27:02]:   7                -1  1   1180672  yolov5.models.common.Conv               [256, 512, 3, 2]              
[INFO][06:27:02]:   8                -1  1    656896  yolov5.models.common.SPP                [512, 512, [5, 9, 13]]        
[INFO][06:27:02]:   9                -1  1   1182720  yolov5.models.common.C3                 [512, 512, 1, False]          
[INFO][06:27:02]:  10                -1  1    131584  yolov5.models.common.Conv               [512, 256, 1, 1]              
[INFO][06:27:02]:  11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
[INFO][06:27:02]:  12           [-1, 6]  1         0  yolov5.models.common.Concat             [1]                           
[INFO][06:27:02]:  13                -1  1    361984  yolov5.models.common.C3                 [512, 256, 1, False]          
[INFO][06:27:02]:  14                -1  1     33024  yolov5.models.common.Conv               [256, 128, 1, 1]              
[INFO][06:27:02]:  15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
[INFO][06:27:02]:  16           [-1, 4]  1         0  yolov5.models.common.Concat             [1]                           
[INFO][06:27:02]:  17                -1  1     90880  yolov5.models.common.C3                 [256, 128, 1, False]          
[INFO][06:27:02]:  18                -1  1    147712  yolov5.models.common.Conv               [128, 128, 3, 2]              
[INFO][06:27:02]:  19          [-1, 14]  1         0  yolov5.models.common.Concat             [1]                           
[INFO][06:27:02]:  20                -1  1    296448  yolov5.models.common.C3                 [256, 256, 1, False]          
[INFO][06:27:02]:  21                -1  1    590336  yolov5.models.common.Conv               [256, 256, 3, 2]              
[INFO][06:27:02]:  22          [-1, 10]  1         0  yolov5.models.common.Concat             [1]                           
[INFO][06:27:02]:  23                -1  1   1182720  yolov5.models.common.C3                 [512, 512, 1, False]          
[INFO][06:27:02]:  24      [17, 20, 23]  1    229245  yolov5.models.yolo.Detect               [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
[W NNPACK.cpp:79] Could not initialize NNPACK! Reason: Unsupported hardware.
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
[INFO][06:27:04]: Model Summary: 283 layers, 7276605 parameters, 7276605 gradients, 17.1 GFLOPs
[INFO][06:27:04]: 
[INFO][06:27:04]: Algorithm: mistnet
[INFO][06:27:04]: [Server #7] Loading a pre-trained model.
[INFO][06:27:04]: [Server #7] Loading a model from ./models/pretrained/yolov5.pth.
Traceback (most recent call last):
  File "aggregate.py", line 37, in <module>
    run_server()
  File "aggregate.py", line 33, in run_server
    server.start()
  File "/home/lib/sedna/service/server/aggregation.py", line 324, in start
    self.server.run()
  File "/home/plato/plato/servers/base.py", line 87, in run
    self.configure()
  File "/home/plato/plato/servers/fedavg.py", line 72, in configure
    self.load_trainer()
  File "/home/plato/plato/servers/mistnet.py", line 30, in load_trainer
    self.trainer.load_model()
  File "/home/plato/plato/trainers/basic.py", line 86, in load_model
    self.model.load_state_dict(torch.load(model_path))
  File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 594, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: './models/pretrained/yolov5.pth'

I have created the /model and /pretrained directories at the locations specified on each node, as described in the tutorial. I traced the error to /home/plato/plato/config.py:123:

# Pretrained models
            Config.params['model_dir'] = "./models/pretrained/"
            Config.params['pretrained_model_dir'] = "./models/pretrained/"

I don't know why this happens. Do I need to change it to the correct path of the pretrained model and rebuild the image?
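Before rebuilding anything, it may help to confirm where the file actually lives inside the container. The following is only a minimal sketch: the helper name `find_pretrained_model` is hypothetical, and the two candidate directories are simply the ones that appear in the log and in config.py above. Whether either is the right location for a given deployment is an assumption.

```python
import os


def find_pretrained_model(filename, candidate_dirs):
    """Return the first existing path to `filename` in `candidate_dirs`, or None."""
    for directory in candidate_dirs:
        path = os.path.join(directory, filename)
        if os.path.isfile(path):
            return path
    return None


# Directories taken from the log output and config.py quoted above (assumed,
# not verified against any particular image).
candidates = ["./models/pretrained/", "/home/data/pretrained/"]

model_path = find_pretrained_model("yolov5.pth", candidates)
if model_path is None:
    print("yolov5.pth not found in any of:", candidates)
else:
    print("found pretrained model at", model_path)
```

Run from the aggregator's working directory, this shows whether the relative path in config.py resolves to the mounted directory or to somewhere else.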

Docker image information:

kubeedge/sedna-example-federated-learning-mistnet-yolo-client       v0.4.0     70fcd2fc71e2   2 months ago    4.95GB
kubeedge/sedna-example-federated-learning-mistnet-yolo-aggregator   v0.4.0     fd0a0512f024   2 months ago    4.95GB

Environment:

Sedna Version

```console
$ kubectl get -n sedna deploy gm -o jsonpath='{.spec.template.spec.containers[0].image}'
# kubeedge/sedna-gm:v0.4.3
$ kubectl get -n sedna ds lc -o jsonpath='{.spec.template.spec.containers[0].image}'
# kubeedge/sedna-lc:v0.4.3
```

Kubernetes Version

```console
$ kubectl version
# Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.0", GitCommit:"af46c47ce925f4c4ad5cc8d1fca46c7b77d13b38", GitTreeState:"clean", BuildDate:"2020-12-08T17:59:43Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.12", GitCommit:"4bf2e32bb2b9fdeea19ff7cdc1fb51fb295ec407", GitTreeState:"clean", BuildDate:"2021-10-27T17:07:18Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
```

KubeEdge Version

```console
$ cloudcore --version
# KubeEdge v1.8.2
$ edgecore --version
# KubeEdge v1.8.2
```

CloudSide Environment:

Hardware configuration

```console
$ lscpu
# Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Model name: Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
Stepping: 7
CPU MHz: 2299.795
CPU max MHz: 2500.0000
CPU min MHz: 1200.0000
BogoMIPS: 3999.64
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 15360K
NUMA node0 CPU(s): 0-5,12-17
NUMA node1 CPU(s): 6-11,18-23
```

OS

```console
$ cat /etc/os-release
# NAME="Ubuntu"
VERSION="18.04.6 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.6 LTS"
VERSION_ID="18.04"
```

Kernel

```console
$ uname -a
# Linux node01 5.4.0-84-generic #94~18.04.1-Ubuntu SMP Thu Aug 26 23:17:46 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
```

JoeyHwong-gk commented 2 years ago

PTAL @jaypume @XinYao1994

XinYao1994 commented 2 years ago

@skrlin @JoeyHwong-gk It is hard to understand why your image was built 2 months ago. Have you made sure that the image was updated successfully?

llhuii commented 2 years ago

> @skrlin @JoeyHwong-gk It is hard to understand why your image was built 2 months ago. Have you made sure that the image was updated successfully?

I think the reason is that @skrlin used v0.4.0, which has a bug. I suggest trying the latest version (i.e. v0.4.3).
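A quick way to spot this kind of mismatch is to compare the tags of the example images with the tags of the gm and lc images reported in the issue. The sketch below is illustrative only: `image_tag` is a hypothetical helper, and the idea that example images should carry the same tag as the control-plane images is an assumption, not something the Sedna documentation is quoted as requiring here.

```python
def image_tag(image_ref):
    """Return the tag of a 'repo:tag' image reference, or None if it has no tag."""
    _name, sep, tag = image_ref.rpartition(":")
    # A ':' followed by a '/' belongs to a registry host:port, not to a tag.
    if not sep or "/" in tag:
        return None
    return tag


# Image references copied from the issue report above.
control_plane = ["kubeedge/sedna-gm:v0.4.3", "kubeedge/sedna-lc:v0.4.3"]
examples = [
    "kubeedge/sedna-example-federated-learning-mistnet-yolo-client:v0.4.0",
    "kubeedge/sedna-example-federated-learning-mistnet-yolo-aggregator:v0.4.0",
]

expected_tags = {image_tag(ref) for ref in control_plane}
stale = [ref for ref in examples if image_tag(ref) not in expected_tags]
for ref in stale:
    print("tag differs from gm/lc:", ref)
```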

llhuii commented 2 years ago

@XinYao1994 could you help update the version in the federated learning example YAML?

skrlin commented 2 years ago

@JoeyHwong-gk I didn't update the image; I just pulled the v0.4.0 image from the repository as the tutorial describes.

XinYao1994 commented 2 years ago

@llhuii @skrlin @jaypume We plan to add a tutorial demo soon. Hope that helps. :) The federated learning example YAML will be updated before we release that demo.

skrlin commented 2 years ago

@llhuii OK, thank you very much for your answer

skrlin commented 2 years ago

@XinYao1994 OK, thank you very much for your answer

Poorunga commented 2 years ago

> @skrlin @JoeyHwong-gk It is hard to understand why your image was built 2 months ago. Have you made sure that the image was updated successfully?
>
> I think the reason is that @skrlin used v0.4.0, which has a bug. I suggest trying the latest version (i.e. v0.4.3).

I ran into this problem too, and I am using v0.4.3. The log is as follows:

[INFO][02:29:18]: New cache created: data/COCO/coco128/labels/train2017.cache
[INFO][02:29:18]: No clients are launched (server:disable_clients = true)
[INFO][02:29:18]: Starting a server at address 0.0.0.0 and port 7363.
[INFO][02:29:32]: 192.168.0.71 [23/Nov/2021:02:29:32 +0000] "GET /socket.io/?transport=polling&EIO=4&t=1637634572.613998 HTTP/1.1" 200 292 "-" "Python/3.6 aiohttp/3.8.0"
[INFO][02:29:32]: 192.168.0.71 [23/Nov/2021:02:29:32 +0000] "GET /socket.io/?transport=polling&EIO=4&t=1637634572.612923 HTTP/1.1" 200 292 "-" "Python/3.6 aiohttp/3.8.0"
[INFO][02:29:32]: [Server #6] A new client just connected.
[INFO][02:29:32]: [Server #6] A new client just connected.
[INFO][02:29:32]: [Server #6] New client with id #2 arrived.
[INFO][02:29:32]: [Server #6] Starting training.
[INFO][02:29:32]: 
[Server #6] Starting round 1/1.
[INFO][02:29:32]: [Server #6] Selecting client #2 for training.
[INFO][02:29:32]: [Server #6] Sending the current model to client #2.
[INFO][02:29:32]: [Server #6] New client with id #1 arrived.
[INFO][02:29:37]: [Server #6] Sent 27.96 MB of payload data to client #2.
[INFO][02:31:31]: [Server #6] Received 400.11 MB of payload data from client #2.
[INFO][02:31:31]: [Server #6] All 1 client reports received. Processing.
[ERROR][02:31:31]: Task exception was never retrieved
future: <Task finished coro=<AsyncServer._handle_event_internal() done, defined at /usr/local/lib/python3.6/site-packages/socketio/asyncio_server.py:502> exception=AttributeError("'list' object has no attribute 'num_train_examples'",)>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/socketio/asyncio_server.py", line 504, in _handle_event_internal
    r = await server._trigger_event(data[0], namespace, sid, *data[1:])
  File "/usr/local/lib/python3.6/site-packages/socketio/asyncio_server.py", line 547, in _trigger_event
    event, *args)
  File "/usr/local/lib/python3.6/site-packages/socketio/asyncio_namespace.py", line 37, in trigger_event
    ret = await handler(*args)
  File "/home/plato/plato/servers/base.py", line 59, in on_client_payload_done
    data['obkey'])
  File "/home/plato/plato/servers/base.py", line 446, in client_payload_done
    await self.process_reports()
  File "/home/plato/plato/servers/mistnet.py", line 40, in process_reports
    sampler = all_inclusive.Sampler(feature_dataset)
  File "/home/plato/plato/samplers/all_inclusive.py", line 18, in __init__
    self.all_inclusive = range(dataset.num_train_examples())
AttributeError: 'list' object has no attribute 'num_train_examples'
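
The traceback indicates that the sampler calls `num_train_examples()` on whatever it is given, while `feature_dataset` was a plain Python list at that point. The sketch below reproduces that failure mode with stand-in classes. `AllInclusiveSampler` only mirrors the single line shown in the traceback, and `ListBackedDataset` is a hypothetical adapter. Neither is Plato's or Sedna's actual code, and this is not the upstream fix.

```python
class AllInclusiveSampler:
    """Stand-in that mirrors the one sampler line visible in the traceback."""

    def __init__(self, dataset):
        self.all_inclusive = range(dataset.num_train_examples())


class ListBackedDataset:
    """Hypothetical adapter exposing the method the sampler expects."""

    def __init__(self, examples):
        self.examples = list(examples)

    def num_train_examples(self):
        return len(self.examples)


# Made-up placeholder features, only to exercise the two code paths.
features = [("feature_0", "label_0"), ("feature_1", "label_1")]

try:
    AllInclusiveSampler(features)  # a bare list has no num_train_examples()
except AttributeError as err:
    print("plain list fails:", err)

sampler = AllInclusiveSampler(ListBackedDataset(features))
print("wrapped list works, examples:", len(sampler.all_inclusive))
```
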

llhuii commented 2 years ago

@XinYao1994 please take a look.

XinYao1994 commented 2 years ago

@Poorunga @llhuii Please make sure you are using the latest version, because this has already been fixed here