mindspore-ai / mindspore

MindSpore is a new open source deep learning training/inference framework that could be used for mobile, edge and cloud scenarios.
https://gitee.com/mindspore/mindspore
Apache License 2.0
4.17k stars · 688 forks

Running the official-docs LeNet demo with distributed Parameter Server training fails on CPU #113

Open zhangle-dev opened 3 years ago

zhangle-dev commented 3 years ago

Environment

Hardware Environment (Ascend/GPU/CPU):

/device cpu

Software Environment:

Describe the current behavior

I cloned the mindspore project from gitee, then tried to modify the lenet demo into Parameter Server mode following the official documentation, but it failed.

Describe the expected behavior

The demo should run normally.

Steps to reproduce the issue

  1. Download the code via git clone.
  2. Open the model_zoo/official/cv/lenet directory in PyCharm.
  3. Modify the model_zoo/official/cv/lenet/train.py file. (screenshot)
  4. Download the MNIST training data. (screenshot)
  5. Run the code; the startup parameters and environment variables are shown below. (screenshot)
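For readers without the screenshots: per the official MindSpore Parameter Server tutorial, each role (scheduler, server, worker) is launched as a separate process with role-specific environment variables. The variable names below come from that tutorial; the host, port, and process counts are example values, and the launch command mirrors the one in the log further down:

```shell
# Example environment for Parameter Server training (values are illustrative).
export MS_SERVER_NUM=1          # number of parameter server processes
export MS_WORKER_NUM=1          # number of worker processes
export MS_SCHED_HOST=127.0.0.1  # scheduler address
export MS_SCHED_PORT=8081       # scheduler port
export MS_ROLE=MS_WORKER        # one of MS_SCHED / MS_PSERVER / MS_WORKER

# Each role is then started as its own process, e.g. (not run here):
# python train.py --device_target=CPU --data_path=mnist_data --ckpt_path=checkpoint
echo "role=$MS_ROLE sched=$MS_SCHED_HOST:$MS_SCHED_PORT"
```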

Related log / screenshot

```
D:\workspace\mindspore-test\venv\Scripts\python.exe D:/workspace/github/mindspore/model_zoo/official/cv/lenet/train.py --device_target=CPU --data_path=mnist_data --ckpt_path=checkpoint
============== Starting Training ==============
Traceback (most recent call last):
  File "D:/workspace/github/mindspore/model_zoo/official/cv/lenet/train.py", line 70, in
    model.train(cfg['epoch_size'], ds_train, callbacks=[time_cb, ckpoint_cb, LossMonitor()], dataset_sink_mode=False)
  File "D:\workspace\mindspore-test\venv\lib\site-packages\mindspore\train\model.py", line 592, in train
    sink_size=sink_size)
  File "D:\workspace\mindspore-test\venv\lib\site-packages\mindspore\train\model.py", line 385, in _train
    self._train_process(epoch, train_dataset, list_callback, cb_params)
  File "D:\workspace\mindspore-test\venv\lib\site-packages\mindspore\train\model.py", line 513, in _train_process
    outputs = self._train_network(next_element)
  File "D:\workspace\mindspore-test\venv\lib\site-packages\mindspore\nn\cell.py", line 331, in call
    out = self.compile_and_run(inputs)
  File "D:\workspace\mindspore-test\venv\lib\site-packages\mindspore\nn\cell.py", line 588, in compile_and_run
    self.compile(inputs)
  File "D:\workspace\mindspore-test\venv\lib\site-packages\mindspore\nn\cell.py", line 575, in compile
    _executor.compile(self, inputs, phase=self.phase, auto_parallel_mode=self._auto_parallel_mode)
  File "D:\workspace\mindspore-test\venv\lib\site-packages\mindspore\common\api.py", line 502, in compile
    result = self._executor.compile(obj, args_list, phase, use_vm)
RuntimeError: mindspore\ccsrc\runtime\device\cpu\kernel_select_cpu.cc:299 SetKernelInfo] Operator[Push] is not support.
Trace:
In file D:\workspace\mindspore-test\venv\lib\site-packages\mindspore\nn\optim\momentum.py(36)/ success = F.depend(success, _ps_pull(_ps_push((learning_rate, gradient, momentum), shapes), weight))/
In file D:\workspace\mindspore-test\venv\lib\site-packages\mindspore\nn\optim\momentum.py(151)/ success = self.hyper_map(F.partial(_momentum_opt, self.opt, self.momentum, lr), gradients, params, moments,/
In file D:\workspace\mindspore-test\venv\lib\site-packages\mindspore\nn\wrap\cell_wrapper.py(251)/ return F.depend(loss, self.optimizer(grads))/
```

```
WARNING: Logging before InitGoogleLogging() is written to STDERR
[WARNING] KERNEL(17656,?):2021-2-5 10:12:20 [mindspore\ccsrc\backend\kernel_compiler\cpu\cpu_kernel_factory.cc:92] GetSupportedKernelAttrList] Not registered CPU kernel: op[Push]!
[ERROR] DEVICE(17656,?):2021-2-5 10:12:20 [mindspore\ccsrc\runtime\device\cpu\kernel_select_cpu.cc:299] SetKernelInfo] Operator[Push] is not support.
Trace:
In file D:\workspace\mindspore-test\venv\lib\site-packages\mindspore\nn\optim\momentum.py(36)/ success = F.depend(success, _ps_pull(_ps_push((learning_rate, gradient, momentum), shapes), weight))/
In file D:\workspace\mindspore-test\venv\lib\site-packages\mindspore\nn\optim\momentum.py(151)/ success = self.hyper_map(F.partial(_momentum_opt, self.opt, self.momentum, lr), gradients, params, moments,/
In file D:\workspace\mindspore-test\venv\lib\site-packages\mindspore\nn\wrap\cell_wrapper.py(251)/ return F.depend(loss, self.optimizer(grads))/
```

Special notes for this issue

yuyicg commented 3 years ago

CPU support for parameter server training is still under development; at present, only GPU and Ascend devices are supported. It will be released in a subsequent version. Thank you for your feedback.
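This also explains the error above: the optimizer's `_ps_push`/`_ps_pull` path needs a `Push` kernel that is not registered for the CPU backend. Until CPU support lands, a small defensive check in one's own training script (a hypothetical helper sketched here, not part of MindSpore) fails fast with a readable message instead of the opaque kernel-selection `RuntimeError`:

```python
# Devices that support Parameter Server training, per the maintainer's reply above.
SUPPORTED_PS_DEVICES = {"GPU", "Ascend"}


def check_ps_device(device_target: str, enable_ps: bool) -> None:
    """Raise a clear error before compilation instead of hitting the
    'Operator[Push] is not support' RuntimeError at kernel selection."""
    if enable_ps and device_target not in SUPPORTED_PS_DEVICES:
        raise ValueError(
            f"Parameter Server training is not supported on {device_target}; "
            f"use one of {sorted(SUPPORTED_PS_DEVICES)}"
        )


# No exception: PS mode on a supported device, or PS mode disabled.
check_ps_device("GPU", enable_ps=True)
check_ps_device("CPU", enable_ps=False)
```

Calling it early in `train.py` (before building the `Model`) would have surfaced the limitation directly when running with `--device_target=CPU`.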