mindspore-lab / mindcv

A toolbox of vision models and algorithms based on MindSpore
https://mindspore-lab.github.io/mindcv/
Apache License 2.0

[hrnet] [Ascend910] [GRAPH] Distributed train failed #746

Closed 787918582 closed 1 month ago

787918582 commented 9 months ago

If this is your first time, please read our contributor guidelines: https://github.com/mindspore-lab/mindcv/blob/main/CONTRIBUTING.md

Describe the bug / 问题描述 (Mandatory / 必填)
Both hrnet_w32 and hrnet_w48 report errors when running distributed training in static graph (GRAPH) mode.

To Reproduce / 重现步骤 (Mandatory / 必填) Steps to reproduce the behavior:

  1. mpirun --allow-run-as-root -n 8 python train.py --config configs/hrnet/hrnet_w32_ascend.yaml --distribute True --data_dir /ImageNet_Origin/

Expected behavior / 预期结果 (Mandatory / 必填)
Static graph distributed training runs to completion.

Screenshots / 日志 / 截图 (Mandatory / 必填)

[2023-11-19 10:29:13] mindcv.scheduler.scheduler_factory WARNING - warmup_epochs + decay_epochs > num_epochs. Please check and reduce decay_epochs!
[2023-11-19 10:29:16] mindcv.train INFO - Essential Experiment Configurations:
MindSpore mode[GRAPH(0)/PYNATIVE(1)]: 0
Distributed mode: True
Number of devices: 8
Number of training samples: 800000
Number of validation samples: None
Number of classes: 1000
Number of batches: 781
Batch size: 128
Auto augment: randaug-m7-mstd0.5
MixUp: 0.2
CutMix: 1.0
Model: hrnet_w32
Model parameters: 41303464
Number of epochs: 5
Optimizer: adamw
Learning rate: 0.001
LR Scheduler: cosine_decay
Momentum: 0.9
Weight decay: 0.05
Auto mixed precision: O2
Loss scale: 1024(fixed)
[2023-11-19 10:29:16] mindcv.train INFO - Start training
[ERROR] PIPELINE(171895,ffff914f2190,python):2023-11-19-10:29:53.881.102 [mindspore/ccsrc/pipeline/jit/ps/fallback.cc:464] GeneratePyExecuteNodeWithScriptSrc] Not found PyExecute input. script: x[i] = self.branches[i](x[i])
(the same PIPELINE error is emitted by each of the other seven ranks, PIDs 171887-171894)
[WARNING] MD(171895,fffc8ffff1e0,python):2023-11-19-10:30:19.682.318 [mindspore/ccsrc/minddata/dataset/engine/datasetops/data_queue_op.cc:1168] DetectPerBatchTime] Bad performance attention, it takes more than 25 seconds to fetch a batch of data from dataset pipeline, which might result GetNext timeout problem. You may test dataset processing performance(with creating dataset iterator) and optimize it.

Traceback (most recent call last):
  File "/data3/zl/jenkins/workspace/Kits/source_code/mindcv//train.py", line 323, in <module>
    train(args)
  File "/data3/zl/jenkins/workspace/Kits/source_code/mindcv//train.py", line 309, in train
    trainer.train(args.epoch_size, loader_train, callbacks=callbacks, dataset_sink_mode=args.dataset_sink_mode)
  File "/root/archiconda3/envs/Python380/lib/python3.8/site-packages/mindspore/train/model.py", line 1068, in train
    self._train(epoch,
  File "/root/archiconda3/envs/Python380/lib/python3.8/site-packages/mindspore/train/model.py", line 114, in wrapper
    func(self, *args, **kwargs)
  File "/root/archiconda3/envs/Python380/lib/python3.8/site-packages/mindspore/train/model.py", line 623, in _train
    self._train_dataset_sink_process(epoch, train_dataset, list_callback,
  File "/root/archiconda3/envs/Python380/lib/python3.8/site-packages/mindspore/train/model.py", line 708, in _train_dataset_sink_process
    outputs = train_network(*inputs)
  File "/root/archiconda3/envs/Python380/lib/python3.8/site-packages/mindspore/nn/cell.py", line 680, in __call__
    out = self.compile_and_run(*args, **kwargs)
  File "/root/archiconda3/envs/Python380/lib/python3.8/site-packages/mindspore/nn/cell.py", line 1020, in compile_and_run
    self.compile(*args, **kwargs)
  File "/root/archiconda3/envs/Python380/lib/python3.8/site-packages/mindspore/nn/cell.py", line 997, in compile
    _cell_graph_executor.compile(self, phase=self.phase,
  File "/root/archiconda3/envs/Python380/lib/python3.8/site-packages/mindspore/common/api.py", line 1547, in compile
    result = self._graph_executor.compile(obj, args, kwargs, phase, self._use_vm_mode())
RuntimeError: For operation 'setitem', current input arguments types are <Tuple, Number, Tensor>. The 1-th argument type 'Tuple' is not supported now.
The supported argument types of the 'setitem' operation are as follows:
List: <List, Number, Number> <List, Number, String> <List, Number, List> <List, Number, Tuple> <List, Number, Tensor> <List, Slice, Number> <List, Slice, List> <List, Slice, Tuple> <List, Slice, Tensor>
Tensor: <Tensor, None, Number> <Tensor, None, List> <Tensor, None, Tuple> <Tensor, None, Tensor> <Tensor, Ellipsis, Number> <Tensor, Ellipsis, List> <Tensor, Ellipsis, Tuple> <Tensor, Ellipsis, Tensor> <Tensor, Number, Number> <Tensor, Number, List> <Tensor, Number, Tuple> <Tensor, Number, Tensor> <Tensor, List, Number> <Tensor, List, List> <Tensor, List, Tuple> <Tensor, List, Tensor> <Tensor, Tuple, Number> <Tensor, Tuple, List> <Tensor, Tuple, Tuple> <Tensor, Tuple, Tensor> <Tensor, Slice, Number> <Tensor, Slice, List> <Tensor, Slice, Tuple> <Tensor, Slice, Tensor> <Tensor, Tensor, Number> <Tensor, Tensor, List> <Tensor, Tensor, Tuple> <Tensor, Tensor, Tensor>
Dictionary: <Dictionary, Number, Number> <Dictionary, Number, List> <Dictionary, Number, Tuple> <Dictionary, Number, Tensor> <Dictionary, Number, Dictionary> <Dictionary, String, Number> <Dictionary, String, List> <Dictionary, String, Tuple> <Dictionary, String, Tensor> <Dictionary, String, Dictionary> <Dictionary, Tuple, Number> <Dictionary, Tuple, List> <Dictionary, Tuple, Tuple> <Dictionary, Tuple, Tensor> <Dictionary, Tuple, Dictionary> <Dictionary, Tensor, Number> <Dictionary, Tensor, List> <Dictionary, Tensor, Tuple> <Dictionary, Tensor, Tensor> <Dictionary, Tensor, Dictionary>
MapTensor: <MapTensor, Tensor, Tensor>
For more details on 'setitem', please refer to https://mindspore.cn/search/en?inputValue=Index%20value%20assignment
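The RuntimeError points at a tuple item assignment (`x[i] = ...` where `x` is a Tuple) hitting graph mode's `setitem`, which per the list above accepts List but not Tuple as the container. Below is a minimal, MindSpore-free sketch of that failing pattern and the usual list-conversion workaround; `forward_branches` and the lambda `branches` are hypothetical stand-ins for the real HRNet Cells, not mindcv code:

```python
def forward_branches(branches, x):
    """Apply branches[i] to x[i], assigning results back into the container."""
    # Workaround: tuples are immutable, so `x[i] = ...` on a tuple is the
    # unsupported <Tuple, Number, ...> 'setitem' case from the error above.
    # Converting to a list first makes the item assignment legal.
    x = list(x)
    for i, branch in enumerate(branches):
        x[i] = branch(x[i])
    return x

# Hypothetical stand-ins for the per-branch sub-networks
branches = [lambda v: v + 1, lambda v: v * 2]

# Direct item assignment on a plain tuple fails at runtime in Python too:
t = (10, 20)
try:
    t[0] = 0
    mutated = True
except TypeError:  # 'tuple' object does not support item assignment
    mutated = False

print(mutated)                                 # False
print(forward_branches(branches, (10, 20)))    # [11, 40]
```

In graph mode the analogous fix is to build the outputs in a mutable list (or a fresh container) instead of writing back into the tuple the cell received.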

Additional context / 备注 (Optional / 选填)

tacyi commented 7 months ago

Reproduced this error on MindSpore 2.2.10.B180.

tacyi commented 7 months ago

Full training completed successfully on MindSpore_v2.2.10.B180.