mindspore-ai / mindspore

MindSpore is a new open source deep learning training/inference framework that could be used for mobile, edge and cloud scenarios.
https://gitee.com/mindspore/mindspore
Apache License 2.0
4.31k stars, 709 forks

data sink mode does not work #279

Closed: Enlion91 closed this issue 6 months ago

Enlion91 commented 7 months ago

Environment

Hardware Environment(Ascend/GPU/CPU):

Ascend 910A, Atlas 800-9000 server

Software Environment:

Describe the current behavior

With mindyolo, training does not start when data sink mode is enabled.

Describe the expected behavior

Training starts normally, and data sink mode speeds it up.

Steps to reproduce the issue

  1. Train with the --ms_datasink=True argument.

Related log / screenshot

(screenshot) Training hangs at this point and does not proceed.

Special notes for this issue

Enlion91 commented 7 months ago

After pressing Ctrl+C and restarting training without data sink mode, training still does not start normally, and many Python processes are left running.

Ash-Lee233 commented 7 months ago

We recommend using the default config and dataset to reproduce the accuracy and performance reported in the README. If you remove data sink mode, you may need to change the config to get the network to start training. The many Python processes come from multiprocessing in the data pipeline, which uses parallel functions; this is normal.

Enlion91 commented 6 months ago

I am indeed using the default mindyolo config and following the getting-started instructions. The INFO-level log is in the screenshot (not preserved here). Training stays at this point forever.

Enlion91 commented 6 months ago

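With help from Huawei engineers, the problem has been located: a child process does not respond to the exit signal 15 (SIGTERM), so the flow hangs. As a temporary workaround, I changed it to force the process to exit. This is a known issue, and a fix was merged in MindSpore 2.3, but that fix has no effect in my environment, so a forced kill is still required. https://gitee.com/mindspore/mindspore/pulls/66995/files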
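
For readers hitting the same hang, here is a minimal sketch of the "ask politely, then force" pattern described above. It is not the MindSpore fix from the linked pull request. It only shows, with the Python standard library on a POSIX system, how a child that ignores signal 15 can still be stopped by escalating to SIGKILL. The names STUBBORN_CHILD, stop_process and demo are hypothetical and exist only for this illustration.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for a stuck worker: it ignores SIGTERM (signal 15).
STUBBORN_CHILD = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def stop_process(proc, grace_seconds=2.0):
    """Send SIGTERM first; force-kill the process if it does not exit in time."""
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX, which a process cannot ignore
        proc.wait()
        return "killed"


def demo():
    """Start the stubborn child, then stop it and report how it ended."""
    proc = subprocess.Popen(
        [sys.executable, "-c", STUBBORN_CHILD],
        stdout=subprocess.PIPE,
        text=True,
    )
    proc.stdout.readline()  # wait until the child has installed its handler
    outcome = stop_process(proc, grace_seconds=1.0)
    proc.stdout.close()
    return outcome, proc.returncode


if __name__ == "__main__":
    print(demo())
```

A negative return code from Popen on POSIX means the child was ended by that signal number, so the second value identifies whether the forced kill was what stopped it. The grace period is an arbitrary choice for the sketch; a real data worker may need longer to shut down cleanly.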