microsoft / nni

An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License

ERROR (nni.runtime.msg_dispatcher_base/Thread-2) #5769

Open C-Comfundo opened 7 months ago

C-Comfundo commented 7 months ago

Describe the issue: I created the experiment with nnictl create --config xx --p xxxx. After a while I ran nnictl experiment --all to check on it and found that it had stopped. dispatcher.log shows the error below, but the corresponding process is still running on the GPU. By the way, this error did not occur the last time I used NNI, and I don't know what caused it.

Environment:

Configuration:

Log message:

Error: tuner_command_channel: Tuner closed connection
    at WebSocket.handleWsClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/core/tuner_command_channel/websocket_channel.js:83:26)
    at WebSocket.emit (node:events:538:35)
    at WebSocket.emitClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/node_modules/express-ws/node_modules/ws/lib/websocket.js:246:10)
    at Socket.socketOnClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/node_modules/express-ws/node_modules/ws/lib/websocket.js:1127:15)
    at Socket.emit (node:events:526:28)
    at TCP. (node:net:687:12)
Emitted 'error' event at:
    at WebSocketChannelImpl.handleError (/home/yiran/.local/lib/python3.8/site-packages/nni_node/core/tuner_command_channel/websocket_channel.js:135:22)
    at WebSocket.handleWsClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/core/tuner_command_channel/websocket_channel.js:83:14)
    at WebSocket.emit (node:events:538:35)
    [... lines matching original stack trace ...]
    at TCP. (node:net:687:12)
Thrown at:
    at handleWsClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/core/tuner_command_channel/websocket_channel.js:83:26)
    at emit (node:events:538:35)
    at emitClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/node_modules/express-ws/node_modules/ws/lib/websocket.js:246:10)
    at socketOnClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/node_modules/express-ws/node_modules/ws/lib/websocket.js:1127:15)
    at emit (node:events:526:28)
    at node:net:687:12
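
The Node.js trace only reports that the tuner side closed the WebSocket. The ERROR (nni.runtime.msg_dispatcher_base/Thread-2) line from the title suggests that the Python-side reason, if it was logged, sits in dispatcher.log as an ERROR record followed by a traceback. Below is a minimal sketch for pulling such records out of a log. It assumes each record starts with a header of the form "[timestamp] LEVEL (logger/thread) message", which is inferred from the title and not confirmed here. The helper name extract_error_records is hypothetical and not part of NNI.

```python
import re
from typing import Iterable, List

# Assumed record header, modelled on the issue title, e.g.
#   [2024-01-01 12:00:00] ERROR (nni.runtime.msg_dispatcher_base/Thread-2) message
# The exact layout is an assumption; adjust the pattern if the log differs.
RECORD_START = re.compile(
    r"^\[[^\]]+\]\s+(?P<level>[A-Z]+)\s+\((?P<logger>[^)]*)\)"
)


def extract_error_records(lines: Iterable[str]) -> List[str]:
    """Return every ERROR record together with its continuation lines.

    Continuation lines (for example a Python traceback) are the lines that
    follow an ERROR header and do not themselves start a new record.
    """
    records: List[str] = []
    current: List[str] = []
    collecting = False

    for raw in lines:
        line = raw.rstrip("\n")
        match = RECORD_START.match(line)
        if match:
            # A new record begins: flush the ERROR record collected so far.
            if collecting and current:
                records.append("\n".join(current))
            current = []
            collecting = match.group("level") == "ERROR"
            if collecting:
                current.append(line)
        elif collecting:
            # Non-header line directly after an ERROR header: keep it.
            current.append(line)

    # Flush a trailing ERROR record at end of input.
    if collecting and current:
        records.append("\n".join(current))
    return records
```

To use it, open the experiment's dispatcher.log (its location depends on the experiment directory, so the path is left to the reader), pass the file object to extract_error_records, and print each returned record. The last record before the experiment stopped is usually the most relevant one to attach to the report.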

How to reproduce it?: