microsoft / nni

An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression, and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License
13.88k stars 1.81k forks

ERROR (nni.runtime.command_channel.websocket.channel/MainThread) Failed to receive command. Retry in 0s #5747

Open CCJing14 opened 4 months ago

CCJing14 commented 4 months ago

Describe the issue: I get the following error in logger:

[2024-02-26 16:53:12] ERROR (nni.runtime.command_channel.websocket.channel/MainThread) Failed to receive command. Retry in 0s
Traceback (most recent call last):
  File "miniconda3/envs/yolo/lib/python3.8/site-packages/websockets/legacy/protocol.py", line 963, in transfer_data
    message = await self.read_message()
  File "miniconda3/envs/yolo/lib/python3.8/site-packages/websockets/legacy/protocol.py", line 1033, in read_message
    frame = await self.read_data_frame(max_size=self.max_size)
  File "miniconda3/envs/yolo/lib/python3.8/site-packages/websockets/legacy/protocol.py", line 1108, in read_data_frame
    frame = await self.read_frame(max_size)
  File "miniconda3/envs/yolo/lib/python3.8/site-packages/websockets/legacy/protocol.py", line 1165, in read_frame
    frame = await Frame.read(
  File "miniconda3/envs/yolo/lib/python3.8/site-packages/websockets/legacy/framing.py", line 68, in read
    data = await reader(2)
  File "/miniconda3/envs/yolo/lib/python3.8/asyncio/streams.py", line 723, in readexactly
    await self._wait_for_data('readexactly')
  File "miniconda3/envs/yolo/lib/python3.8/asyncio/streams.py", line 517, in _wait_for_data
    await self._waiter
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "miniconda3/envs/yolo/lib/python3.8/site-packages/nni/runtime/command_channel/websocket/channel.py", line 99, in _receive_command
    command = conn.receive()
  File "miniconda3/envs/yolo/lib/python3.8/site-packages/nni/runtime/command_channel/websocket/connection.py", line 103, in receive
    msg = _wait(self._ws.recv())
  File "miniconda3/envs/yolo/lib/python3.8/site-packages/nni/runtime/command_channel/websocket/connection.py", line 121, in _wait
    return future.result()
  File "miniconda3/envs/yolo/lib/python3.8/concurrent/futures/_base.py", line 444, in result
    return self.__get_result()
  File "miniconda3/envs/yolo/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "miniconda3/envs/yolo/lib/python3.8/site-packages/websockets/legacy/protocol.py", line 568, in recv
    await self.ensure_open()
  File "miniconda3/envs/yolo/lib/python3.8/site-packages/websockets/legacy/protocol.py", line 948, in ensure_open
    raise self.connection_closed_exc()
websockets.exceptions.ConnectionClosedError: sent 1011 (internal error) keepalive ping timeout; no close frame received

Environment:

Configuration:

trialGpuNumber: 1
trialConcurrency: 8
max_trial_number: 10000
tuner:
  name: TPE
  classArgs:
    optimize_mode: maximize
trainingService:
  platform: local
  useActiveGpu: True

BirdyX commented 1 month ago

Same error, and CPU load is at 100%. Any idea how to solve this?
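
For readers following along: the log line says "Retry in 0s", and a retry loop that never waits between attempts can keep a core busy, which would be consistent with the 100% CPU reported here. That link is only a guess and is not confirmed anywhere in this thread. Below is a minimal, generic sketch of retrying with a capped exponential backoff. The names `backoff_delays` and `receive_with_backoff`, and all their parameters, are hypothetical and are not part of NNI or websockets; `ConnectionError` is used purely as a stand-in exception type.

```python
import itertools
import time


def backoff_delays(base=0.5, factor=2.0, cap=30.0):
    """Yield wait times in seconds: base, base*factor, ... never above `cap`."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= factor


def receive_with_backoff(receive, max_attempts=5, sleep=time.sleep):
    """Call `receive()` until it succeeds, sleeping between failed attempts.

    `receive` is a hypothetical zero-argument callable standing in for a
    channel read. ConnectionError is a stand-in for whatever exception the
    real channel raises.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt, delay in zip(range(max_attempts), backoff_delays()):
        try:
            return receive()
        except ConnectionError as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(delay)  # wait instead of retrying immediately
    raise last_error


# The first few waits grow geometrically instead of staying at 0 seconds.
first_delays = list(itertools.islice(backoff_delays(), 4))
```

Passing `sleep` as a parameter keeps the helper easy to try out without actually waiting.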

BirdyX commented 1 month ago

In the end, I recommend installing v2.10, and remember to run pip install "typeguard<3". The old version works without any problems xd
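
A small sketch for checking whether an environment matches the versions suggested in this comment (NNI v2.10 with typeguard below 3). It only uses the standard library's `importlib.metadata`; the helper names `installed_version`, `leading_numbers`, and `matches_workaround` are made up for illustration, and the version comparison is deliberately simplistic (it only looks at the leading numeric parts).

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def leading_numbers(version):
    """Turn '2.10.1' into (2, 10, 1); stop at the first non-numeric part."""
    numbers = []
    for part in version.split("."):
        if not part.isdigit():
            break
        numbers.append(int(part))
    return tuple(numbers)


def matches_workaround(nni_version, typeguard_version):
    """True if versions look like NNI 2.10.x together with typeguard < 3."""
    nni_parts = leading_numbers(nni_version)
    typeguard_parts = leading_numbers(typeguard_version)
    return (
        nni_parts[:2] == (2, 10)
        and bool(typeguard_parts)
        and typeguard_parts[0] < 3
    )


if __name__ == "__main__":
    nni_v = installed_version("nni")
    tg_v = installed_version("typeguard")
    if nni_v is None or tg_v is None:
        print("nni or typeguard is not installed")
    else:
        print("matches workaround:", matches_workaround(nni_v, tg_v))
```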

Gokulakrishnan-DL-CV commented 1 month ago

I am facing a similar issue when doing NAS using NNI after 1 trial. Here is the output from the logger:

ERROR (Thread-5 (listen):nni.runtime.command_channel.websocket.channel) Failed to receive command. Retry in 0s
Traceback (most recent call last):
  File "d:\dev_env\Lib\site-packages\nni\runtime\command_channel\websocket\channel.py", line 99, in _receive_command
    command = conn.receive()
              ^^^^^^^^^^^^^^
  File "d:\dev_env\Lib\site-packages\nni\runtime\command_channel\websocket\connection.py", line 116, in receive
    return nni.load(msg)
           ^^^^^^^^^^^^^
  File "d:\dev_env\Lib\site-packages\nni\common\serializer.py", line 476, in load
    return json_tricks.loads(string, obj_pairs_hooks=hooks, **json_tricks_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\dev_env\Lib\site-packages\json_tricks\nonp.py", line 259, in loads
    return _strip_loads(string, hook, True, **jsonkwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\dev_env\Lib\site-packages\json_tricks\nonp.py", line 266, in _strip_loads
    return json_loads(string, object_pairs_hook=object_pairs_hook, **jsonkwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Seewise\AppData\Local\Programs\Python\Python312\Lib\json\__init__.py", line 359, in loads
    return cls(**kw).decode(s)
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Seewise\AppData\Local\Programs\Python\Python312\Lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Seewise\AppData\Local\Programs\Python\Python312\Lib\json\decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
               ^^^^^^^^^^^^^^^^^^^^^^
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 24 (char 23)

@CCJing14 @BirdyX Have you guys faced this issue?

@xuehui1991 Pls look into this!
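
For anyone trying to interpret the traceback above: "Unterminated string starting at" is what Python's `json` module reports when the input ends in the middle of a string. One way to get that is a message that was cut off before it was fully received, but that is only one possibility and nothing in this thread confirms it as the cause here. A minimal sketch with the standard library follows; the payload and the helper name `describe_decode_failure` are invented for illustration and are not NNI's actual message format.

```python
import json


def describe_decode_failure(raw):
    """Return None if `raw` is valid JSON, else a short failure description."""
    try:
        json.loads(raw)
    except json.JSONDecodeError as exc:
        return f"{exc.msg} (char {exc.pos} of {len(raw)})"
    return None


# Hypothetical payload; the real messages on the NNI channel may look different.
complete = '{"type": "example", "data": "some long value"}'

# Simulate a message cut off in the middle of the last string value.
truncated = complete[:35]

complete_result = describe_decode_failure(complete)
truncated_result = describe_decode_failure(truncated)
```

Logging the length and the tail of the raw message right before decoding would be one way to tell whether the input really was cut short.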

BirdyX commented 1 month ago

I think this is still a problem with the NNI 3.0 version. There may be bugs in how the websocket connects to the NNI 3.0 web server. I suggest going back to the NNI 2.0 version.

CCJing14 commented 4 weeks ago

I tried the nni2.5 version and got the error "RuntimeError: Builtin name is not found: TPE". Thanks! I'll try the nni2.0 version.

BirdyX commented 4 weeks ago

I am also using nni2.5 with TPE for HPO, and it runs well. You could check whether the algorithm name is misspelled or whether there is some other error in your setup.