microsoft / nni

An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License

Failed to receive command error in runtime JSONDecodeError #5684

Open msuzen opened 1 year ago

msuzen commented 1 year ago

Describe the issue:

A custom NAS job with PyTorch models raises a command error from the NNI runtime; see the message below. The job only completes if exp.config.max_trial_number is equal to exp.config.trial_concurrency.

Environment:

Error Message:

[2023-09-21 11:22:18] Waiting for models submitted to engine to finish...
[2023-09-21 11:22:35] ERROR: Failed to receive command. Retry in 0s
Traceback (most recent call last):
  File "/Users/user/.pyenv/versions/3.10.10/envs/platform/lib/python3.10/site-packages/nni/runtime/command_channel/websocket/channel.py", line 99, in _receive_command
    command = conn.receive()
  File "/Users/user/.pyenv/versions/3.10.10/envs/platform/lib/python3.10/site-packages/nni/user/command_channel/websocket/connection.py", line 116, in receive
    return nni.load(msg)
  File "/Users/user/.pyenv/versions/3.10.10/envs/platform/lib/python3.10/site-packages/nni/common/serializer.py", line 476, in load
    return json_tricks.loads(string, obj_pairs_hooks=hooks, **json_tricks_kwargs)
  File "/Users/user/.pyenv/versions/3.10.10/envs/platform/lib/python3.10/site-packages/json_tricks/nonp.py", line 259, in loads
    return _strip_loads(string, hook, True, **jsonkwargs)
  File "/Users/user/.pyenv/versions/3.10.10/envs/platform/lib/python3.10/site-packages/json_tricks/nonp.py", line 266, in _strip_loads
    return json_loads(string, object_pairs_hook=object_pairs_hook, **jsonkwargs)
  File "/Users/user/.pyenv/versions/3.10.10/lib/python3.10/json/__init__.py", line 359, in loads
    return cls(**kw).decode(s)
  File "/Users/user/.pyenv/versions/3.10.10/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Users/user/.pyenv/versions/3.10.10/lib/python3.10/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 24 (char 23)
[2023-09-21 11:22:36] Experiment is completed.
[2023-09-21 11:22:36] Search process is done. You can put an `time.sleep(FOREVER)` here to block the process if you want to continue viewing the experiment.
[2023-09-21 11:22:36] Stopping experiment, please wait...
[2023-09-21 11:22:36] Checkpoint saved to /Users/user/nni-experiments/w6mz14pl/checkpoint.
[2023-09-21 11:22:36] Experiment stopped

mo-tion commented 1 year ago

Bumping this, as I have the same issue.

ElbazHaim commented 11 months ago

Hello, I am also encountering the same issue, with the exact same error message. From looking at the logs, it looks like this happens exactly when the first trial is over.

AlondraMM commented 10 months ago

Me too, any way to fix it?

liuzhengx commented 10 months ago

Had the same issue.

jimmy133719 commented 8 months ago

Has anyone solved the problem?

z520yu commented 6 months ago

I have the same problem.

Mingbo-Lee commented 5 months ago

I have the same problem.

haoshuai-orka commented 5 months ago

I think I've found a workaround for this issue. In _nni/nni/runtime/command_channel/websocket/connection.py_, find the `receive` method of the `WsConnection` class and pass _ignore_comments=False_ to the `nni.load` call inside it.

Mingbo-Lee commented 5 months ago

> I think I've found a workaround for this issue. In _nni/nni/runtime/command_channel/websocket/connection.py_, find the `receive` method of the `WsConnection` class and pass _ignore_comments=False_ to the `nni.load` call inside it.

Thank you very much!

ranranrannervous commented 5 months ago

> I think I've found a workaround for this issue. In _nni/nni/runtime/command_channel/websocket/connection.py_, find the `receive` method of the `WsConnection` class and pass _ignore_comments=False_ to the `nni.load` call inside it.

Does it look like this?

```python
def receive(self) -> Command | None:
    """
    Return received message;
    or return None if the connection has been closed by peer.
    """
    try:
        msg = _wait(self._ws.recv())
        _logger.debug(f'Received {msg}')
    except websockets.ConnectionClosed:  # type: ignore
        _logger.debug('Connection closed by server.')
        self._ws = None
        _decrease_refcnt()
        raise

    if msg is None:
        return None
    # seems the library will inference whether it's text or binary, so we don't have guarantee
    if isinstance(msg, bytes):
        msg = msg.decode()
    return nni.load(msg, ignore_comments=False)
```

haoshuai-orka commented 5 months ago

Yes, exactly. In my case, some strings that are probably not comments were treated as comments during JSON decoding, which led to the failure. I set ignore_comments to False and then it worked.
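
To make that explanation concrete, here is a minimal sketch using only the Python standard library. It is not the json_tricks or NNI implementation: `naive_strip_comments`, `try_load`, and the sample payload are hypothetical names and data made up for illustration, and the stripper is deliberately cruder than a real one. It only shows how cutting a message at a comment marker that sits inside a string value leaves an unterminated string, the same kind of error as in the traceback.

```python
import json
import re


def naive_strip_comments(text: str) -> str:
    """Hypothetical stripper: cut every line at the first '#' or '//'.

    It ignores whether the marker sits inside a quoted string, which is
    the mistake this sketch is meant to illustrate.
    """
    stripped = [re.split(r"#|//", line, maxsplit=1)[0].rstrip()
                for line in text.splitlines()]
    return "\n".join(stripped)


def try_load(text: str, strip_comments: bool):
    """Return the decoded object, or the JSONDecodeError message on failure."""
    if strip_comments:
        text = naive_strip_comments(text)
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        return f"JSONDecodeError: {err.msg}"


# Made-up message whose string value happens to contain a comment marker.
payload = json.dumps({"type": "metric", "path": "runs/#1/log"})

print(try_load(payload, strip_comments=True))   # decoding fails on the cut-off string
print(try_load(payload, strip_comments=False))  # decoding succeeds
```

With stripping disabled the payload decodes normally, which is consistent with why passing ignore_comments=False to `nni.load` worked as a workaround in this thread.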