microsoft / nni

An open source AutoML toolkit to automate the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License
14.07k stars 1.82k forks

Error: tuner_command_channel: Tuner loses responsive #5464

Open beta1scat opened 1 year ago

beta1scat commented 1 year ago

Describe the issue: Unable to operate stably.

Environment:

Configuration:

Log message:

 - dispatcher.log:

[2023-03-21 10:08:52] INFO (nni.tuner.tpe/MainThread) Using random seed 1596889983
[2023-03-21 10:08:52] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher started
[2023-03-21 10:21:17] WARNING (nni.runtime.tuner_command_channel.channel/MainThread) Exception on receiving: ConnectionClosedError(None, None, None)
[2023-03-21 10:21:17] WARNING (nni.runtime.tuner_command_channel.channel/MainThread) Connection lost. Trying to reconnect...
[2023-03-21 10:21:17] INFO (nni.runtime.tuner_command_channel.channel/MainThread) Attempt #0, wait 0 seconds...
[2023-03-21 10:21:17] INFO (nni.runtime.msg_dispatcher_base/MainThread) Report error to NNI manager: Traceback (most recent call last):
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/websockets/legacy/client.py", line 138, in read_http_response
    status_code, reason, headers = await read_response(self.reader)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/websockets/legacy/http.py", line 120, in read_response
    status_line = await read_line(stream)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/websockets/legacy/http.py", line 194, in read_line
    line = await stream.readline()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/asyncio/streams.py", line 524, in readline
    line = await self.readuntil(sep)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/asyncio/streams.py", line 616, in readuntil
    await self._wait_for_data('readuntil')
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/asyncio/streams.py", line 501, in _wait_for_data
    await self._waiter
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/asyncio/selector_events.py", line 862, in _read_ready__data_received
    data = self._sock.recv(self.max_size)
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/__main__.py", line 61, in main
    dispatcher.run()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/runtime/msg_dispatcher_base.py", line 69, in run
    command, data = self._channel._receive()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/runtime/tuner_command_channel/channel.py", line 94, in _receive
    command = self._retry_receive()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/runtime/tuner_command_channel/channel.py", line 104, in _retry_receive
    self._channel.connect()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/runtime/tuner_command_channel/websocket.py", line 62, in connect
    self._ws = _wait(_connect_async(self._url))
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/runtime/tuner_command_channel/websocket.py", line 111, in _wait
    return future.result()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/runtime/tuner_command_channel/websocket.py", line 125, in _connect_async
    return await websockets.connect(url, max_size=None)  # type: ignore
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/websockets/legacy/client.py", line 659, in await_impl_timeout
    return await asyncio.wait_for(self.await_impl(), self.open_timeout)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/websockets/legacy/client.py", line 666, in await_impl
    await protocol.handshake(
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/websockets/legacy/client.py", line 326, in handshake
    status_code, response_headers = await self.read_http_response()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/websockets/legacy/client.py", line 144, in read_http_response
    raise InvalidMessage("did not receive a valid HTTP response") from exc
websockets.exceptions.InvalidMessage: did not receive a valid HTTP response

[2023-03-21 10:21:17] WARNING (nni.runtime.tuner_command_channel.channel/MainThread) Exception on sending: AttributeError("'NoneType' object has no attribute 'send'")
[2023-03-21 10:21:17] ERROR (nni.runtime.tuner_command_channel.channel/MainThread) 'NoneType' object has no attribute 'send'
Traceback (most recent call last):
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/websockets/legacy/client.py", line 138, in read_http_response
    status_code, reason, headers = await read_response(self.reader)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/websockets/legacy/http.py", line 120, in read_response
    status_line = await read_line(stream)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/websockets/legacy/http.py", line 194, in read_line
    line = await stream.readline()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/asyncio/streams.py", line 524, in readline
    line = await self.readuntil(sep)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/asyncio/streams.py", line 616, in readuntil
    await self._wait_for_data('readuntil')
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/asyncio/streams.py", line 501, in _wait_for_data
    await self._waiter
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/asyncio/selector_events.py", line 862, in _read_ready__data_received
    data = self._sock.recv(self.max_size)
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/__main__.py", line 61, in main
    dispatcher.run()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/runtime/msg_dispatcher_base.py", line 69, in run
    command, data = self._channel._receive()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/runtime/tuner_command_channel/channel.py", line 94, in _receive
    command = self._retry_receive()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/runtime/tuner_command_channel/channel.py", line 104, in _retry_receive
    self._channel.connect()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/runtime/tuner_command_channel/websocket.py", line 62, in connect
    self._ws = _wait(_connect_async(self._url))
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/runtime/tuner_command_channel/websocket.py", line 111, in _wait
    return future.result()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/runtime/tuner_command_channel/websocket.py", line 125, in _connect_async
    return await websockets.connect(url, max_size=None)  # type: ignore
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/websockets/legacy/client.py", line 659, in await_impl_timeout
    return await asyncio.wait_for(self.await_impl(), self.open_timeout)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/websockets/legacy/client.py", line 666, in await_impl
    await protocol.handshake(
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/websockets/legacy/client.py", line 326, in handshake
    status_code, response_headers = await self.read_http_response()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/websockets/legacy/client.py", line 144, in read_http_response
    raise InvalidMessage("did not receive a valid HTTP response") from exc
websockets.exceptions.InvalidMessage: did not receive a valid HTTP response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/runtime/tuner_command_channel/channel.py", line 62, in _send
    self._channel.send(command)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/runtime/tuner_command_channel/websocket.py", line 81, in send
    _wait(self._ws.send(message))
AttributeError: 'NoneType' object has no attribute 'send'
[2023-03-21 10:21:17] WARNING (nni.runtime.tuner_command_channel.channel/MainThread) Connection lost. Trying to reconnect...
[2023-03-21 10:21:17] INFO (nni.runtime.tuner_command_channel.channel/MainThread) Attempt #0, wait 0 seconds...
[2023-03-21 10:21:17] ERROR (nni.runtime.msg_dispatcher_base/MainThread) Connection to NNI manager is broken. Failed to report error.
[2023-03-21 10:21:17] ERROR (nni.main/MainThread) did not receive a valid HTTP response
Traceback (most recent call last):
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/websockets/legacy/client.py", line 138, in read_http_response
    status_code, reason, headers = await read_response(self.reader)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/websockets/legacy/http.py", line 120, in read_response
    status_line = await read_line(stream)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/websockets/legacy/http.py", line 194, in read_line
    line = await stream.readline()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/asyncio/streams.py", line 524, in readline
    line = await self.readuntil(sep)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/asyncio/streams.py", line 616, in readuntil
    await self._wait_for_data('readuntil')
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/asyncio/streams.py", line 501, in _wait_for_data
    await self._waiter
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/asyncio/selector_events.py", line 862, in _read_ready__data_received
    data = self._sock.recv(self.max_size)
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/__main__.py", line 85, in <module>
    main()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/__main__.py", line 61, in main
    dispatcher.run()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/runtime/msg_dispatcher_base.py", line 69, in run
    command, data = self._channel._receive()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/runtime/tuner_command_channel/channel.py", line 94, in _receive
    command = self._retry_receive()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/runtime/tuner_command_channel/channel.py", line 104, in _retry_receive
    self._channel.connect()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/runtime/tuner_command_channel/websocket.py", line 62, in connect
    self._ws = _wait(_connect_async(self._url))
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/runtime/tuner_command_channel/websocket.py", line 111, in _wait
    return future.result()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/runtime/tuner_command_channel/websocket.py", line 125, in _connect_async
    return await websockets.connect(url, max_size=None)  # type: ignore
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/websockets/legacy/client.py", line 659, in await_impl_timeout
    return await asyncio.wait_for(self.await_impl(), self.open_timeout)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/websockets/legacy/client.py", line 666, in await_impl
    await protocol.handshake(
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/websockets/legacy/client.py", line 326, in handshake
    status_code, response_headers = await self.read_http_response()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/websockets/legacy/client.py", line 144, in read_http_response
    raise InvalidMessage("did not receive a valid HTTP response") from exc
websockets.exceptions.InvalidMessage: did not receive a valid HTTP response
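
Two patterns in the dispatcher log can be reproduced in isolation: a low-level `ConnectionResetError` re-raised as a higher-level error (the "direct cause" chain), and the later `AttributeError` raised because a send was attempted while no socket object existed. The following is a minimal sketch, not NNI's code. `FakeChannel`, `_FakeSocket`, `send_unguarded` and `send_guarded` are hypothetical names, `RuntimeError` stands in for `InvalidMessage`, and the sketch assumes the wrapper holds `None` in `_ws` while disconnected.

```python
class _FakeSocket:
    """Hypothetical socket stand-in that records what was sent."""

    def __init__(self) -> None:
        self.sent = []

    def send(self, message: str) -> None:
        self.sent.append(message)


class FakeChannel:
    """Hypothetical channel wrapper; an illustration, not NNI's class."""

    def __init__(self, connect_ok: bool) -> None:
        self._connect_ok = connect_ok
        self._ws = None  # assumption: None means "not connected"

    def connect(self) -> None:
        try:
            if not self._connect_ok:
                # Simulate the peer resetting the connection mid-handshake.
                raise ConnectionResetError(104, "Connection reset by peer")
        except ConnectionResetError as exc:
            # "raise ... from exc" is what prints
            # "The above exception was the direct cause of ...".
            raise RuntimeError("did not receive a valid HTTP response") from exc
        self._ws = _FakeSocket()

    def send_unguarded(self, message: str) -> None:
        # Fails with AttributeError when connect() never succeeded.
        self._ws.send(message)

    def send_guarded(self, message: str) -> None:
        # Report the real problem instead of a confusing AttributeError.
        if self._ws is None:
            raise ConnectionError("channel is not connected")
        self._ws.send(message)
```

With `FakeChannel(connect_ok=False)`, `connect()` raises the chained error and a later `send_unguarded("x")` fails with the same `'NoneType' object has no attribute 'send'` message seen in the log, while `send_guarded("x")` raises a `ConnectionError` that names the actual condition.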

 - nnictl stdout and stderr:

[2023-03-21 09:47:28] Creating experiment, Experiment ID: adapter_plate_square_TPE_quniform
[2023-03-21 09:47:28] Starting web server...
[2023-03-21 09:47:29] WARNING: Timeout, retry...
[2023-03-21 09:47:30] Setting up...
[2023-03-21 09:47:30] Web portal URLs: http://127.0.0.1:58000 http://10.62.137.83:58000 http://198.18.0.1:58000
node:events:504
      throw er; // Unhandled 'error' event
      ^

Error: tuner_command_channel: Tuner loses responsive
    at WebSocketChannelImpl.heartbeat (/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni_node/core/tuner_command_channel/websocket_channel.js:119:30)
    at listOnTimeout (node:internal/timers:559:17)
    at processTimers (node:internal/timers:502:7)
Emitted 'error' event at:
    at WebSocketChannelImpl.handleError (/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni_node/core/tuner_command_channel/websocket_channel.js:135:22)
    at WebSocketChannelImpl.heartbeat (/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni_node/core/tuner_command_channel/websocket_channel.js:119:18)
    at listOnTimeout (node:internal/timers:559:17)
    at processTimers (node:internal/timers:502:7)
Thrown at:
    at heartbeat (/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni_node/core/tuner_command_channel/websocket_channel.js:119:30)
    at listOnTimeout (node:internal/timers:559:17)
    at processTimers (node:internal/timers:502:7)
Traceback (most recent call last):
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/websockets/legacy/client.py", line 138, in read_http_response
    status_code, reason, headers = await read_response(self.reader)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/websockets/legacy/http.py", line 120, in read_response
    status_line = await read_line(stream)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/websockets/legacy/http.py", line 194, in read_line
    line = await stream.readline()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/asyncio/streams.py", line 524, in readline
    line = await self.readuntil(sep)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/asyncio/streams.py", line 616, in readuntil
    await self._wait_for_data('readuntil')
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/asyncio/streams.py", line 501, in _wait_for_data
    await self._waiter
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/asyncio/selector_events.py", line 862, in _read_ready__data_received
    data = self._sock.recv(self.max_size)
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/__main__.py", line 85, in <module>
    main()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/__main__.py", line 61, in main
    dispatcher.run()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/runtime/msg_dispatcher_base.py", line 69, in run
    command, data = self._channel._receive()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/runtime/tuner_command_channel/channel.py", line 94, in _receive
    command = self._retry_receive()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/runtime/tuner_command_channel/channel.py", line 104, in _retry_receive
    self._channel.connect()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/runtime/tuner_command_channel/websocket.py", line 62, in connect
    self._ws = _wait(_connect_async(self._url))
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/runtime/tuner_command_channel/websocket.py", line 111, in _wait
    return future.result()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/runtime/tuner_command_channel/websocket.py", line 125, in _connect_async
    return await websockets.connect(url, max_size=None)  # type: ignore
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/websockets/legacy/client.py", line 659, in await_impl_timeout
    return await asyncio.wait_for(self.await_impl(), self.open_timeout)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/websockets/legacy/client.py", line 666, in await_impl
    await protocol.handshake(
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/websockets/legacy/client.py", line 326, in handshake
    status_code, response_headers = await self.read_http_response()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/websockets/legacy/client.py", line 144, in read_http_response
    raise InvalidMessage("did not receive a valid HTTP response") from exc
websockets.exceptions.InvalidMessage: did not receive a valid HTTP response
Traceback (most recent call last):
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/urllib3/connectionpool.py", line 398, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/urllib3/connection.py", line 239, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/http/client.py", line 1282, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/http/client.py", line 1328, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/http/client.py", line 1277, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/http/client.py", line 1037, in _send_output
    self.send(msg)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/http/client.py", line 975, in send
    self.connect()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/urllib3/connection.py", line 205, in connect
    conn = self._new_conn()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/urllib3/connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f947595e2c0>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=58000): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f947595e2c0>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/niu/code/halcon/paramsearchhalcon/python/NNI/star_TPE/main.py", line 64, in <module>
    experiment.run(58000)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/experiment/experiment.py", line 183, in run
    self._wait_completion()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/experiment/experiment.py", line 163, in _wait_completion
    status = self.get_status()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/experiment/experiment.py", line 283, in get_status
    resp = rest.get(self.port, '/check-status', self.url_prefix)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/experiment/rest.py", line 43, in get
    return request('get', port, api, prefix=prefix)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/experiment/rest.py", line 31, in request
    resp = requests.request(method, url, timeout=timeout)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/requests/adapters.py", line 565, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=58000): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f947595e2c0>: Failed to establish a new connection: [Errno 111] Connection refused'))
[2023-03-21 09:53:09] Stopping experiment, please wait...
[2023-03-21 09:53:09] ERROR: HTTPConnectionPool(host='localhost', port=58000): Max retries exceeded with url: /api/v1/nni/experiment (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9519c34460>: Failed to establish a new connection: [Errno 111] Connection refused'))
Traceback (most recent call last):
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/urllib3/connectionpool.py", line 398, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/urllib3/connection.py", line 239, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/http/client.py", line 1282, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/http/client.py", line 1328, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/http/client.py", line 1277, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/http/client.py", line 1037, in _send_output
    self.send(msg)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/http/client.py", line 975, in send
    self.connect()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/urllib3/connection.py", line 205, in connect
    conn = self._new_conn()
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/urllib3/connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f9519c34460>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=58000): Max retries exceeded with url: /api/v1/nni/experiment (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9519c34460>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/experiment/experiment.py", line 143, in _stop_impl
    rest.delete(self.port, '/experiment', self.url_prefix)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/experiment/rest.py", line 52, in delete
    request('delete', port, api, prefix=prefix)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/nni/experiment/rest.py", line 31, in request
    resp = requests.request(method, url, timeout=timeout)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/home/niu/miniconda3/envs/halcon/lib/python3.10/site-packages/requests/adapters.py", line 565, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=58000): Max retries exceeded with url: /api/v1/nni/experiment (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9519c34460>: Failed to establish a new connection: [Errno 111] Connection refused'))
[2023-03-21 09:53:09] WARNING: Cannot gracefully stop experiment, killing NNI process...
[2023-03-21 09:53:09] Experiment stopped
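
The `Connection refused` errors above mean nothing was listening on port 58000 any more when the client polled `/api/v1/nni/check-status`. A quick way to confirm that from outside the experiment script is to probe the same URL by hand. This is a minimal sketch using only the standard library; `rest_server_is_up` is a hypothetical helper, not part of NNI, and the only thing taken from the log is the URL path and the port.

```python
import http.client
import urllib.error
import urllib.request


def rest_server_is_up(port: int, host: str = "localhost", timeout: float = 2.0) -> bool:
    """Return True if GET /api/v1/nni/check-status answers with HTTP 200."""
    url = f"http://{host}:{port}/api/v1/nni/check-status"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Refused, reset, timed-out and malformed responses all count as "down".
        return False


if __name__ == "__main__":
    # 58000 is the port passed to experiment.run() in the traceback above.
    print("REST server reachable:", rest_server_is_up(58000))
```

If this prints `False` while the experiment is supposed to be running, the Node.js manager process has most likely already exited, which matches the unhandled `'error'` event in the nnictl output.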

beta1scat commented 1 year ago

I see that both the Full test - HPO and Local - linux tests on the home page are in a failed state. Is this related to this issue?

liuzhe-lz commented 1 year ago

We are fixing this issue in the v3.0 release.

beta1scat commented 1 year ago

We are fixing this issue in the v3.0 release.

Thanks for your reply. Is there an expected release date?

liuzhe-lz commented 1 year ago

We are likely to release an alpha build that includes the fix this week. The stable release will follow in about 2 weeks.

beta1scat commented 1 year ago

We are likely to release an alpha build that includes the fix this week. The stable release will follow in about 2 weeks.

Thanks for your reply!

QingquanBao commented 1 year ago

@liuzhe-lz Hi, could you tell us when the stable version will be released? The alpha version seems to have bugs.

liuzhe-lz commented 1 year ago

We are doing a bug bash now, and the stable version will be released once all known bugs are fixed. What bugs have you encountered? Please give me a short description, thanks!