microsoft / nni

An open source AutoML toolkit to automate the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License

Can not build connection to server's port. #5661

Open Data-reindeer opened 1 year ago

Data-reindeer commented 1 year ago

Describe the issue: I tried to run nnictl create --config scripts/exp.yml -p 8079, but got the following error:

[2023-08-14 12:52:48] Creating experiment, Experiment ID: 7fgdry6w
[2023-08-14 12:52:48] Starting web server...
[2023-08-14 12:52:49] WARNING: Timeout, retry...
[2023-08-14 12:52:50] WARNING: Timeout, retry...
[2023-08-14 12:52:51] ERROR: Create experiment failed.

I found a closed issue, #5126, that describes a similar problem, with one difference: my 7fgdry6w directory contains only a /log directory and no /db directory. The detailed traceback is below:

Traceback (most recent call last):
  File "/home/anaconda3/envs/env1/lib/python3.7/site-packages/urllib3/connection.py", line 175, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/home/anaconda3/envs/env1/lib/python3.7/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/home/anaconda3/envs/env1/lib/python3.7/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/anaconda3/envs/env1/lib/python3.7/site-packages/urllib3/connectionpool.py", line 710, in urlopen
    chunked=chunked,
  File "/home/anaconda3/envs/env1/lib/python3.7/site-packages/urllib3/connectionpool.py", line 398, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/home/anaconda3/envs/env1/lib/python3.7/site-packages/urllib3/connection.py", line 239, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/home/anaconda3/envs/env1/lib/python3.7/http/client.py", line 1281, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/anaconda3/envs/env1/lib/python3.7/http/client.py", line 1327, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/anaconda3/envs/env1/lib/python3.7/http/client.py", line 1276, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/anaconda3/envs/env1/lib/python3.7/http/client.py", line 1036, in _send_output
    self.send(msg)
  File "/home/anaconda3/envs/env1/lib/python3.7/http/client.py", line 976, in send
    self.connect()
  File "/home/anaconda3/envs/env1/lib/python3.7/site-packages/urllib3/connection.py", line 205, in connect
    conn = self._new_conn()
  File "/home/anaconda3/envs/env1/lib/python3.7/site-packages/urllib3/connection.py", line 187, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fcec414a810>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/anaconda3/envs/env1/lib/python3.7/site-packages/requests/adapters.py", line 450, in send
    timeout=timeout
  File "/home/anaconda3/envs/env1/lib/python3.7/site-packages/urllib3/connectionpool.py", line 788, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/home/anaconda3/envs/env1/lib/python3.7/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=8079): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcec414a810>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/anaconda3/envs/env1/bin/nnictl", line 8, in <module>
    sys.exit(parse_args())
  File "/home/anaconda3/envs/env1/lib/python3.7/site-packages/nni/tools/nnictl/nnictl.py", line 503, in parse_args
    args.func(args)
  File "/home/anaconda3/envs/env1/lib/python3.7/site-packages/nni/tools/nnictl/launcher.py", line 91, in create_experiment
    exp.start(port, debug, RunMode.Detach)
  File "/home/anaconda3/envs/env1/lib/python3.7/site-packages/nni/experiment/experiment.py", line 135, in start
    self._start_impl(port, debug, run_mode, None, [])
  File "/home/anaconda3/envs/env1/lib/python3.7/site-packages/nni/experiment/experiment.py", line 104, in _start_impl
    self.url_prefix, tuner_command_channel, tags)
  File "/home/anaconda3/envs/env1/lib/python3.7/site-packages/nni/experiment/launcher.py", line 148, in start_experiment
    raise e
  File "/home/anaconda3/envs/env1/lib/python3.7/site-packages/nni/experiment/launcher.py", line 126, in start_experiment
    _check_rest_server(port, url_prefix=url_prefix)
  File "/home/anaconda3/envs/env1/lib/python3.7/site-packages/nni/experiment/launcher.py", line 196, in _check_rest_server
    rest.get(port, '/check-status', url_prefix)
  File "/home/anaconda3/envs/env1/lib/python3.7/site-packages/nni/experiment/rest.py", line 43, in get
    return request('get', port, api, prefix=prefix)
  File "/home/anaconda3/envs/env1/lib/python3.7/site-packages/nni/experiment/rest.py", line 31, in request
    resp = requests.request(method, url, timeout=timeout)
  File "/home/anaconda3/envs/env1/lib/python3.7/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/anaconda3/envs/env1/lib/python3.7/site-packages/requests/sessions.py", line 529, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/anaconda3/envs/env1/lib/python3.7/site-packages/requests/sessions.py", line 645, in send
    r = adapter.send(request, **kwargs)
  File "/home/anaconda3/envs/env1/lib/python3.7/site-packages/requests/adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=8079): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcec414a810>: Failed to establish a new connection: [Errno 111] Connection refused'))
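A note on reading the chain above: every layer wraps the same root cause, ConnectionRefusedError [Errno 111], which generally means nothing was listening on localhost:8079 when nnictl polled /check-status. The sketch below is one way to confirm that independently of NNI. It only uses the standard library; the helper name rest_server_listening is hypothetical and is not part of NNI.

```python
import socket


def rest_server_listening(port, host="localhost", timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds.

    Hypothetical helper, not an NNI API. A refused connection or a
    timeout is reported as False.
    """
    try:
        # create_connection tries each address that `host` resolves to.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # ConnectionRefusedError and socket timeouts are OSError subclasses.
        return False


if __name__ == "__main__":
    # 8079 is the port passed to nnictl with -p in this issue.
    print("listening on 8079:", rest_server_listening(8079))
```

If this prints False right after "Starting web server...", the server process most likely exited during startup, so its own log is the next place to look.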

Environment:

Configuration:

trialCommand: python /home/NSL_MRL/main.py --epochs=1000 --lr=5e-4 --batch_size=256
trialGpuNumber: 1
trialConcurrency: 5
maxExperimentDuration: 100000h
maxTrialNumber: 100000
tuner:
  name: GridSearch
trainingService:
  platform: local
  useActiveGpu: True
  gpuIndices: [0,1,2,4,5]
  maxTrialNumberPerGpu: 1

Log message:

How to reproduce it?:

Data-reindeer commented 1 year ago

Moreover, I used the same command, nnictl create --config scripts/exp.yml, three days ago and it ran fine. Today, when I tried to run it on another port, I got the error above.

My versions are torch==1.10.2+cu113 and nni==2.10.1.

I can confirm that port 8079 is available.
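For anyone who wants to verify the same thing on their machine, here is a minimal sketch of one way to check that a port is free before launching: try to bind to it. The helper name port_is_free is hypothetical, and the check assumes the server would bind on the same interface (127.0.0.1 here).

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if this process can bind (host, port) right now.

    Hypothetical helper for illustration. The result is a snapshot:
    another process could still take the port afterwards.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically EADDRINUSE (port taken) or EACCES (privileged port).
            return False
        return True


if __name__ == "__main__":
    print("port 8079 free:", port_is_free(8079))
```

A free port rules out a port conflict, but it does not explain the refused connection by itself: the error also appears when the server process never manages to start.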

franzhd commented 1 year ago

Same problem here with nni 3.0

FakeEnd commented 1 year ago

Same problem here with nni 3.0. With nni 2.5, the problem disappears.

FakeEnd commented 1 year ago

I checked nnictl_error.log and found that the failure is caused by this: node:/lib64/libm.so.6: version 'GLIBC_2.27' not found (required by node). It seems I need GLIBC_2.27, but how can I install it when I have no sudo access?
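To check whether a machine hits the same mismatch, the sketch below compares the GLIBC version named in that log line with the version Python reports for the running interpreter. The helper names required_glibc and installed_glibc are hypothetical; platform.libc_ver() is from the standard library and may return empty values on non-glibc systems, which the code treats as unknown.

```python
import platform
import re


def required_glibc(error_line):
    """Extract (major, minor) from a 'GLIBC_x.y' token, or None if absent."""
    match = re.search(r"GLIBC_(\d+)\.(\d+)", error_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def installed_glibc():
    """Return (major, minor) of the glibc Python is linked against, or None."""
    name, version = platform.libc_ver()
    match = re.match(r"(\d+)\.(\d+)", version or "")
    if name != "glibc" or match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    # Log line copied from the comment above.
    line = "node:/lib64/libm.so.6: version 'GLIBC_2.27' not found (required by node)"
    needed = required_glibc(line)
    have = installed_glibc()
    print("required:", needed, "installed:", have)
    if needed and have:
        print("sufficient:", have >= needed)
```

If the installed version is lower than the required one, the node binary bundled with that nni release cannot start on the host, which would match the refused connections reported earlier in this thread.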