microsoft / nni

An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License

The system crashes when running the “Hello NAS” example code with GPU #5746

Open misagamisaga opened 4 months ago

misagamisaga commented 4 months ago

My steps

I cleared my environment beforehand; there are no extra package conflicts.

The details of one of the errors

(Env: Windows 11, using conda, PyTorch 2.2.0) Before the crash, I saw a lot of python.exe processes in the Task Manager. I recorded the error at that time:

[2024-02-20 22:24:17] Creating experiment, Experiment ID: 5p9fhwgt
[2024-02-20 22:24:17] Starting web server...
[2024-02-20 22:24:20] Setting up...
[2024-02-20 22:24:20] Web portal URLs: http://26.26.26.1:8084 http://169.254.77.17:8084 http://169.254.202.152:8084 http://169.254.67.238:8084 http://192.168.101.15:8084 http://127.0.0.1:8084
[2024-02-20 22:24:21] Successfully update searchSpace.
[2024-02-20 22:24:21] Checkpoint saved to C:\Users\DELL\nni-experiments\5p9fhwgt\checkpoint.
[2024-02-20 22:24:21] Experiment initialized successfully. Starting exploration strategy...
[2024-02-20 22:24:59] ERROR: Strategy failed to execute.
Traceback (most recent call last):
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 537, in _make_request
    response = conn.getresponse()
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connection.py", line 466, in getresponse
    httplib_response = super().getresponse()
  File "E:\conda\envs\pytorch_nni\lib\http\client.py", line 1375, in getresponse
    response.begin()
  File "E:\conda\envs\pytorch_nni\lib\http\client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "E:\conda\envs\pytorch_nni\lib\http\client.py", line 279, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "E:\conda\envs\pytorch_nni\lib\socket.py", line 705, in readinto
    return self._sock.recv_into(b)
TimeoutError: timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\adapters.py", line 486, in send
    resp = conn.urlopen(
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 847, in urlopen
    retries = retries.increment(
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\util\retry.py", line 470, in increment
    raise reraise(type(error), error, _stacktrace)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\util\util.py", line 39, in reraise
    raise value
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 793, in urlopen
    response = self._make_request(
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 539, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 370, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='localhost', port=8084): Read timed out. (read timeout=20)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "f:\today\nni\try_nni.py", line 144, in <module>
    exp3.run(port=8084)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\experiment\experiment.py", line 236, in run
    return self._run_impl(port, wait_completion, debug)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\experiment\experiment.py", line 205, in _run_impl
    self.start(port, debug)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\nas\experiment\experiment.py", line 270, in start
    self._start_engine_and_strategy()
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\nas\experiment\experiment.py", line 230, in _start_engine_and_strategy
    self.strategy.run()
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\nas\strategy\base.py", line 170, in run
    self._run()
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\nas\strategy\bruteforce.py", line 220, in _run
    if not self.wait_for_resource():
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\nas\strategy\base.py", line 100, in wait_for_resource
    if not self.engine.budget_available():
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\nas\execution\training_service.py", line 271, in budget_available
    return self.nodejs_binding.get_status() in ['INITIALIZED', 'RUNNING', 'TUNER_NO_MORE_TRIAL']
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\experiment\experiment.py", line 413, in get_status
    resp = rest.get(self.port, '/check-status', self.url_prefix)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\experiment\rest.py", line 43, in get
    return request('get', port, api, prefix=prefix)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\experiment\rest.py", line 31, in request
    resp = requests.request(method, url, timeout=timeout)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\adapters.py", line 532, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='localhost', port=8084): Read timed out. (read timeout=20)
[2024-02-20 22:24:59] Stopping experiment, please wait...
[2024-02-20 22:25:00] Checkpoint saved to C:\Users\DELL\nni-experiments\5p9fhwgt\checkpoint.
[2024-02-20 22:25:20] ERROR: HTTPConnectionPool(host='localhost', port=8084): Read timed out. (read timeout=20)
Traceback (most recent call last):
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 537, in _make_request
    response = conn.getresponse()
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connection.py", line 466, in getresponse
    httplib_response = super().getresponse()
  File "E:\conda\envs\pytorch_nni\lib\http\client.py", line 1375, in getresponse
    response.begin()
  File "E:\conda\envs\pytorch_nni\lib\http\client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "E:\conda\envs\pytorch_nni\lib\http\client.py", line 279, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "E:\conda\envs\pytorch_nni\lib\socket.py", line 705, in readinto
    return self._sock.recv_into(b)
TimeoutError: timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\adapters.py", line 486, in send
    resp = conn.urlopen(
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 847, in urlopen
    retries = retries.increment(
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\util\retry.py", line 470, in increment
    raise reraise(type(error), error, _stacktrace)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\util\util.py", line 39, in reraise
    raise value
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 793, in urlopen
    response = self._make_request(
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 539, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 370, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='localhost', port=8084): Read timed out. (read timeout=20)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\experiment\experiment.py", line 171, in _stop_nni_manager
    rest.delete(self.port, '/experiment', self.url_prefix)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\experiment\rest.py", line 52, in delete
    request('delete', port, api, prefix=prefix)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\experiment\rest.py", line 31, in request
    resp = requests.request(method, url, timeout=timeout)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\adapters.py", line 532, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='localhost', port=8084): Read timed out. (read timeout=20)
[2024-02-20 22:25:20] WARNING: Cannot gracefully stop experiment, killing NNI process...
[2024-02-20 22:25:21] ERROR: Failed to receive command. Retry in 0s
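
The traceback above shows the Python side giving up after a 20-second read timeout on `/check-status` at localhost:8084. A minimal sketch for probing that endpoint by hand, to see whether the NNI manager process still answers at all, might look like the following. The `/api/v1/nni` base path, the helper names `status_url` and `probe`, and the 5-second timeout are assumptions for illustration and are not taken from the log.

```python
import http.client
import time
import urllib.request


def status_url(port, api="/check-status", base="/api/v1/nni"):
    """Build the local REST URL. `base` is an assumed API root, not from the log."""
    return f"http://localhost:{port}{base}{api}"


def probe(url, timeout=5.0):
    """Send one GET request and return (ok, seconds_elapsed, detail)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok, detail = True, f"HTTP {resp.status}"
    except (OSError, http.client.HTTPException) as exc:
        # URLError, HTTPError, timeouts and refused connections are all OSError subclasses.
        ok, detail = False, f"{type(exc).__name__}: {exc}"
    return ok, time.monotonic() - start, detail


if __name__ == "__main__":
    # Port 8084 is the one used in the log above.
    ok, elapsed, detail = probe(status_url(8084))
    print(f"ok={ok} elapsed={elapsed:.1f}s detail={detail}")
```

If the probe also times out while the experiment is running, the web server process itself has stopped responding, which would match the ReadTimeout in the log rather than point at the search strategy code.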

534145232 commented 3 months ago

I have the same issue.

Imfire-waw commented 2 months ago

Same issue here. But I found that if we set exp.config.trial_gpu_number = 0, the experiment can be launched without using the GPU.
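
A minimal sketch of that workaround, for anyone who wants to try it. The helper name `disable_trial_gpu` is hypothetical, and a `SimpleNamespace` stands in for `exp.config` so the snippet runs without NNI installed; only the `trial_gpu_number` attribute comes from this thread.

```python
from types import SimpleNamespace


def disable_trial_gpu(config):
    """Set trial_gpu_number to 0 on a config object and return the previous value."""
    previous = getattr(config, "trial_gpu_number", None)
    config.trial_gpu_number = 0
    return previous


# Stand-in for exp.config (assumption: the real object accepts plain attribute assignment).
config = SimpleNamespace(trial_gpu_number=1)
previous = disable_trial_gpu(config)
print(f"trial_gpu_number: {previous} -> {config.trial_gpu_number}")
```

With a real experiment, the call would presumably be `disable_trial_gpu(exp.config)` before `exp.run(port=8084)`. This only avoids the crash by not requesting a GPU for trials; it does not fix the underlying problem.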

ranranrannervous commented 2 months ago

Same issue here. But I found that if we set exp.config.trial_gpu_number = 0, the experiment can be launched without using the GPU. But it is too slow.

zhxn30663 commented 2 months ago

It may be caused by dwm.exe or the NVIDIA driver. Updating the GPU driver or switching to the Studio version didn't work.

Windows 11 22631.3447, Intel i9-14900HX, RTX 4090, NVIDIA Studio driver 552.22.

raseidi commented 2 days ago

Any news on this?