microsoft / nni

An open source AutoML toolkit for automate machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License
14.06k stars 1.82k forks source link

Can't run more than n trials with trialConcurrency=n > 1 #5689

Open studywolf opened 1 year ago

studywolf commented 1 year ago

Describe the issue:

When I set trialConcurrency > 1, NNI fails out with

[2023-09-30 12:57:40] ERROR (nni.runtime.msg_dispatcher_base/Thread-1 (command_queue_worker)) 7
Traceback (most recent call last):
  File "/home/wolf/miniconda3/envs/mausspaun/lib/python3.10/site-packages/nni/runtime/msg_dispatcher_base.py", line 108, in command_queue_worker
    self.process_command(command, data)
  File "/home/wolf/miniconda3/envs/mausspaun/lib/python3.10/site-packages/nni/runtime/msg_dispatcher_base.py", line 154, in process_command
    command_handlers[command](data)
  File "/home/wolf/miniconda3/envs/mausspaun/lib/python3.10/site-packages/nni/runtime/msg_dispatcher.py", line 148, in handle_report_metric_data
    self._handle_final_metric_data(data)
  File "/home/wolf/miniconda3/envs/mausspaun/lib/python3.10/site-packages/nni/runtime/msg_dispatcher.py", line 201, in _handle_final_metric_data
    self.tuner.receive_trial_result(id_, _trial_params[id_], value, customized=customized,
  File "/home/wolf/miniconda3/envs/mausspaun/lib/python3.10/site-packages/nni/algorithms/hpo/tpe_tuner.py", line 197, in receive_trial_result
    params = self._running_params.pop(parameter_id)
KeyError: 7
[2023-09-30 12:57:41] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher exiting...
[2023-09-30 12:57:44] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher terminiated

When the trialConcurrency = n > 1, then NNI runs n trials and fails out with this error. This happens for all the different n values i've tried (2, 5, 10, 100). When trialConcurrency=1, no problems.

Environment:

Configuration:

I haven't created a minimal reproducible example yet, I'm hoping someone might recognize this problem, as it seems pretty basic and maybe is just a version issue somewhere?

igodrr commented 1 year ago

I encountered the same problem, sometimes it stopped after about ten trials ,and sometimes it stopped after more than 100 trials. I haven't found what caused the problem.

cehw commented 1 year ago

I also have similar problems.

kv-42 commented 1 year ago

I have the same issue as well and looking forward the solution.

Environment: NNI version: 3.0 Training service (local|remote|pai|aml|etc): local Client OS: ubuntu 22.04.3 Server OS (for remote mode only): Python version: 3.10.13 PyTorch/TensorFlow version: PyTorch 2.1.0 Is conda/virtualenv/venv used?: virtualenv Is running in Docker?: no

wby13 commented 11 months ago

same issue.

[2023-11-29 21:52:30] ERROR (nni.runtime.msg_dispatcher_base/Thread-1) 1 Traceback (most recent call last): File "/home/bingyaowang/anaconda3/envs/myrsn1/lib/python3.8/site-packages/nni/runtime/msg_dispatcher_base.py", line 108, in command_queue_worker self.process_command(command, data) File "/home/bingyaowang/anaconda3/envs/myrsn1/lib/python3.8/site-packages/nni/runtime/msg_dispatcher_base.py", line 154, in process_command command_handlerscommand File "/home/bingyaowang/anaconda3/envs/myrsn1/lib/python3.8/site-packages/nni/runtime/msg_dispatcher.py", line 148, in handle_report_metric_data self._handle_final_metric_data(data) File "/home/bingyaowang/anaconda3/envs/myrsn1/lib/python3.8/site-packages/nni/runtime/msg_dispatcher.py", line 201, in _handle_final_metric_data self.tuner.receive_trialresult(id, _trialparams[id], value, customized=customized, File "/home/bingyaowang/anaconda3/envs/myrsn1/lib/python3.8/site-packages/nni/algorithms/hpo/tpe_tuner.py", line 197, in receive_trial_result params = self._running_params.pop(parameter_id) KeyError: 1 [2023-11-29 21:52:31] DEBUG (websockets.client/NNI-WebSocketEventLoop) < TEXT '{"type":"EN","content":"{\"trial_job_id\":\"..._index\\\": 0}\"}"}' [402 bytes] [2023-11-29 21:52:31] DEBUG (websockets.client/NNI-WebSocketEventLoop) < TEXT '{"type":"GE","content":"1"}' [27 bytes] [2023-11-29 21:52:31] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher exiting... [2023-11-29 21:52:33] DEBUG (websockets.client/NNI-WebSocketEventLoop) < PING '' [0 bytes] [2023-11-29 21:52:33] DEBUG (websockets.client/NNI-WebSocketEventLoop) > PONG '' [0 bytes] [2023-11-29 21:52:33] DEBUG (websockets.client/NNI-WebSocketEventLoop) % sending keepalive ping [2023-11-29 21:52:33] DEBUG (websockets.client/NNI-WebSocketEventLoop) > PING c8 af a3 c2 [binary, 4 bytes] [2023-11-29 21:52:33] DEBUG (websockets.client/NNI-WebSocketEventLoop) < PONG c8 af a3 c2 [binary, 4 bytes] [2023-11-29 21:52:33] DEBUG (websockets.client/NNI-WebSocketEventLoop) % received keepalive pong [2023-11-29 21:52:34] DEBUG (websockets.client/NNI-WebSocketEventLoop) > TEXT '{"type": "bye"}' [17 bytes] [2023-11-29 21:52:34] DEBUG (websockets.client/NNI-WebSocketEventLoop) = connection is CLOSING [2023-11-29 21:52:34] DEBUG (websockets.client/NNI-WebSocketEventLoop) > CLOSE 4000 (private use) client intentionally close [28 bytes] [2023-11-29 21:52:34] DEBUG (websockets.client/NNI-WebSocketEventLoop) < CLOSE 4000 (private use) client intentionally close [28 bytes] [2023-11-29 21:52:34] DEBUG (websockets.client/NNI-WebSocketEventLoop) = connection is CLOSED [2023-11-29 21:52:34] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher terminiated

I set trial_concurrency==8 and it always stopped at 10~14 trials.

XYxiyang commented 11 months ago

Same issue, I set trial_concurrency=16 and stopped at ~20 trials, the dispatcher is terminated

arvoelke commented 10 months ago

Same issue here on the latest version of NNI. It seems random how many trials along it gets each time. Always

    params = self._running_params.pop(parameter_id)
KeyError: ...

in dispatcher.log.

I think the problem went away after downgrading to nni<3.

datngo93 commented 10 months ago

I faced the same problem, and in my case, a stopgap solution is to use "Anneal" tuner instead of "TPE" tuner. Hope it help!

studywolf commented 9 months ago

I found anything above 2.5 gives me the problem, been okay up to the hard coded memory limit with version 2.5 (roughly 45k trials)