Open studywolf opened 1 year ago
I encountered the same problem. Sometimes it stopped after about ten trials, and sometimes it stopped after more than 100 trials. I haven't found what caused it.
I also have similar problems.
I have the same issue as well and am looking forward to a solution.
Environment:
NNI version: 3.0
Training service (local|remote|pai|aml|etc): local
Client OS: ubuntu 22.04.3
Server OS (for remote mode only):
Python version: 3.10.13
PyTorch/TensorFlow version: PyTorch 2.1.0
Is conda/virtualenv/venv used?: virtualenv
Is running in Docker?: no
same issue.
[2023-11-29 21:52:30] ERROR (nni.runtime.msg_dispatcher_base/Thread-1) 1
Traceback (most recent call last):
  File "/home/bingyaowang/anaconda3/envs/myrsn1/lib/python3.8/site-packages/nni/runtime/msg_dispatcher_base.py", line 108, in command_queue_worker
    self.process_command(command, data)
  File "/home/bingyaowang/anaconda3/envs/myrsn1/lib/python3.8/site-packages/nni/runtime/msg_dispatcher_base.py", line 154, in process_command
    command_handlers[command](data)
  File "/home/bingyaowang/anaconda3/envs/myrsn1/lib/python3.8/site-packages/nni/runtime/msg_dispatcher.py", line 148, in handle_report_metric_data
    self._handle_final_metric_data(data)
  File "/home/bingyaowang/anaconda3/envs/myrsn1/lib/python3.8/site-packages/nni/runtime/msg_dispatcher.py", line 201, in _handle_final_metric_data
    self.tuner.receive_trial_result(id_, _trial_params[id_], value, customized=customized,
  File "/home/bingyaowang/anaconda3/envs/myrsn1/lib/python3.8/site-packages/nni/algorithms/hpo/tpe_tuner.py", line 197, in receive_trial_result
    params = self._running_params.pop(parameter_id)
KeyError: 1
[2023-11-29 21:52:31] DEBUG (websockets.client/NNI-WebSocketEventLoop) < TEXT '{"type":"EN","content":"{\"trial_job_id\":\"..._index\\\": 0}\"}"}' [402 bytes]
[2023-11-29 21:52:31] DEBUG (websockets.client/NNI-WebSocketEventLoop) < TEXT '{"type":"GE","content":"1"}' [27 bytes]
[2023-11-29 21:52:31] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher exiting...
[2023-11-29 21:52:33] DEBUG (websockets.client/NNI-WebSocketEventLoop) < PING '' [0 bytes]
[2023-11-29 21:52:33] DEBUG (websockets.client/NNI-WebSocketEventLoop) > PONG '' [0 bytes]
[2023-11-29 21:52:33] DEBUG (websockets.client/NNI-WebSocketEventLoop) % sending keepalive ping
[2023-11-29 21:52:33] DEBUG (websockets.client/NNI-WebSocketEventLoop) > PING c8 af a3 c2 [binary, 4 bytes]
[2023-11-29 21:52:33] DEBUG (websockets.client/NNI-WebSocketEventLoop) < PONG c8 af a3 c2 [binary, 4 bytes]
[2023-11-29 21:52:33] DEBUG (websockets.client/NNI-WebSocketEventLoop) % received keepalive pong
[2023-11-29 21:52:34] DEBUG (websockets.client/NNI-WebSocketEventLoop) > TEXT '{"type": "bye"}' [17 bytes]
[2023-11-29 21:52:34] DEBUG (websockets.client/NNI-WebSocketEventLoop) = connection is CLOSING
[2023-11-29 21:52:34] DEBUG (websockets.client/NNI-WebSocketEventLoop) > CLOSE 4000 (private use) client intentionally close [28 bytes]
[2023-11-29 21:52:34] DEBUG (websockets.client/NNI-WebSocketEventLoop) < CLOSE 4000 (private use) client intentionally close [28 bytes]
[2023-11-29 21:52:34] DEBUG (websockets.client/NNI-WebSocketEventLoop) = connection is CLOSED
[2023-11-29 21:52:34] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher terminiated
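For anyone reading the traceback: the final frame is a plain `dict.pop` on a key that is no longer in `_running_params`. The sketch below is a toy model of that failure mode only. `ToyTuner` and its methods are hypothetical stand-ins, not NNI's `TpeTuner`, and the thread has not confirmed that a duplicate or unknown result is what actually happens inside NNI. It just shows that popping the same parameter id twice raises exactly this kind of `KeyError`.

```python
class ToyTuner:
    """Hypothetical stand-in for a tuner; NOT NNI's TpeTuner."""

    def __init__(self):
        # Maps parameter_id -> parameters of trials that are still running.
        self._running_params = {}

    def generate_parameters(self, parameter_id):
        # Record the parameters handed out for this trial (values are made up).
        params = {"lr": 0.1 * (parameter_id + 1)}
        self._running_params[parameter_id] = params
        return params

    def receive_trial_result(self, parameter_id, value):
        # Same shape as the failing line in the traceback:
        # pop() without a default raises KeyError if the id is absent.
        params = self._running_params.pop(parameter_id)
        return params, value


tuner = ToyTuner()
tuner.generate_parameters(0)
tuner.receive_trial_result(0, 0.9)  # first result: id 0 is removed

error = None
try:
    # Assumed scenario: a second result arrives for an id already popped.
    tuner.receive_trial_result(0, 0.95)
except KeyError as exc:
    error = exc
    print("KeyError:", exc)
```

In this toy model, any path that delivers a result for an id not present in the dict (already removed, or never registered) fails the same way, which would be consistent with the error only showing up when several trials are in flight at once.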
I set trial_concurrency=8 and it always stopped at 10~14 trials.
Same issue. I set trial_concurrency=16 and it stopped at ~20 trials, with the dispatcher terminated.
Same issue here on the latest version of NNI. The number of trials it completes before failing seems random each time, but dispatcher.log always ends with:
params = self._running_params.pop(parameter_id)
KeyError: ...
I think the problem went away after downgrading to nni<3.
I faced the same problem, and in my case a stopgap solution is to use the "Anneal" tuner instead of the "TPE" tuner. Hope it helps!
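A rough sketch of what that tuner swap could look like, written as a plain Python dict rather than a real experiment file. The key names (`trialConcurrency`, `tuner`, `name`, `classArgs`, `optimize_mode`) are meant to mirror the fields of an NNI experiment config, but the concrete values and the `with_tuner` helper are assumptions for illustration, not taken from anyone's setup in this thread.

```python
import copy


def with_tuner(config, tuner_name):
    """Return a copy of `config` with only the tuner name replaced.

    Hypothetical helper for illustration; not part of NNI.
    """
    new_config = copy.deepcopy(config)
    new_config["tuner"]["name"] = tuner_name
    return new_config


# Assumed example config; the values are placeholders.
base_config = {
    "trialConcurrency": 8,
    "tuner": {
        "name": "TPE",
        "classArgs": {"optimize_mode": "maximize"},
    },
}

# Stopgap from the comment above: keep everything else, switch TPE -> Anneal.
workaround = with_tuner(base_config, "Anneal")
print(workaround["tuner"])
```

Only the tuner name changes here; concurrency and the rest of the config stay as they were, so the swap is easy to revert once the TPE problem is fixed.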
I found that any version above 2.5 gives me the problem. Version 2.5 has been fine up to the hard-coded memory limit (roughly 45k trials).
Describe the issue:
When I set trialConcurrency > 1, NNI fails out with
When trialConcurrency = n > 1, NNI runs n trials and then fails out with this error. This happens for every value of n I've tried (2, 5, 10, 100). When trialConcurrency=1, there are no problems.
Environment:
Configuration:
I haven't created a minimal reproducible example yet. I'm hoping someone might recognize this problem, as it seems pretty basic and may just be a version issue somewhere.