An open source AutoML toolkit for automate machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
[2024-07-03 16:45:49] INFO (nni.tuner.tpe/MainThread) Using random seed 668056533
[2024-07-03 16:45:49] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher started
[2024-07-03 16:45:49] INFO (nni.runtime.msg_dispatcher/Thread-1 (command_queue_worker)) Initial search space: {'hidden_sizes': {'_type': 'choice', '_value': [[], [256], [512], [1024], [1024, 512], [1024, 512, 256], [512, 256]]}, 'learning_rate': {'_type': 'loguniform', '_value': [1e-06, 0.1]}, 'batch_size': {'_type': 'choice', '_value': [32, 64, 128]}, 'num_epochs': {'_type': 'randint', '_value': [100, 1000]}, 'dropout_prob': {'_type': 'uniform', '_value': [0, 0.5]}, 'use_batch_norm': {'_type': 'choice', '_value': [True, False]}, 'activation_fn': {'_type': 'choice', '_value': ['relu', 'leaky_relu', 'sigmoid', 'tanh', 'elu', 'selu']}, 'patience': {'_type': 'randint', '_value': [0, 10]}}
[2024-07-03 17:10:21] ERROR (nni.runtime.msg_dispatcher_base/Thread-1 (command_queue_worker)) 45
Traceback (most recent call last):
File "/home/josep/.pyenv/versions/3.10.14/lib/python3.10/site-packages/nni/runtime/msg_dispatcher_base.py", line 108, in command_queue_worker
self.process_command(command, data)
File "/home/josep/.pyenv/versions/3.10.14/lib/python3.10/site-packages/nni/runtime/msg_dispatcher_base.py", line 154, in process_command
command_handlers[command](data)
File "/home/josep/.pyenv/versions/3.10.14/lib/python3.10/site-packages/nni/runtime/msg_dispatcher.py", line 148, in handle_report_metric_data
self._handle_final_metric_data(data)
File "/home/josep/.pyenv/versions/3.10.14/lib/python3.10/site-packages/nni/runtime/msg_dispatcher.py", line 201, in _handle_final_metric_data
self.tuner.receive_trial_result(id_, _trial_params[id_], value, customized=customized,
File "/home/josep/.pyenv/versions/3.10.14/lib/python3.10/site-packages/nni/algorithms/hpo/tpe_tuner.py", line 197, in receive_trial_result
params = self._running_params.pop(parameter_id)
KeyError: 45
[2024-07-03 17:10:28] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher exiting...
[2024-07-03 17:10:28] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher terminiated
**How to reproduce it?**:
It happens not just once for me, but occasionally with different experiments. I tried lowering concurrency to 1 in order to avoid it, but it appears nonetheless.
In this example, it was trial 45 evidently which caused the crash. In the web ui, I can see that trial 45 succeeded and there is a recorded metric value for it. Yet, when TPE goes to find its parameters, it seems it cannot find them?
Describe the issue: It seems the dispatcher crashes for me from unknown causes, and when this happens, my experiment stops running.
Environment:
Configuration:
Log message: