microsoft / nni

An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License
14.05k stars, 1.82k forks

Once the experiment reaches a certain point, it generally stops running and reports an error. #5802

Open EternityJune25 opened 3 months ago

EternityJune25 commented 3 months ago

Describe the issue:

"Once the experiment reaches a certain point, it generally stops running and reports an error."

[2024-08-09 23:59:19] ERROR (nni.runtime.msg_dispatcher_base/Thread-1 (command_queue_worker)) 10
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.10/site-packages/nni/runtime/msg_dispatcher_base.py", line 108, in command_queue_worker
    self.process_command(command, data)
  File "/root/miniconda3/lib/python3.10/site-packages/nni/runtime/msg_dispatcher_base.py", line 154, in process_command
    command_handlers[command](data)
  File "/root/miniconda3/lib/python3.10/site-packages/nni/runtime/msg_dispatcher.py", line 148, in handle_report_metric_data
    self._handle_final_metric_data(data)
  File "/root/miniconda3/lib/python3.10/site-packages/nni/runtime/msg_dispatcher.py", line 201, in _handle_final_metric_data
    self.tuner.receive_trial_result(id_, _trial_params[id_], value, customized=customized,
  File "/root/miniconda3/lib/python3.10/site-packages/nni/algorithms/hpo/tpe_tuner.py", line 197, in receive_trial_result
    params = self._running_params.pop(parameter_id)
KeyError: 10

content: '{"parameter_id": 12, "trial_job_id": "YbXt7", "type": "PERIODICAL", "sequence": 199, "value": "0.2895440735801888"}' }
[2024-08-10 00:00:06] ERROR (WsChannel.default) Channel closed. Ignored command { type: 'ME', content: '{"parameter_id": 12, "trial_job_id": "YbXt7", "type": "FINAL", "sequence": 0, "value": "0.2898187191127104"}' }
[2024-08-10 00:00:07] INFO (NNIManager) Trial job YbXt7 status changed from RUNNING to SUCCEEDED
[2024-08-10 00:00:07] ERROR (WsChannel.default) Channel closed. Ignored command { type: 'EN', content: '{"trial_job_id":"YbXt7","event":"SUCCEEDED","hyper_params":"{\"parameter_id\": 12, \"parameter_source\": \"algorithm\", \"parameters\": {\"activate\": \"elu\", \"d_emb\": 64, \"d_hid\": 32, \"drop\": 0.3884039376983632, \"gamma\": 6.4905452738897065, \"l1\": 1.4578424787079767, \"l2\": 38.44410448714523, \"l4\": 0.29277084068918136, \"lr\": 9.015207683143664e-05, \"mask\": 0.004542790568841141, \"mode\": \"GAT\", \"t\": 0.6139793721895512, \"mask_edge\": 0.07705512469912157, \"instance_temperature\": 0.6737029785000441, \"cluster_temperature\": 0.5472419195458156}, \"parameter_index\": 0}"}' }
[2024-08-10 00:00:07] ERROR (WsChannel.default) Channel closed. Ignored command { type: 'GE', content: '1' }
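
For readers trying to interpret the traceback above: the failing statement is a plain `dict.pop(key)` with no default, which raises `KeyError` when the key is absent. The sketch below reproduces only that pattern in isolation. It is not NNI's code: `SketchTuner`, `generate`, `receive_strict` and `receive_tolerant` are hypothetical names, and the "result reported twice" scenario is just one assumed way a missing key could come about, not a confirmed cause of this issue.

```python
# Hypothetical stand-in for a tuner's bookkeeping; NOT NNI's implementation.
class SketchTuner:
    def __init__(self):
        # parameter_id -> parameters of trials that are still running
        self._running_params = {}

    def generate(self, parameter_id, params):
        # Remember the parameters handed out for this id.
        self._running_params[parameter_id] = params
        return params

    def receive_strict(self, parameter_id, value):
        # Same pattern as the failing line in the traceback:
        # pop() without a default raises KeyError if the id is unknown.
        params = self._running_params.pop(parameter_id)
        return params, value

    def receive_tolerant(self, parameter_id, value):
        # Defensive variant (an assumption, not a proposed patch):
        # unknown ids are skipped instead of raising.
        params = self._running_params.pop(parameter_id, None)
        if params is None:
            return None
        return params, value


if __name__ == "__main__":
    tuner = SketchTuner()
    tuner.generate(10, {"lr": 9e-05})
    tuner.receive_strict(10, 0.28)      # first result: id 10 is removed
    try:
        tuner.receive_strict(10, 0.28)  # assumed scenario: same id reported again
    except KeyError as err:
        # str(KeyError(10)) is "10", which matches the bare "10" in the log line
        print("KeyError:", err)
```

Note that the string form of `KeyError(10)` is just `10`, which is why the ERROR line in the log ends with a bare `10` before the traceback.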

Environment:

Configuration:

Log message:

How to reproduce it?:

DiamondNova commented 2 months ago

I have the same issue.