microsoft / nni

An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression, and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License

Remote machine can only run 1 job #4073

Open ZhiyuanChen opened 2 years ago

ZhiyuanChen commented 2 years ago

After the trial reports its final result and exits, NNI does nothing. The portal shows that it received the final result, but it still marks the experiment as running. When the trial is stopped manually, NNI cleans up and then waits indefinitely.

2021-08-16 07:49:50,314 [INFO] utils: NNISDK_MEb'{"parameter_id": 1, "trial_job_id": "LE3i6", "type": "FINAL", "sequence": 0, "value": "0"}'
2021-08-16 07:49:50,315 [INFO] utils: final pearson: 0  final spearman: 0       final loss: 0
[2021-08-16 07:49:50] INFO (nni_syslog_trial_LE3i6/Thread-4) NNISDK_MEb'{"parameter_id": 1, "trial_job_id": "LE3i6", "type": "FINAL", "sequence": 0, "value": "0"}'
[2021-08-16 07:55:15.588942] INFO Received command, header: [b'KI00000000000007'], data: [LE3i6]
[2021-08-16 07:55:15] INFO (nni_syslog_runner_runner_l1U4d/MainThread) [2021-08-16 07:55:15.588942] INFO Received command, header: [b'KI00000000000007'], data: [LE3i6]
[2021-08-16 07:55:15.589509] INFO LE3i6: killing trial
[2021-08-16 07:55:15] INFO (nni_syslog_runner_runner_l1U4d/MainThread) [2021-08-16 07:55:15.589509] INFO LE3i6: killing trial
[2021-08-16 07:55:15.607172] INFO LE3i6: clean up trial
[2021-08-16 07:55:15] INFO (nni_syslog_runner_runner_l1U4d/MainThread) [2021-08-16 07:55:15.607172] INFO LE3i6: clean up trial
[2021-08-16 08:05:17.187168] INFO trial runner is idle more than 600 seconds, so exit.
[2021-08-16 08:05:17] INFO (nni_syslog_runner_runner_l1U4d/MainThread) [2021-08-16 08:05:17.187168] INFO trial runner is idle more than 600 seconds, so exit.
[2021-08-16 08:05:17.187896] INFO main_loop exits.
[2021-08-16 08:05:17] INFO (nni_syslog_runner_runner_l1U4d/MainThread) [2021-08-16 08:05:17.187896] INFO main_loop exits.
/opt/conda/lib/python3.8/site-packages/nni/tools/trial_tool/web_channel.py:39: RuntimeWarning: coroutine 'WebSocketCommonProtocol.close' was never awaited
[2021-08-16 08:05:18] INFO (nni_syslog_runner_runner_l1U4d/MainThread) /opt/conda/lib/python3.8/site-packages/nni/tools/trial_tool/web_channel.py:39: RuntimeWarning: coroutine 'WebSocketCommonProtocol.close' was never awaited
  self.client.close()
[2021-08-16 08:05:18] INFO (nni_syslog_runner_runner_l1U4d/MainThread)   self.client.close()
RuntimeWarning: Enable tracemalloc to get the object allocation traceback

QuanluZhang commented 2 years ago

@ZhiyuanChen Could you try the latest NNI version, i.e., v2.4?

ZhiyuanChen commented 2 years ago

@QuanluZhang Ta for the information. I just found that some machines were running 2.4 while others were running 2.3. I have upgraded all machines and made sure they are all running 2.4. I'll let you know whether it works.

ZhiyuanChen commented 2 years ago

This does not seem to be limited to remote mode. I just tried it in local mode, and it keeps running after the final result has been received.

QuanluZhang commented 2 years ago

Yes, the experiment keeps running until the number of submitted trials exceeds maxTrialNumber or the experiment duration exceeds maxExperimentDuration.

ZhiyuanChen commented 2 years ago

> Yes, the experiment keeps running until the number of submitted trials exceeds maxTrialNumber or the experiment duration exceeds maxExperimentDuration.

What I mean is that the trial keeps running after the final result has been received (which suggests it has finished and a new trial should start).

QuanluZhang commented 2 years ago

> > Yes, the experiment keeps running until the number of submitted trials exceeds maxTrialNumber or the experiment duration exceeds maxExperimentDuration.
>
> What I mean is that the trial keeps running after the final result has been received (which suggests it has finished and a new trial should start).

Could you check whether the trial process is still there, or whether the trial process has finished but the web UI still shows it as running? If it is the former, the problem is most likely in your trial code: your trial is blocked. If it is the latter, then it is a bug in NNI.
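
One way to tell the two cases apart is to test whether the trial's process ID still exists on the machine that ran it. The following is a minimal sketch, not part of NNI: `process_is_alive` is a hypothetical helper, it assumes a POSIX system (the logs above show a Linux path), and it assumes you have already found the trial's PID yourself, for example with `ps`.

```python
import os


def process_is_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only).

    Hypothetical helper for illustration; NNI does not provide it.
    Note: a zombie (exited but not yet reaped) process still counts as existing.
    """
    try:
        # Signal 0 sends nothing; the kernel only checks that the PID exists
        # and that we are allowed to signal it.
        os.kill(pid, 0)
    except ProcessLookupError:
        # No such process: the trial has really exited.
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True
```

If this returns True for the trial's PID long after the final result was reported, the trial process itself has not exited; if it returns False while the web UI still shows the trial as running, the state shown by NNI is stale.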

scarlett2018 commented 2 years ago

@ZhiyuanChen - have you had a chance to try this out? Is the problem still occurring on your side?
