microsoft / nni

An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression, and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License
14.03k stars 1.81k forks

NNI is infinitely waiting when running in remote mode #4072

Open ZhiyuanChen opened 3 years ago

ZhiyuanChen commented 3 years ago

Discussed in https://github.com/microsoft/nni/discussions/4070

Originally posted by **ZhiyuanChen** August 14, 2021

```
[2021-08-14 10:13:41] INFO (NNIDataStore) Datastore initialization done
[2021-08-14 10:13:41] INFO (RestServer) RestServer start
[2021-08-14 10:13:41] INFO (RestServer) RestServer base port is 8080
[2021-08-14 10:13:41] INFO (main) Rest server listening on: http://0.0.0.0:8080
[2021-08-14 10:13:42] INFO (NNIManager) Starting experiment: VBgChK3z
[2021-08-14 10:13:42] INFO (NNIManager) Setup training service...
[2021-08-14 10:13:42] INFO (TrialDispatcher) TrialDispatcher: GPU scheduler is enabled.
[2021-08-14 10:13:42] INFO (RemoteEnvironmentService) connecting to machine1
[2021-08-14 10:13:42] INFO (RemoteEnvironmentService) connecting to machine2
[2021-08-14 10:13:42] INFO (NNIManager) Setup tuner...
[2021-08-14 10:13:42] INFO (NNIManager) Change NNIManager status from: INITIALIZED to: RUNNING
[2021-08-14 10:13:42] INFO (NNIManager) Add event listeners
[2021-08-14 10:13:43] INFO (NNIManager) NNIManager received command from dispatcher: ID,
[2021-08-14 10:13:43] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"lr": 0.004360754539476665}, "parameter_index": 0}
[2021-08-14 10:13:43] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 1, "parameter_source": "algorithm", "parameters": {"lr": 0.0035240674041291577}, "parameter_index": 0}
[2021-08-14 10:13:44] INFO (RemoteEnvironmentService) ssh connection initialized!
[2021-08-14 10:13:44] INFO (TrialDispatcher) TrialDispatcher: started channel: WebCommandChannel
[2021-08-14 10:13:44] INFO (TrialDispatcher) TrialDispatcher: copying code and settings.
[2021-08-14 10:13:44] INFO (TrialDispatcher) Initialize environments total number: 2
[2021-08-14 10:13:44] INFO (TrialDispatcher) Assign environment service remote to environment fzYKh
[2021-08-14 10:13:45] INFO (TrialDispatcher) requested environment fzYKh and job id is nni_exp_VBgChK3z_env_fzYKh.
[2021-08-14 10:13:45] INFO (TrialDispatcher) Assign environment service remote to environment D6h8H
[2021-08-14 10:13:46] INFO (TrialDispatcher) requested environment D6h8H and job id is nni_exp_VBgChK3z_env_D6h8H.
[2021-08-14 10:13:46] INFO (TrialDispatcher) TrialDispatcher: run loop started.
[2021-08-14 10:13:47] INFO (NNIManager) submitTrialJob: form: { sequenceId: 0, hyperParameters: { value: '{"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"lr": 0.004360754539476665}, "parameter_index": 0}', index: 0 }, placementConstraint: { type: 'None', gpus: [] } }
[2021-08-14 10:13:47] INFO (NNIManager) submitTrialJob: form: { sequenceId: 1, hyperParameters: { value: '{"parameter_id": 1, "parameter_source": "algorithm", "parameters": {"lr": 0.0035240674041291577}, "parameter_index": 0}', index: 0 }, placementConstraint: { type: 'None', gpus: [] } }
```

ZhiyuanChen commented 3 years ago

After digging into the log on the remote machine, I found that this happens because the port on the master machine is not open. Consider raising an error in such a case.

ZhiyuanChen commented 3 years ago

More precisely, this happens because NNI requires an additional port (8081 in my case) for WebSocket communication.

QuanluZhang commented 3 years ago

Yes, for training services other than local, one more port is needed.

ZhiyuanChen commented 3 years ago

Maybe check whether the port is open first and raise a SystemError if it is not?
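
A minimal sketch of such a pre-flight check, assuming it runs on the remote machine and tries to open a TCP connection back to the master. This is not NNI code: the function name `check_port_reachable` and the choice of `ConnectionError` are illustrative assumptions, and it uses only the Python standard library.

```python
import socket


def check_port_reachable(host: str, port: int, timeout: float = 3.0) -> None:
    """Raise ConnectionError if a TCP connection to host:port cannot be opened.

    Hypothetical helper for illustration only; it is not part of NNI.
    """
    try:
        # create_connection resolves the host and attempts a TCP handshake,
        # giving up after `timeout` seconds.
        with socket.create_connection((host, port), timeout=timeout):
            pass  # The connection succeeded, so the port is reachable.
    except OSError as exc:
        # Refused connections, timeouts, and DNS failures are all OSError subclasses.
        raise ConnectionError(
            f"Cannot reach {host}:{port} ({exc}); "
            "check that the port is open on the master machine."
        ) from exc
```

In my setup this would be called as something like `check_port_reachable("<master address>", 8081)` before the trial runner starts waiting, so a blocked port fails fast instead of hanging forever.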

QuanluZhang commented 3 years ago

Good suggestion. We will consider this feature, and your contribution is also welcome if possible.

scarlett2018 commented 3 years ago

Thanks @ZhiyuanChen for using NNI and engaging with us on finding the root cause. You are highly encouraged to contribute a fix for the issue, since you have already found the root cause and are just one step away. Looking forward to your first PR to NNI =D.