microsoft / nni

An open-source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression, and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License

All trials are failing #1367

Closed anandhperumal closed 5 years ago

anandhperumal commented 5 years ago

nni Environment:

I'm just getting started with NNI and following the documentation step by step, but all my trials are failing.

The exact command I ran and its output:

C:\Users\User>nnictl create --config nni\examples\trials\mnist\config_windows.yml --port 8082
INFO: expand searchSpacePath: search_space.json to C:\Users\User\nni\examples\trials\mnist\search_space.json
INFO: expand codeDir: . to C:\Users\User\nni\examples\trials\mnist\.
INFO: Starting restful server...
INFO: Successfully started Restful server!
INFO: Setting local config...
INFO: Successfully set local config!
INFO: Starting experiment...
INFO: Successfully started experiment!
-----------------------------------------------------------------------
The experiment id is JAqxevyD
The Web UI urls are: http://169.254.170.118:8082   http://192.168.56.1:8082   http://169.254.221.8:8082   http://169.254.68.111:8082   http://10.0.0.215:8082   http://127.0.0.1:8082
-----------------------------------------------------------------------

You can use these commands to get more information about the experiment
-----------------------------------------------------------------------
         commands                       description
1. nnictl experiment show        show the information of experiments
2. nnictl trial ls               list all of trial jobs
3. nnictl top                    monitor the status of running experiments
4. nnictl log stderr             show stderr log content
5. nnictl log stdout             show stdout log content
6. nnictl stop                   stop an experiment
7. nnictl trial kill             kill a trial job by id
8. nnictl --help                 get help information about nnictl
-----------------------------------------------------------------------

I have attached a screenshot of the Web UI, which shows that the trials have failed.

Moreover, I keep getting a pop-up:

----------------------------------------------------------------
#!/usr/bin/env bash
# Copyright (C) 2014, Alexey Pavlov
#   mailto:alexpux@gmail.com
# This file is part of Minimal SYStem version 2.
#   https://sourceforge.net/p/msys2/wiki/MSYS2%20installation/
# File: cmd

"$COMSPEC" "$@"
--------------------------------------------------------------------

How can I disable it?

Also, where can I see the logs, and why are my trials failing? Please let me know if I'm missing anything. Any leads will be appreciated.

Thanks

Grzechu11 commented 5 years ago

nni_stderr_empty

Grzechu11 commented 5 years ago

OK, I found the problem.

I changed run.ps1 from:

cd D:\Projekty\ML_5DWChallenge\vel5\day2
$env:NNI_PLATFORM="local"
$env:NNI_EXP_ID="f6l0s17p"
$env:NNI_SYS_DIR="C:\Users\gdworak\nni\experiments\f6l0s17p\trials\o5Kpp"
$env:NNI_TRIAL_JOB_ID="o5Kpp"
$env:NNI_OUTPUT_DIR="C:\Users\gdworak\nni\experiments\f6l0s17p\trials\o5Kpp"
$env:NNI_TRIAL_SEQ_ID="3"
$env:MULTI_PHASE="false"
$env:CUDA_VISIBLE_DEVICES="-1"
cmd /c python D:\Projekty\ML_5DWChallenge\vel5\day2\nni_day2.py 2>C:\Users\gdworak\nni\experiments\f6l0s17p\trials\o5Kpp\stderr
$NOW_DATE = [int64](([datetime]::UtcNow)-(get-date "1/1/1970")).TotalSeconds
$NOW_DATE = "$NOW_DATE" + (Get-Date -Format fff).ToString()
Write $LASTEXITCODE " " $NOW_DATE  | Out-File C:\Users\gdworak\nni\experiments\f6l0s17p\trials\o5Kpp\.nni\state -NoNewline -encoding utf8

to:

cd D:\Projekty\ML_5DWChallenge\vel5\day2
$env:NNI_PLATFORM="local"
$env:NNI_EXP_ID="f6l0s17p"
$env:NNI_SYS_DIR="C:\Users\gdworak\nni\experiments\f6l0s17p\trials\o5Kpp"
$env:NNI_TRIAL_JOB_ID="o5Kpp"
$env:NNI_OUTPUT_DIR="C:\Users\gdworak\nni\experiments\f6l0s17p\trials\o5Kpp"
$env:NNI_TRIAL_SEQ_ID="3"
$env:MULTI_PHASE="false"
$env:CUDA_VISIBLE_DEVICES="-1"
cmd.exe /c python D:\Projekty\ML_5DWChallenge\vel5\day2\nni_day2.py 2>C:\Users\gdworak\nni\experiments\f6l0s17p\trials\o5Kpp\stderr
$NOW_DATE = [int64](([datetime]::UtcNow)-(get-date "1/1/1970")).TotalSeconds
$NOW_DATE = "$NOW_DATE" + (Get-Date -Format fff).ToString()
Write $LASTEXITCODE " " $NOW_DATE  | Out-File C:\Users\gdworak\nni\experiments\f6l0s17p\trials\o5Kpp\.nni\state -NoNewline -encoding utf8

I don't know why my PowerShell doesn't recognize the cmd command; I have to use cmd.exe instead.

Does anyone have an idea how to fix it?

ultmaster commented 5 years ago

Have you ever installed MSYS2? It might be causing a conflict on cmd.

Grzechu11 commented 5 years ago

No, I never installed MSYS2

ultmaster commented 5 years ago

@Grzechu11 It's also possible that there is a file named cmd somewhere under your system path. Please check your environment variables or do a global search for cmd.
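The check suggested above can be sketched in a few lines of Python. This is only an illustration, not part of NNI: the helper name find_command_candidates is hypothetical, and it simply lists every file on the PATH whose base name is cmd, whatever its extension. A hit without an executable extension (such as a bare cmd shell script) is a likely candidate for shadowing cmd.exe.

```python
import os
from pathlib import Path


def find_command_candidates(name, path_value=None):
    """Return every file on the search path whose stem matches `name`.

    `path_value` defaults to the PATH environment variable; passing it
    explicitly makes the function easy to try on a custom directory list.
    """
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    matches = []
    for entry in path_value.split(os.pathsep):
        if not entry:
            continue  # skip empty PATH entries
        directory = Path(entry)
        try:
            if not directory.is_dir():
                continue
            for candidate in directory.iterdir():
                # Compare case-insensitively, as Windows does.
                if candidate.is_file() and candidate.stem.lower() == name.lower():
                    matches.append(candidate)
        except OSError:
            continue  # unreadable directory: ignore and keep scanning
    return matches


if __name__ == "__main__":
    for hit in find_command_candidates("cmd"):
        print(hit)
```

Running the script prints one path per match, in PATH order. Anything listed before the Windows system directory's cmd.exe deserves a closer look.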

Grzechu11 commented 5 years ago

I checked the command path: Anaconda has a cmd file.

image

image

I renamed the cmd file, and now everything works correctly.

Thanks for the help.

xinshouke commented 5 years ago

Hello all, I have also run into the same problem...

ultmaster commented 5 years ago

@xinshouke This problem will be fixed in future releases. For now, please change the name of your cmd file or use platforms other than Windows.

xinshouke commented 5 years ago

@ultmaster I don't have Anaconda; I just installed TensorFlow through pip... Since the problem has been fixed, can I resolve it by upgrading NNI? Would running 'python -m pip install --upgrade nni' fix this problem?

ultmaster commented 5 years ago

@xinshouke Not before NNI 1.1 is released. For now, you can install NNI from source for testing by following the instructions in the README section "Install through source code".

chaos1992 commented 4 years ago

Hello all, I am also getting the same problem on a Linux system. I checked my logs but cannot find any obvious error. Do you have any idea? Please give me some advice.

ultmaster commented 4 years ago

@chaos0625 Please elaborate, including nnimanager.log, dispatcher.log, trial logs, stderr output, and your system configuration.
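For the trial stderr part of that request, a small script can gather the files in one go. This is a rough sketch that assumes the directory layout visible in the run.ps1 paths earlier in this thread (an experiment directory containing trials/TRIAL_ID/stderr); the helper name collect_trial_stderr is hypothetical, and other NNI versions or platforms may place files differently.

```python
from pathlib import Path


def collect_trial_stderr(experiment_dir):
    """Map each trial ID to the text of its stderr file, if one exists."""
    trials_dir = Path(experiment_dir) / "trials"
    output = {}
    if not trials_dir.is_dir():
        return output
    for trial_dir in sorted(trials_dir.iterdir()):
        stderr_file = trial_dir / "stderr"
        if stderr_file.is_file():
            # errors="replace" keeps the dump going if a log has odd bytes.
            output[trial_dir.name] = stderr_file.read_text(
                encoding="utf-8", errors="replace"
            )
    return output


if __name__ == "__main__":
    # Assumed location: <home>/nni/experiments/<experiment id>, as seen in
    # the run.ps1 paths in this thread. Substitute your own experiment ID.
    experiment = Path.home() / "nni" / "experiments" / "JAqxevyD"
    for trial_id, text in collect_trial_stderr(experiment).items():
        print(f"===== {trial_id} =====")
        print(text if text.strip() else "(stderr is empty)")
```

An empty stderr for every trial, as in the nni_stderr_empty screenshot above, would suggest the trial command never started rather than crashed.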

chaos1992 commented 4 years ago

@ultmaster Thanks for your reply!

System configuration:
python: 3.6
tensorflow: 1.14.0
nni: 1.1
system: Ubuntu

nnimanager.log:

[10/25/2019, 3:47:00 PM] INFO [ 'Datastore initialization done' ]
[10/25/2019, 3:47:00 PM] INFO [ 'Rest server listening on: http://0.0.0.0:8080' ]
[10/25/2019, 3:47:00 PM] INFO [ 'RestServer start' ]
[10/25/2019, 3:47:00 PM] INFO [ 'Construct local machine training service.' ]
[10/25/2019, 3:47:00 PM] INFO [ 'RestServer base port is 8080' ]
[10/25/2019, 3:47:02 PM] INFO [ 'NNIManager setClusterMetadata, key: trial_config, value: {"command":"python3 mnist.py","codeDir":"/home/gaochao/program/nni-master/examples/trials/mnist/.","gpuNum":0}' ]
[10/25/2019, 3:47:02 PM] INFO [ 'required GPU number is 0' ]
[10/25/2019, 3:47:02 PM] INFO [ 'Starting experiment: wnc4Z2cq' ]
[10/25/2019, 3:47:02 PM] INFO [ 'Change NNIManager status from: INITIALIZED to: RUNNING' ]
[10/25/2019, 3:47:02 PM] INFO [ 'Add event listeners' ]
[10/25/2019, 3:47:02 PM] INFO [ 'Run local machine training service.' ]
[10/25/2019, 3:47:03 PM] INFO [ 'NNIManager received command from dispatcher: ID, ' ]
[10/25/2019, 3:47:03 PM] INFO [ 'NNIManager received command from dispatcher: TR, {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"dropout_rate": 0.8576007705118804, "conv_size": 2, "hidden_size": 512, "batch_size": 8, "learning_rate": 0.1}, "parameter_index": 0}' ]
[10/25/2019, 3:47:07 PM] INFO [ 'submitTrialJob: form: {"sequenceId":0,"hyperParameters":{"value":"{\"parameter_id\": 0, \"parameter_source\": \"algorithm\", \"parameters\": {\"dropout_rate\": 0.8576007705118804, \"conv_size\": 2, \"hidden_size\": 512, \"batch_size\": 8, \"learning_rate\": 0.1}, \"parameter_index\": 0}","index":0}}' ]
[10/25/2019, 3:47:17 PM] INFO [ 'Trial job OYhUj status changed from WAITING to RUNNING' ]
[10/25/2019, 3:49:28 PM] INFO [ 'Trial job OYhUj status changed from RUNNING to FAILED' ]
[10/25/2019, 3:49:28 PM] INFO [ 'NNIManager received command from dispatcher: TR, {"parameter_id": 1, "parameter_source": "algorithm", "parameters": {"dropout_rate": 0.5817909748038591, "conv_size": 5, "hidden_size": 512, "batch_size": 4, "learning_rate": 0.0001}, "parameter_index": 0}' ]
[10/25/2019, 3:49:33 PM] INFO [ 'submitTrialJob: form: {"sequenceId":1,"hyperParameters":{"value":"{\"parameter_id\": 1, \"parameter_source\": \"algorithm\", \"parameters\": {\"dropout_rate\": 0.5817909748038591, \"conv_size\": 5, \"hidden_size\": 512, \"batch_size\": 4, \"learning_rate\": 0.0001}, \"parameter_index\": 0}","index":0}}' ]
[10/25/2019, 3:49:38 PM] INFO [ 'Trial job lsGqg status changed from WAITING to RUNNING' ]

dispatcher.log:

[10/25/2019, 03:47:03 PM] INFO (nni.msg_dispatcher_base/MainThread) Start dispatcher
[10/25/2019, 03:47:03 PM] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.002083 seconds
[10/25/2019, 03:47:03 PM] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
[10/25/2019, 03:49:28 PM] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.004800 seconds
[10/25/2019, 03:49:28 PM] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials
[10/25/2019, 03:51:53 PM] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.002802 seconds
[10/25/2019, 03:51:53 PM] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials

trial logs:

[10/25/2019, 03:52:06 PM] WARNING (tensorflow/MainThread) From mnist.py:151: read_data_sets (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version. Instructions for updating: Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
[10/25/2019, 03:52:06 PM] WARNING (tensorflow/MainThread) From /home/gaochao/anaconda3/envs/gaochao/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:260: maybe_download (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version. Instructions for updating: Please write your own downloading logic.
[10/25/2019, 03:52:06 PM] WARNING (tensorflow/MainThread) From /home/gaochao/anaconda3/envs/gaochao/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:252: _internal_retry.<locals>.wrap.<locals>.wrapped_fn (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version. Instructions for updating: Please use urllib or similar directly.

stderr:

/home/gaochao/anaconda3/envs/gaochao/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/gaochao/anaconda3/envs/gaochao/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/gaochao/anaconda3/envs/gaochao/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/gaochao/anaconda3/envs/gaochao/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/gaochao/anaconda3/envs/gaochao/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/gaochao/anaconda3/envs/gaochao/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
/home/gaochao/anaconda3/envs/gaochao/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/gaochao/anaconda3/envs/gaochao/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/gaochao/anaconda3/envs/gaochao/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/gaochao/anaconda3/envs/gaochao/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/gaochao/anaconda3/envs/gaochao/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/gaochao/anaconda3/envs/gaochao/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
WARNING:tensorflow:From mnist.py:151: read_data_sets (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version. Instructions for updating: Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
WARNING:tensorflow:From /home/gaochao/anaconda3/envs/gaochao/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:260: maybe_download (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version. Instructions for updating: Please write your own downloading logic.
WARNING:tensorflow:From /home/gaochao/anaconda3/envs/gaochao/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:252: _internal_retry.<locals>.wrap.<locals>.wrapped_fn (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version. Instructions for updating: Please use urllib or similar directly.

I got the above logs just now, and the trial failed. Sometimes there is also a network error in the log, for example: OSError: [Errno 101] Network is unreachable

ultmaster commented 4 years ago

@chaos0625 I'm pretty sure this is a new issue. Please open a new one if you need additional help.

Meanwhile, please check whether you can run mnist.py on its own successfully, and try again after downgrading TensorFlow to 1.12.0.

chaos1992 commented 4 years ago

@ultmaster I have opened a new issue for help. I ran mnist.py on its own, but I got an error:

Traceback (most recent call last):
  File "mnist.py", line 234, in <module>
    params.update(tuner_params)
TypeError: 'NoneType' object is not iterable
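The TypeError above is what dict.update raises when it is handed None, so tuner_params was presumably None in this standalone run. One plausible reason, which the thread does not confirm, is that no tuner supplies parameters when the script runs outside an NNI experiment. A minimal sketch of a guard follows; the helper name merge_tuner_params and the default values are hypothetical, and the parameter names only mirror the search space seen in the nnimanager.log above.

```python
def merge_tuner_params(defaults, tuner_params):
    """Return a copy of `defaults` overridden by `tuner_params`.

    `tuner_params` may be None (for example in a standalone run), in
    which case the defaults are returned unchanged.
    """
    merged = dict(defaults)  # copy so the caller's dict is not mutated
    if tuner_params is not None:
        merged.update(tuner_params)
    return merged


if __name__ == "__main__":
    # Hypothetical defaults for illustration only.
    defaults = {
        "dropout_rate": 0.5,
        "conv_size": 5,
        "hidden_size": 1024,
        "batch_size": 32,
        "learning_rate": 1e-4,
    }
    print(merge_tuner_params(defaults, None))               # standalone run
    print(merge_tuner_params(defaults, {"batch_size": 8}))  # tuner override
```

With a guard like this the script falls back to its defaults when run alone, which makes it easier to tell a bug in the trial code apart from a problem in the NNI setup.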

I'll downgrade my tensorflow and try to run "nnictl create --config nni-master/examples/trials/mnist/config.yml" again