Closed diggerdu closed 4 years ago
Typically TMP_NO_AVAILABLE_GPU is caused by that number of required GPUs > number of available GPUs. Could you please paste the config yaml file? let's check the configuration firstly. Please also paste full nnimanager.log if you can ensure the configuration is right.
For debug suggestion, While you search source code for TMP_NO_AVAILABLE_GPU , you can see that TMP_NO_AVAILABLE_GPU is only used in https://github.com/microsoft/nni/blob/master/src/nni_manager/training_service/remote_machine/gpuScheduler.ts You can check/debug the scheduleMachine method to find the root cause.
Typically TMP_NO_AVAILABLE_GPU is caused by that number of required GPUs > number of available GPUs. Could you please paste the config yaml file? let's check the configuration firstly.
authorName: sinodragon experimentName: leaf_s3 trialConcurrency: 1 maxExecDuration: 1000h maxTrialNum: 1000 trainingServicePlatform: remote debug: true # The path to Search Space searchSpacePath: search_space.json useAnnotation: false tuner: builtinTunerName: TPE classArgs: optimize_mode: minimize # The path and the running command of trial trial: command: zsh dist_train.sh 1 experiments/leaf_s3 enable_nni codeDir: ../../ gpuNum: 1
machineList:
GPU Stats
➜ leaf_s3 git:(master) ✗ nvidia-smi
Mon Dec 16 13:16:07 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: N/A |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... On | 00000000:24:00.0 Off | 0 |
| N/A 56C P0 46W / 250W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... On | 00000000:26:00.0 Off | 0 |
| N/A 33C P0 25W / 250W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Please also paste full nnimanager.log if you can ensure the configuration is right.
[12/15/2019, 2:58:11 PM] INFO [ 'Starting experiment: OS9lH9lq' ] [12/15/2019, 2:58:11 PM] DEBUG [ 'Setup tuner...' ] [12/15/2019, 2:58:11 PM] DEBUG [ 'dispatcher command: python3 -m nni --tuner_class_name TPE --tuner_args "{\\"optimize_mode\\":\\"minimize\\"}"' ] [12/15/2019, 2:58:11 PM] INFO [ 'Change NNIManager status from: INITIALIZED to: RUNNING' ] [12/15/2019, 2:58:11 PM] INFO [ 'Add event listeners' ] [12/15/2019, 2:58:11 PM] DEBUG [ 'Send tuner command: INITIALIZE: {"optimizer": {"_type": "choice", "_value": ["torchadam", "adam_lrd?dropout=0.5"]}, "lr": {"_type": "choice", "_value": [0.001, 0.0001]}, "weight_decay": {"_type": "choice", "_value": [1e-05, 0]}, "max_bin": {"_type": "choice", "_value": [1487]}, "nb_layers": {"_type": "choice", "_value": [4]}, "hidden_size": {"_type": "choice", "_value": [1024, 2048]}, "enable_std": {"_type": "choice", "_value": ["False"]}, "enable_dfsmn_ln": {"_type": "choice", "_value": ["True", "False"]}, "hop_length": {"_type": "choice", "_value": [1024, 2048]}, "act_layer": {"_type": "choice", "_value": ["nn.LeakyReLU(0.2)", "MishCuda()"]}, "norm_layer": {"_type": "choice", "_value": ["none", "layer_norm"]}, "dropout_rate": {"_type": "uniform", "_value": [0, 0.5]}, "dilation": {"_type": "choice", "_value": [1]}, "lookup_len": {"_type": "choice", "_value": [20, 40]}}' ] [12/15/2019, 2:58:11 PM] DEBUG [ 'ipcInterface command type: [IN], content:[{"optimizer": {"_type": "choice", "_value": ["torchadam", "adam_lrd?dropout=0.5"]}, "lr": {"_type": "choice", "_value": [0.001, 0.0001]}, "weight_decay": {"_type": "choice", "_value": [1e-05, 0]}, "max_bin": {"_type": "choice", "_value": [1487]}, "nb_layers": {"_type": "choice", "_value": [4]}, "hidden_size": {"_type": "choice", "_value": [1024, 2048]}, "enable_std": {"_type": "choice", "_value": ["False"]}, "enable_dfsmn_ln": {"_type": "choice", "_value": ["True", "False"]}, "hop_length": {"_type": "choice", "_value": [1024, 2048]}, "act_layer": {"_type": "choice", "_value": ["nn.LeakyReLU(0.2)", "MishCuda()"]}, "norm_layer": {"_type": "choice", "_value": ["none", "layer_norm"]}, "dropout_rate": {"_type": "uniform", "_value": [0, 0.5]}, "dilation": {"_type": "choice", "_value": [1]}, "lookup_len": {"_type": "choice", "_value": [20, 40]}}]' ] [12/15/2019, 2:58:11 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ] [12/15/2019, 2:58:11 PM] INFO [ 'RestServer start' ] [12/15/2019, 2:58:11 PM] INFO [ 'RestServer base port is 10162' ] [12/15/2019, 2:58:11 PM] INFO [ 'Run remote machine training service.' ] [12/15/2019, 2:58:11 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 2:58:12 PM] INFO [ 'NNIManager received command from dispatcher: ID, ' ] [12/15/2019, 2:58:12 PM] DEBUG [ 'ipcInterface command type: [GE], content:[1]' ] [12/15/2019, 2:58:12 PM] INFO [ 'NNIManager received command from dispatcher: TR, {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"optimizer": "torchadam", "lr": 0.0001, "weight_decay": 1e-05, "max_bin": 1487, "nb_layers": 4, "hidden_size": 2048, "enable_std": "False", "enable_dfsmn_ln": "True", "hop_length": 2048, "act_layer": "nn.LeakyReLU(0.2)", "norm_layer": "layer_norm", "dropout_rate": 0.3695983058355199, "dilation": 1, "lookup_len": 20}, "parameter_index": 0}' ] [12/15/2019, 2:58:12 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 2:58:13 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 2:58:14 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 2:58:15 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 2:58:16 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ] [12/15/2019, 2:58:16 PM] INFO [ 'submitTrialJob: form: {"sequenceId":0,"hyperParameters":{"value":"{\\"parameter_id\\": 0, \\"parameter_source\\": \\"algorithm\\", \\"parameters\\": {\\"optimizer\\": \\"torchadam\\", \\"lr\\": 0.0001, \\"weight_decay\\": 1e-05, \\"max_bin\\": 1487, \\"nb_layers\\": 4, \\"hidden_size\\": 2048, \\"enable_std\\": \\"False\\", \\"enable_dfsmn_ln\\": \\"True\\", \\"hop_length\\": 2048, \\"act_layer\\": \\"nn.LeakyReLU(0.2)\\", \\"norm_layer\\": \\"layer_norm\\", \\"dropout_rate\\": 0.3695983058355199, \\"dilation\\": 1, \\"lookup_len\\": 20}, \\"parameter_index\\": 0}","index":0}}' ] [12/15/2019, 2:58:16 PM] DEBUG [ 'storeTrialJobEvent: event: WAITING, data: {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"optimizer": "torchadam", "lr": 0.0001, "weight_decay": 1e-05, "max_bin": 1487, "nb_layers": 4, "hidden_size": 2048, "enable_std": "False", "enable_dfsmn_ln": "True", "hop_length": 2048, "act_layer": "nn.LeakyReLU(0.2)", "norm_layer": "layer_norm", "dropout_rate": 0.3695983058355199, "dilation": 1, "lookup_len": 20}, "parameter_index": 0}, jobDetail: {"id":"aHsb8","status":"WAITING","submitTime":1576393096141,"workingDirectory":"/tmp/nni/experiments/OS9lH9lq/trials/aHsb8","form":{"sequenceId":0,"hyperParameters":{"value":"{\\"parameter_id\\": 0, \\"parameter_source\\": \\"algorithm\\", \\"parameters\\": {\\"optimizer\\": \\"torchadam\\", \\"lr\\": 0.0001, \\"weight_decay\\": 1e-05, \\"max_bin\\": 1487, \\"nb_layers\\": 4, \\"hidden_size\\": 2048, \\"enable_std\\": \\"False\\", \\"enable_dfsmn_ln\\": \\"True\\", \\"hop_length\\": 2048, \\"act_layer\\": \\"nn.LeakyReLU(0.2)\\", \\"norm_layer\\": \\"layer_norm\\", \\"dropout_rate\\": 0.3695983058355199, \\"dilation\\": 1, \\"lookup_len\\": 20}, \\"parameter_index\\": 0}","index":0}},"tags":[],"gpuIndices":[]}' ] [12/15/2019, 2:58:16 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 2:58:17 PM] WARNING [ 'Scheduler: trialJob id aHsb8, no machine can be scheduled, return TMP_NO_AVAILABLE_GPU ' ] [12/15/2019, 2:58:17 PM] INFO [ 'Right now no available GPU can be allocated for trial aHsb8, will try to schedule later' ] [12/15/2019, 2:58:17 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 2:58:18 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 2:58:19 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 2:58:20 PM] WARNING [ 'Scheduler: trialJob id aHsb8, no machine can be scheduled, return TMP_NO_AVAILABLE_GPU ' ] [12/15/2019, 2:58:20 PM] INFO [ 'Right now no available GPU can be allocated for trial aHsb8, will try to schedule later' ]
#### After I execute ```bash -c echo $$ > /tmp/root/nni/scripts/pid ; METRIC_OUTPUT_DIR=/tmp/root/nni/scripts python3 -m nni_gpu_tool.gpu_metrics_collector``` manually
[12/15/2019, 3:14:15 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 3:14:16 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 3:14:16 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ] [12/15/2019, 3:14:17 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 3:14:18 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 3:14:19 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 3:14:20 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 3:14:21 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 3:14:21 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ] [12/15/2019, 3:14:22 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 3:14:23 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 3:14:24 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 3:14:24 PM] DEBUG [ 'GET: /experiment: body:\n{}' ] [12/15/2019, 3:14:24 PM] DEBUG [ 'GET: /check-status: body:\n{}' ] [12/15/2019, 3:14:24 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ] [12/15/2019, 3:14:24 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ] [12/15/2019, 3:14:24 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ] [12/15/2019, 3:14:24 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ] [12/15/2019, 3:14:25 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 3:14:26 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 3:14:26 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ] [12/15/2019, 3:14:27 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 3:14:28 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 3:14:29 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 3:14:30 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 3:14:31 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 3:14:31 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ] [12/15/2019, 3:14:32 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 3:14:33 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 3:14:34 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 3:14:34 PM] DEBUG [ 'GET: /check-status: body:\n{}' ] [12/15/2019, 3:14:34 PM] DEBUG [ 'GET: /experiment: body:\n{}' ] [12/15/2019, 3:14:34 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ] [12/15/2019, 3:14:34 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ] [12/15/2019, 3:14:34 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ] [12/15/2019, 3:14:34 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ] [12/15/2019, 3:14:35 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 3:14:36 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 3:14:36 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ] [12/15/2019, 3:14:37 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 3:14:38 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 3:14:39 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 3:14:40 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 3:14:41 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ] [12/15/2019, 3:14:41 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
This is a bug when local system is used as remote training service.
PR https://github.com/microsoft/nni/pull/1856 is submitted to fix this. You can use local training service as a temporary workaround.
If you would like to use local system as remote machine, you can install nni from master branch once the PR is merged.
This is a bug when local system is used as remote training service. PR #1856 is submitted to fix this. You can use local training service as a temporary workaround. If you would like to use local system as remote machine, you can install nni from master branch once the PR is merged.
But the bug is still exits after I using a remote machine.
The /tmp/root/nni/scripts/gpu_metrics
does not exits in the remote machine
n108-114-238# pwd
/tmp/root/nni/scripts
n108-114-238# ls
pid
Can you paste nnimanager.log again for the real remote machine? And the output of command nnictl log stderr
Can you paste nnimanager.log again for the real remote machine? And the output of command
nnictl log stderr
[12/16/2019, 6:50:25 PM] DEBUG [ 'Database directory: /root/nni/experiments/ACQgoetz/db' ]
[12/16/2019, 6:50:27 PM] INFO [ 'Datastore initialization done' ]
[12/16/2019, 6:50:27 PM] INFO [ 'RestServer start' ]
[12/16/2019, 6:50:27 PM] INFO [ 'Construct remote machine training service.' ]
[12/16/2019, 6:50:27 PM] INFO [ 'RestServer base port is 10161' ]
[12/16/2019, 6:50:27 PM] INFO [ 'Rest server listening on: http://0.0.0.0:10161' ]
[12/16/2019, 6:50:30 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:50:30 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:50:31 PM] DEBUG [ 'PUT: /experiment/cluster-metadata: body:\n{\n "machine_list": [\n {\n "ip": "10.108.238.238",\n "username": "root",\n "passwd": "qwe",\n "port": 123\n }\n ]\n}' ]
[12/16/2019, 6:50:31 PM] INFO [ 'NNIManager setClusterMetadata, key: machine_list, value: [{"ip":"10.108.238.238","username":"root","passwd":"qwe","port":123}]' ]
[12/16/2019, 6:50:31 PM] DEBUG [ 'Connecting to remote machines: [{"ip":"10.108.238.238","username":"root","passwd":"qwe","port":123}]' ]
[12/16/2019, 6:50:31 PM] DEBUG [ 'remoteExeCommand: command: [mkdir -p /tmp/nni/experiments/ACQgoetz]' ]
[12/16/2019, 6:50:31 PM] DEBUG [ 'remoteExeCommand: command: [(umask 0 ; mkdir -p /tmp/root/nni/scripts)]' ]
[12/16/2019, 6:50:31 PM] DEBUG [ 'remoteExeCommand: command: [chmod 777 /tmp/nni /tmp/nni/* /tmp/nni/scripts/*]' ]
[12/16/2019, 6:50:31 PM] DEBUG [ 'remoteExeCommand: command: [bash -c \'echo $$ > /tmp/root/nni/scripts/pid ; METRIC_OUTPUT_DIR=/tmp/root/nni/scripts python3 -m nni_gpu_tool.gpu_metrics_collector\']' ]
[12/16/2019, 6:50:31 PM] DEBUG [ 'PUT: /experiment/cluster-metadata: body:\n{\n "trial_config": {\n "command": "zsh dist_train.sh 1 experiments/debug enable_nni",\n "codeDir": "/mnt/cephfs_new_wj/lab_speech/home/duxingjian/ume_dev_debug/experiments/debug/../../",\n "gpuNum": 1\n }\n}' ]
[12/16/2019, 6:50:31 PM] INFO [ 'NNIManager setClusterMetadata, key: trial_config, value: {"command":"zsh dist_train.sh 1 experiments/debug enable_nni","codeDir":"/mnt/cephfs_new_wj/lab_speech/home/duxingjian/ume_dev_debug/experiments/debug/../../","gpuNum":1}' ]
[12/16/2019, 6:50:31 PM] DEBUG [ 'POST: /experiment: body:\n{\n "authorName": "duxingjian",\n "experimentName": "debug",\n "trialConcurrency": 1,\n "maxExecDuration": 3600000,\n "maxTrialNum": 1000,\n "searchSpace": "{\\"optimizer\\": {\\"_type\\": \\"choice\\", \\"_value\\": [\\"torchadam\\", \\"adam_lrd?dropout=0.5\\"]}, \\"lr\\": {\\"_type\\": \\"choice\\", \\"_value\\": [0.001, 0.0001]}, \\"weight_decay\\": {\\"_type\\": \\"choice\\", \\"_value\\": [1e-05, 0]}, \\"max_bin\\": {\\"_type\\": \\"choice\\", \\"_value\\": [1487]}, \\"nb_layers\\": {\\"_type\\": \\"choice\\", \\"_value\\": [4]}, \\"hidden_size\\": {\\"_type\\": \\"choice\\", \\"_value\\": [1024, 2048]}, \\"enable_std\\": {\\"_type\\": \\"choice\\", \\"_value\\": [\\"False\\"]}, \\"enable_dfsmn_ln\\": {\\"_type\\": \\"choice\\", \\"_value\\": [\\"True\\", \\"False\\"]}, \\"hop_length\\": {\\"_type\\": \\"choice\\", \\"_value\\": [1024, 2048]}, \\"act_layer\\": {\\"_type\\": \\"choice\\", \\"_value\\": [\\"nn.LeakyReLU(0.2)\\", \\"MishCuda()\\"]}, \\"norm_layer\\": {\\"_type\\": \\"choice\\", \\"_value\\": [\\"none\\", \\"layer_norm\\"]}, \\"dropout_rate\\": {\\"_type\\": \\"uniform\\", \\"_value\\": [0, 0.5]}, \\"dilation\\": {\\"_type\\": \\"choice\\", \\"_value\\": [1]}, \\"lookup_len\\": {\\"_type\\": \\"choice\\", \\"_value\\": [20, 40]}}",\n "trainingServicePlatform": "remote",\n "tuner": {\n "builtinTunerName": "TPE",\n "classArgs": {\n "optimize_mode": "minimize"\n },\n "className": "TPE"\n },\n "versionCheck": false,\n "clusterMetaData": [\n {\n "key": "machine_list",\n "value": [\n {\n "ip": "10.108.238.238",\n "username": "root",\n "passwd": "qwe",\n "port": 123\n }\n ]\n },\n {\n "key": "trial_config",\n "value": {\n "command": "zsh dist_train.sh 1 experiments/debug enable_nni",\n "codeDir": "/mnt/cephfs_new_wj/lab_speech/home/duxingjian/ume_dev_debug/experiments/debug/../../",\n "gpuNum": 1\n }\n }\n ]\n}' ]
[12/16/2019, 6:50:31 PM] INFO [ 'Starting experiment: ACQgoetz' ]
[12/16/2019, 6:50:31 PM] DEBUG [ 'Setup tuner...' ]
[12/16/2019, 6:50:31 PM] DEBUG [ 'dispatcher command: python3 -m nni --tuner_class_name TPE --tuner_args "{\\"optimize_mode\\":\\"minimize\\"}"' ]
[12/16/2019, 6:50:31 PM] INFO [ 'Change NNIManager status from: INITIALIZED to: RUNNING' ]
[12/16/2019, 6:50:31 PM] INFO [ 'Add event listeners' ]
[12/16/2019, 6:50:31 PM] DEBUG [ 'Send tuner command: INITIALIZE: {"optimizer": {"_type": "choice", "_value": ["torchadam", "adam_lrd?dropout=0.5"]}, "lr": {"_type": "choice", "_value": [0.001, 0.0001]}, "weight_decay": {"_type": "choice", "_value": [1e-05, 0]}, "max_bin": {"_type": "choice", "_value": [1487]}, "nb_layers": {"_type": "choice", "_value": [4]}, "hidden_size": {"_type": "choice", "_value": [1024, 2048]}, "enable_std": {"_type": "choice", "_value": ["False"]}, "enable_dfsmn_ln": {"_type": "choice", "_value": ["True", "False"]}, "hop_length": {"_type": "choice", "_value": [1024, 2048]}, "act_layer": {"_type": "choice", "_value": ["nn.LeakyReLU(0.2)", "MishCuda()"]}, "norm_layer": {"_type": "choice", "_value": ["none", "layer_norm"]}, "dropout_rate": {"_type": "uniform", "_value": [0, 0.5]}, "dilation": {"_type": "choice", "_value": [1]}, "lookup_len": {"_type": "choice", "_value": [20, 40]}}' ]
[12/16/2019, 6:50:31 PM] DEBUG [ 'ipcInterface command type: [IN], content:[{"optimizer": {"_type": "choice", "_value": ["torchadam", "adam_lrd?dropout=0.5"]}, "lr": {"_type": "choice", "_value": [0.001, 0.0001]}, "weight_decay": {"_type": "choice", "_value": [1e-05, 0]}, "max_bin": {"_type": "choice", "_value": [1487]}, "nb_layers": {"_type": "choice", "_value": [4]}, "hidden_size": {"_type": "choice", "_value": [1024, 2048]}, "enable_std": {"_type": "choice", "_value": ["False"]}, "enable_dfsmn_ln": {"_type": "choice", "_value": ["True", "False"]}, "hop_length": {"_type": "choice", "_value": [1024, 2048]}, "act_layer": {"_type": "choice", "_value": ["nn.LeakyReLU(0.2)", "MishCuda()"]}, "norm_layer": {"_type": "choice", "_value": ["none", "layer_norm"]}, "dropout_rate": {"_type": "uniform", "_value": [0, 0.5]}, "dilation": {"_type": "choice", "_value": [1]}, "lookup_len": {"_type": "choice", "_value": [20, 40]}}]' ]
[12/16/2019, 6:50:31 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:50:31 PM] INFO [ 'RestServer start' ]
[12/16/2019, 6:50:31 PM] INFO [ 'RestServer base port is 10162' ]
[12/16/2019, 6:50:31 PM] INFO [ 'Run remote machine training service.' ]
[12/16/2019, 6:50:31 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:32 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:32 PM] INFO [ 'NNIManager received command from dispatcher: ID, ' ]
[12/16/2019, 6:50:32 PM] DEBUG [ 'ipcInterface command type: [GE], content:[1]' ]
[12/16/2019, 6:50:32 PM] INFO [ 'NNIManager received command from dispatcher: TR, {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"optimizer": "adam_lrd?dropout=0.5", "lr": 0.0001, "weight_decay": 1e-05, "max_bin": 1487, "nb_layers": 4, "hidden_size": 1024, "enable_std": "False", "enable_dfsmn_ln": "False", "hop_length": 2048, "act_layer": "MishCuda()", "norm_layer": "layer_norm", "dropout_rate": 0.11768302922687945, "dilation": 1, "lookup_len": 40}, "parameter_index": 0}' ]
[12/16/2019, 6:50:33 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:34 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:35 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:36 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:50:36 PM] INFO [ 'submitTrialJob: form: {"sequenceId":0,"hyperParameters":{"value":"{\\"parameter_id\\": 0, \\"parameter_source\\": \\"algorithm\\", \\"parameters\\": {\\"optimizer\\": \\"adam_lrd?dropout=0.5\\", \\"lr\\": 0.0001, \\"weight_decay\\": 1e-05, \\"max_bin\\": 1487, \\"nb_layers\\": 4, \\"hidden_size\\": 1024, \\"enable_std\\": \\"False\\", \\"enable_dfsmn_ln\\": \\"False\\", \\"hop_length\\": 2048, \\"act_layer\\": \\"MishCuda()\\", \\"norm_layer\\": \\"layer_norm\\", \\"dropout_rate\\": 0.11768302922687945, \\"dilation\\": 1, \\"lookup_len\\": 40}, \\"parameter_index\\": 0}","index":0}}' ]
[12/16/2019, 6:50:36 PM] DEBUG [ 'storeTrialJobEvent: event: WAITING, data: {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"optimizer": "adam_lrd?dropout=0.5", "lr": 0.0001, "weight_decay": 1e-05, "max_bin": 1487, "nb_layers": 4, "hidden_size": 1024, "enable_std": "False", "enable_dfsmn_ln": "False", "hop_length": 2048, "act_layer": "MishCuda()", "norm_layer": "layer_norm", "dropout_rate": 0.11768302922687945, "dilation": 1, "lookup_len": 40}, "parameter_index": 0}, jobDetail: {"id":"Myh3Y","status":"WAITING","submitTime":1576493436156,"workingDirectory":"/tmp/nni/experiments/ACQgoetz/trials/Myh3Y","form":{"sequenceId":0,"hyperParameters":{"value":"{\\"parameter_id\\": 0, \\"parameter_source\\": \\"algorithm\\", \\"parameters\\": {\\"optimizer\\": \\"adam_lrd?dropout=0.5\\", \\"lr\\": 0.0001, \\"weight_decay\\": 1e-05, \\"max_bin\\": 1487, \\"nb_layers\\": 4, \\"hidden_size\\": 1024, \\"enable_std\\": \\"False\\", \\"enable_dfsmn_ln\\": \\"False\\", \\"hop_length\\": 2048, \\"act_layer\\": \\"MishCuda()\\", \\"norm_layer\\": \\"layer_norm\\", \\"dropout_rate\\": 0.11768302922687945, \\"dilation\\": 1, \\"lookup_len\\": 40}, \\"parameter_index\\": 0}","index":0}},"tags":[],"gpuIndices":[]}' ]
[12/16/2019, 6:50:36 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:37 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:50:37 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:50:37 PM] DEBUG [ 'designated gpu indices: undefined' ]
[12/16/2019, 6:50:37 PM] DEBUG [ 'remoteExeCommand: command: [mkdir -p /tmp/nni/experiments/ACQgoetz/trials/Myh3Y]' ]
[12/16/2019, 6:50:37 PM] DEBUG [ 'remoteExeCommand: command: [mkdir -p /tmp/nni/experiments/ACQgoetz/trials/Myh3Y/.nni]' ]
[12/16/2019, 6:50:37 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:37 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:50:37 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:50:37 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:50:37 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:50:38 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:50:38 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:50:38 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:50:38 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:50:38 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:50:38 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:50:38 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:39 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:40 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:41 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:50:41 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:42 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:42 PM] DEBUG [ 'copyFileToRemote: localFilePath: /root/nni/experiments/ACQgoetz/trials-local/Myh3Y/parameter.cfg, remoteFilePath: /tmp/nni/experiments/ACQgoetz/trials/Myh3Y/parameter.cfg' ]
[12/16/2019, 6:50:43 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:44 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:45 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:46 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:50:46 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:47 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:48 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:50:48 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:50:48 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:50:48 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:50:48 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:50:48 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:50:48 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:49 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:50 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:51 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:50:51 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:52 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:53 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:54 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:55 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:50:55 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:50:55 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:50:55 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:50:55 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:50:55 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:50:55 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:55 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:50:55 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:50:55 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:50:55 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:50:55 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:50:55 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:50:56 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:50:56 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:57 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:58 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:50:59 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:00 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:01 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:51:01 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:02 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:03 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:04 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:05 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:05 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:05 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:05 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:05 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:05 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:05 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:06 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:51:06 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:07 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:08 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:09 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:10 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:11 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:51:11 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:12 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:13 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:14 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:15 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:15 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:15 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:15 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:15 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:15 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:15 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:16 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:51:16 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:16 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:16 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:16 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:16 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:16 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:16 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:17 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:18 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:18 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:18 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:18 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:18 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:18 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:18 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:19 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:19 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:19 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:19 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:19 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:19 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:19 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:20 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:20 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:20 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:20 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:20 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:20 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:20 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:20 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:20 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:20 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:20 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:20 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:20 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:21 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:21 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:21 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:21 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:21 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:21 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:21 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:51:21 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:21 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:21 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:21 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:21 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:21 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:21 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:22 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:23 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:23 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:23 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:23 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:23 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:23 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:23 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:23 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:23 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:23 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:23 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:23 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:23 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:23 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:23 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:24 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:24 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:24 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:24 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:24 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:24 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:24 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:24 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:24 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:24 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:24 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:24 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:24 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:24 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:25 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:25 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:25 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:25 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:25 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:25 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:25 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:25 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:25 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:25 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:25 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:25 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:26 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:27 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:27 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:27 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:27 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:27 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:27 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:27 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:27 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:27 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:27 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:27 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:27 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:27 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:28 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:29 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:29 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:29 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:29 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:29 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:29 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:29 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:29 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:29 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:29 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:29 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:29 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:29 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:30 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:31 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:51:31 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:31 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:31 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:31 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:31 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:31 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:31 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:31 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:31 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:31 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:31 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:31 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:31 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:32 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:32 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:32 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:32 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:32 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:32 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:32 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:32 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:32 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:32 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:32 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:32 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:32 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:33 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:33 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:33 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:33 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:33 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:33 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:33 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:33 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:33 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:33 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:33 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:33 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:33 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:34 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:35 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:36 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:51:36 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:36 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:36 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:36 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:36 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:36 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:36 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:36 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:36 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:36 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:36 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:36 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:36 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:37 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:37 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:37 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:37 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:37 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:37 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:51:37 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:38 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:39 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:40 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:41 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:51:41 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:42 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:43 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:44 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:45 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:46 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:51:46 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:47 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:47 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:47 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:47 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:47 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:47 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:47 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:48 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:49 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:50 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:51 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:51:51 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:52 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:53 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:54 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:55 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:56 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:51:56 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:57 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:57 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:51:57 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:51:57 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:51:57 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:51:57 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:51:57 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:51:58 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:51:59 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:00 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:01 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:52:01 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:02 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:03 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:04 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:05 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:06 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:52:06 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:07 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:07 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:52:07 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:52:07 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:52:07 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:52:07 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:52:07 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:52:08 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:09 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:10 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:11 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:52:11 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:12 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:13 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:52:13 PM] DEBUG [ 'GET: /check-status: body:\n{}' ]
[12/16/2019, 6:52:13 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:52:13 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:52:13 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:52:13 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:52:13 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:13 PM] DEBUG [ 'GET: /experiment: body:\n{}' ]
[12/16/2019, 6:52:13 PM] DEBUG [ 'GET: /trial-jobs: body:\n{}' ]
[12/16/2019, 6:52:13 PM] DEBUG [ 'GET: /metric-data: body:\n{}' ]
[12/16/2019, 6:52:13 PM] DEBUG [ 'GET: /version: body:\n{}' ]
[12/16/2019, 6:52:13 PM] DEBUG [ 'getTrialJobsByReplayEvents begin' ]
[12/16/2019, 6:52:13 PM] DEBUG [ 'getTrialJobsByReplayEvents done' ]
[12/16/2019, 6:52:14 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:15 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:16 PM] DEBUG [ 'ipcInterface command type: [PI], content:[]' ]
[12/16/2019, 6:52:16 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:17 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:18 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:19 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 6:52:20 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
➜ ume_dev_debug nnictl log stderr
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
at JSON.parse (<anonymous>)
at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:898745) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
at JSON.parse (<anonymous>)
at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 2)
INFO:hyperopt.tpe:tpe_transform took 0.004019 seconds
INFO:hyperopt.tpe:TPE using 0 trials
(node:898745) ExperimentalWarning: The fs.promises API is experimental
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
at JSON.parse (<anonymous>)
at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 3)
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
at JSON.parse (<anonymous>)
at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 4)
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
at JSON.parse (<anonymous>)
at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 5)
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
at JSON.parse (<anonymous>)
at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 6)
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
at JSON.parse (<anonymous>)
at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 7)
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
at JSON.parse (<anonymous>)
at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 8)
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
at JSON.parse (<anonymous>)
at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 9)
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
at JSON.parse (<anonymous>)
at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 10)
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
at JSON.parse (<anonymous>)
at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 11)
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
at JSON.parse (<anonymous>)
at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 12)
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
at JSON.parse (<anonymous>)
at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 13)
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
at JSON.parse (<anonymous>)
at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 14)
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
at JSON.parse (<anonymous>)
at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 15)
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
at JSON.parse (<anonymous>)
at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 16)
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
at JSON.parse (<anonymous>)
at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 17)
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
at JSON.parse (<anonymous>)
at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab_some/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 18)
(node:898745) UnhandledPromiseRejectionWarning: SyntaxError: Unexpected end of JSON input
at JSON.parse (<anonymous>)
at AnonymousObserver.disposable.timer.subscribe [as _onNext] (/mnt/cephfs/lab/home/sinodragon/acd3/nni/training_service/remote_machine/remoteMachineTrainingService.js:300:42)
at process._tickCallback (internal/process/next_tick.js:68:7)
(node:898745) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 19)
Can you paste nnimanager.log again for the real remote machine? And the output of command
nnictl log stderr
I print out the content of cmdresult.stdout
299 if (cmdresult !== undefined && cmdresult.stdout !== undefined) {
>>>
300 this.log.warning(cmdresult.stdout);
<<<
301 rmMeta.gpuSummary = JSON.parse(cmdresult.stdout);
302 if (rmMeta.gpuSummary.gpuCount === 0) {
303 this.log.warning(`No GPU found on remote machine ${rmMeta.ip}`);
304 this.timer.unsubscribe(disposable);
305 }
[12/16/2019, 7:08:55 PM] WARNING [ '{"Timestamp": "Mon Dec 16 19:08:52 2019", "gpuCount": 8, "gpuInfos": [{"activeProcessNum": 0, "gpuMemUtil": "0", "gpuUtil": "0", "index": 0}, {"activeProcessNum": 0, "gpuMemUtil": "0", "gpuUtil": "0", "index": 1}, {"activeProcessNum": 0, "gpuMemUtil": "0", "gpuUtil": "0", "index": 2}, {"activeProcessNum": 0, "gpuMemUtil": "0", "gpuUtil": "0", "index": 3}, {"activeProcessNum": 0, "gpuMemUtil": "0", "gpuUtil": "0", "index": 4}, {"activeProcessNum": 0, "gpuMemUtil": "0", "gpuUtil": "0", "index": 5}, {"activeProcessNum": 0, "gpuMemUtil": "0", "gpuUtil": "0", "index": 6}, {"activeProcessNum": 0, "gpuMemUtil": "0", "gpuUtil": "0", "index": 7}]}\n' ]
[12/16/2019, 7:08:56 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
[12/16/2019, 7:08:56 PM] WARNING [ '' ]
[12/16/2019, 7:08:57 PM] DEBUG [ 'remoteExeCommand: command: [tail -n 1 /tmp/root/nni/scripts/gpu_metrics]' ]
Is the \n
in this json string causes the SyntaxError
?
It seems the cmdresult.stdout is correct, the \n
won't cause SyntaxError. I dont have this problem with a real remote machine on my side.
You can add some log info in RemoteMachineTrainingService.prepareTrialJob
method to check whether prepareTrialJob is finished correctly.
You can also try to clean up the remote machine environment and try again: on remote machine:
/tmp/root
, /tmp/nni
nni_gpu_tool.gpu_metrics_collector
related processAnd ensure same version of nni is installed on remote machine.
It seems the cmdresult.stdout is correct, the
\n
won't cause SyntaxError. I dont have this problem with a real remote machine on my side. You can add some log info inRemoteMachineTrainingService.prepareTrialJob
method to check whether prepareTrialJob is finished correctly.You can also try to clean up the remote machine environment and try again: on remote machine:
- remove
/tmp/root
,/tmp/nni
- kill
nni_gpu_tool.gpu_metrics_collector
related processAnd ensure same version of nni is installed on remote machine.
@diggerdu hope what @chicm-ms suggested works for you. in addition to github issue, you guys might also consider IM in NNI Gitter channel: https://gitter.im/Microsoft/nni
Short summary about the issue/question: My experiment works well on local mode, but it stucks on "'Scheduler: trialJob id xxxx, no machine can be scheduled, return TMP_NO_AVAILABLE_GPU '" while I change the mode to remotemachine
Brief what process you are following:
tail -n 1 /tmp/root/nni/scripts/gpu_metrics
manuallyfailing with error
tail: cannot open ‘/tmp/root/nni/scripts/gpu_metrics’ for reading: No such file or directory
bash -c echo $$ > /tmp/root/nni/scripts/pid ; METRIC_OUTPUT_DIR=/tmp/root/nni/scripts python3 -m nni_gpu_tool.gpu_metrics_collector
, Outputs:And previous command works flawlessly,
The NNImanager repeats
But the trials are still WAITING. Any suggestions for debuging this error?