Hi @qtz93, it seems you only set one trial in the configuration, and once that trial has been submitted the experiment status is set to NO_MORE_TRIAL. Does this trial stay in 'running' status? Could you use kubectl get pods
to check the pod status of the trial?
@SparkSnail --ooh, Thanks!
(base) [root@test VCkc9]# kubectl get pods
NAME READY STATUS RESTARTS AGE
mnist-distributed-cpu-worker-0 0/1 Completed 0 24h
nni-exp-gpji8ybs-trial-nfuge-worker-0 0/1 Pending 0 21h
nni-exp-t3x98yhy-trial-llnfg-worker-0 0/1 Pending 0 18h
nni-exp-tkbj44vt-trial-z22wj-worker-0 0/1 Pending 0 19h
Hints from the Kubernetes dashboard:
nni-exp-t3x98yhy-trial-llnfg-worker-0 Pending 0 18 hours
0/4 nodes are available: 4 Insufficient memory. error
nni-exp-tkbj44vt-trial-z22wj-worker-0 Pending 0 19 hours
0/4 nodes are available: 4 Insufficient memory. error
nni-exp-gpji8ybs-trial-nfuge-worker-0 Pending 0 21 hours
0/4 nodes are available: 4 Insufficient memory.
mnist-distributed-cpu-worker-0 node2 Finished: Completed 0 1 day
(base) [root@master nni_image]# free -m
total used free shared buff/cache available
Mem: 7805 1937 594 29 5273 5258
Swap: 0 0 0
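For reference, the Kubernetes scheduler compares each pod's memory request (presumably derived from the memoryMB value in the NNI config) against the nodes' allocatable memory, not against the free memory reported by free -m on the master, so the output above is not what the scheduler checks. A quick way to inspect both sides (a sketch using the node and pod names shown above):
kubectl describe node node2 | grep -A 6 Allocatable
kubectl get pod nni-exp-gpji8ybs-trial-nfuge-worker-0 -o jsonpath='{.spec.containers[*].resources.requests}'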
@SparkSnail Hi, can you tell me how to kill the Pending pods related to NNI? I force-deleted the Pending pods, but it doesn't work.
(base) [root@master ~]# kubectl delete pod nni-exp-gpji8ybs-trial-nfuge-worker-0 --force --grace-period=0
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "nni-exp-gpji8ybs-trial-nfuge-worker-0" force deleted
(base) [root@master ~]# kubectl delete pod nni-exp-t3x98yhy-trial-llnfg-worker-0 --force --grace-period=0
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "nni-exp-t3x98yhy-trial-llnfg-worker-0" force deleted
(base) [root@master ~]# kubectl delete pod nni-exp-tkbj44vt-trial-z22wj-worker-0 --force --grace-period=0
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "nni-exp-tkbj44vt-trial-z22wj-worker-0" force deleted
(base) [root@master ~]# kubectl get po | grep Pending
nni-exp-gpji8ybs-trial-nfuge-worker-0 0/1 Pending 0 29s
nni-exp-t3x98yhy-trial-llnfg-worker-0 0/1 Pending 0 15s
nni-exp-tkbj44vt-trial-z22wj-worker-0 0/1 Pending 0 3s
I freed the machine's cache and then restarted the NNI experiment, but the insufficient-memory errors were still there. --__--
(base) [root@master ~]# free -m
total used free shared buff/cache available
Mem: 7805 1663 4526 12 1614 5725
Swap: 0 0 0
@SparkSnail Hi, I changed the value of memoryMB to 4096 and then restarted the NNI experiment. The pod state changed from Pending to ContainerCreating, but it seems there is now something wrong with the NFS configuration.
(base) [root@master ~]# kubectl get pods
NAME READY STATUS RESTARTS AGE
mnist-distributed-cpu-worker-0 0/1 Completed 0 27h
nni-exp-gpji8ybs-trial-nfuge-worker-0 0/1 Pending 0 36m
nni-exp-opxubvp4-trial-akxgz-worker-0 0/1 Pending 0 23m
nni-exp-t3x98yhy-trial-llnfg-worker-0 0/1 Pending 0 35m
nni-exp-tkbj44vt-trial-z22wj-worker-0 0/1 Pending 0 35m
nni-exp-ym0ja7uz-trial-hz9mz-worker-0 0/1 ContainerCreating 0 12m
(base) [root@master ~]# kubectl describe pods nni-exp-ym0ja7uz-trial-hz9mz-worker-0
...
Output: Running scope as unit run-55020.scope.
mount.nfs: access denied by server while mounting 172.16.xx.xx:/opt/data2/nfs_nni_share
Warning FailedMount 58s (x5 over 10m) kubelet, node2 Unable to mount volumes for pod "nni-exp-ym0ja7uz-trial-hz9mz-worker-0_default(6935b3ad-7316-11ea-99d8-000c295dd097)": timeout expired waiting for volumes to attach or mount for pod "default"/"nni-exp-ym0ja7uz-trial-hz9mz-worker-0". list of unmounted volumes=[nni-vol]. list of unattached volumes=[nni-vol default-token-l98vx]
...
Kubernetes dashboard:
MountVolume.SetUp failed for volume "nni-vol" : mount failed: exit status 32 Mounting command: systemd-run Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/6935b3ad-7316-11ea-99d8-000c295dd097/volumes/kubernetes.io~nfs/nni-vol --scope -- mount -t nfs 172.16.xx.xx:/opt/data2/nfs_nni_share /var/lib/kubelet/pods/6935b3ad-7316-11ea-99d8-000c295dd097/volumes/kubernetes.io~nfs/nni-vol Output: Running scope as unit run-11533.scope. mount.nfs: access denied by server while mounting 172.16.xx.xx:/opt/data2/nfs_nni_share
The exports file is configured as follows. Is there any error?
(base) [root@jaserver2 data2]# cat /etc/exports
/opt/data2/nfs_nni_share *(rw,sync,root_squash)
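If /etc/exports was changed after the NFS service started, it may also be worth re-exporting and checking what the server actually exposes (a sketch, run on the NFS server):
exportfs -ra                 # re-read /etc/exports
exportfs -v                  # show the active exports and their options
showmount -e localhost       # list the exported paths as a client would see them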
Hi @qtz93, you could use kubectl get tfjobs
to list all of the tfjobs, and then use kubectl delete tfjob {name}
to delete a job; the pods related to that job will be deleted automatically.
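For example (a sketch; substitute the actual tfjob names returned by the first command):
kubectl get tfjobs
kubectl delete tfjob <tfjob-name>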
NNI sets your NFS server and NFS path in the volume field of the kubeflow config, https://github.com/microsoft/nni/blob/master/src/nni_manager/training_service/kubernetes/kubeflow/kubeflowTrainingService.ts#L424. It seems there is some kind of permission error while Kubeflow mounts your NFS server: mount.nfs: access denied by server while mounting 172.16.xx.xx:/opt/data2/nfs_nni_share. Have you ever tried mounting the NFS server on your local Linux machine? Does it work?
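A manual mount test from one of the worker nodes might look like this (a sketch; the mount point is illustrative, the server path is the one from the error message):
mkdir -p /mnt/nfs_test
mount -t nfs 172.16.xx.xx:/opt/data2/nfs_nni_share /mnt/nfs_test
ls /mnt/nfs_test
umount /mnt/nfs_test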
@SparkSnail Hi, thank you very much for your prompt answer. I have solved the NFS storage problem, and the "WAITING" and "NO_MORE_TRIAL" error messages are gone now. Previously I manually created a yml file and then used "kubectl delete -f tf_job_mnist.yml" to delete the invalid pods.
Now, when I start the NNI experiment, I can see ContainerCreating. After a few minutes the NNI interface shows Running, but a new error appears on Kubernetes.
logs-from-tensorflow-in-nni-exp-g0h6y5zv-trial-p1woh-worker-0.txt:
mkdir: cannot create directory '/tmp/mount/nni/G0H6y5Zv/p1woH/output': Permission denied
Collecting nni
Downloading https://files.pythonhosted.org/packages/ed/73/14ecec1bd9be983bf1fc310f66b540b17d8acabd651ede211bf85d57fffb/nni-1.4-py3-none-manylinux1_x86_64.whl (33.8MB)
Requirement already satisfied, skipping upgrade: astor in /usr/local/lib/python3.5/dist-packages (from nni) (0.8.1)
Requirement already satisfied, skipping upgrade: coverage in /usr/local/lib/python3.5/dist-packages (from nni) (5.0.1)
Requirement already satisfied, skipping upgrade: json-tricks in /usr/local/lib/python3.5/dist-packages (from nni) (3.13.5)
Requirement already satisfied, skipping upgrade: PythonWebHDFS in /usr/local/lib/python3.5/dist-packages (from nni) (0.2.3)
Requirement already satisfied, skipping upgrade: scipy in /usr/local/lib/python3.5/dist-packages (from nni) (1.1.0)
Requirement already satisfied, skipping upgrade: ruamel.yaml in /usr/local/lib/python3.5/dist-packages (from nni) (0.16.5)
Requirement already satisfied, skipping upgrade: hyperopt==0.1.2 in /usr/local/lib/python3.5/dist-packages (from nni) (0.1.2)
Requirement already satisfied, skipping upgrade: schema in /usr/local/lib/python3.5/dist-packages (from nni) (0.7.1)
Requirement already satisfied, skipping upgrade: requests in /usr/lib/python3/dist-packages (from nni) (2.9.1)
Requirement already satisfied, skipping upgrade: psutil in /usr/local/lib/python3.5/dist-packages (from nni) (5.6.7)
Requirement already satisfied, skipping upgrade: colorama in /usr/local/lib/python3.5/dist-packages (from nni) (0.4.3)
Requirement already satisfied, skipping upgrade: numpy in /usr/local/lib/python3.5/dist-packages (from nni) (1.14.3)
Requirement already satisfied, skipping upgrade: scikit-learn<0.22,>=0.20 in /usr/local/lib/python3.5/dist-packages (from nni) (0.20.0)
Requirement already satisfied, skipping upgrade: simplejson in /usr/local/lib/python3.5/dist-packages (from PythonWebHDFS->nni) (3.17.0)
Requirement already satisfied, skipping upgrade: ruamel.yaml.clib>=0.1.2; platform_python_implementation == "CPython" and python_version < "3.8" in /usr/local/lib/python3.5/dist-packages (from ruamel.yaml->nni) (0.2.0)
Requirement already satisfied, skipping upgrade: pymongo in /usr/local/lib/python3.5/dist-packages (from hyperopt==0.1.2->nni) (3.10.0)
Requirement already satisfied, skipping upgrade: six in /usr/lib/python3/dist-packages (from hyperopt==0.1.2->nni) (1.10.0)
Requirement already satisfied, skipping upgrade: tqdm in /usr/local/lib/python3.5/dist-packages (from hyperopt==0.1.2->nni) (4.41.0)
Requirement already satisfied, skipping upgrade: networkx in /usr/local/lib/python3.5/dist-packages (from hyperopt==0.1.2->nni) (2.4)
Requirement already satisfied, skipping upgrade: future in /usr/local/lib/python3.5/dist-packages (from hyperopt==0.1.2->nni) (0.18.2)
Requirement already satisfied, skipping upgrade: contextlib2==0.5.5 in /usr/local/lib/python3.5/dist-packages (from schema->nni) (0.5.5)
Requirement already satisfied, skipping upgrade: decorator>=4.3.0 in /usr/local/lib/python3.5/dist-packages (from networkx->hyperopt==0.1.2->nni) (4.4.1)
Installing collected packages: nni
Successfully installed nni-1.4
WARNING: You are using pip version 19.3.1; however, version 20.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
/tmp/mount/nni/G0H6y5Zv/p1woH/run_worker.sh: 16: /tmp/mount/nni/G0H6y5Zv/p1woH/run_worker.sh: cannot create /tmp/mount/nni/G0H6y5Zv/p1woH/output/worker_output/trialkeeper_stdout: Directory nonexistent
Why is the latest version of NNI downloaded automatically here? Is there a way to choose version 1.3 instead? By the way, is the "/tmp/mount/nni/" directory created inside the container or on NFS?
src/nni_manager/training_service/kubernetes/kubernetesTrainingService.ts:
constructor() {
    this.log = getLogger();
    this.metricsEmitter = new EventEmitter();
    this.trialJobsMap = new Map<string, KubernetesTrialJobDetail>();
    this.trialLocalNFSTempFolder = path.join(getExperimentRootDir(), 'trials-nfs-tmp');
    this.experimentId = getExperimentId();
    this.CONTAINER_MOUNT_PATH = '/tmp/mount';
    this.genericK8sClient = new GeneralK8sClient();
    this.logCollection = 'none';
}
I don’t understand this place. --__--
For reference, /opt/data2/nfs_nni_remote/nni/G0H6y5Zv/p1woH/run_worker.sh contains:
1 #!/bin/bash
2 export NNI_PLATFORM=kubeflow
3 export NNI_SYS_DIR=$PWD/nni/p1woH
4 export NNI_OUTPUT_DIR=/tmp/mount/nni/G0H6y5Zv/p1woH/output/worker_output
5 export MULTI_PHASE=false
6 export NNI_TRIAL_JOB_ID=p1woH
7 export NNI_EXP_ID=G0H6y5Zv
8 export NNI_CODE_DIR=/tmp/mount/nni/G0H6y5Zv/p1woH
9 export NNI_TRIAL_SEQ_ID=0
10 export CUDA_VISIBLE_DEVICES=
11 mkdir -p $NNI_SYS_DIR
12 mkdir -p $NNI_OUTPUT_DIR
13 cp -rT $NNI_CODE_DIR $NNI_SYS_DIR
14 cd $NNI_SYS_DIR
15 sh install_nni.sh
16 python3 -m nni_trial_tool.trial_keeper --trial_command 'python3 mnist.py' --nnimanager_ip enp6s0f0 --nnimanager_port 46002 --nni_manager_version 'v1.3' --log_collection 'none' 1>$NNI_OUTPUT_DIR/trialkeeper_stdout 2>$NNI_OUTPUT_DIR/trialkeeper_stderr
Do the directories "NNI_OUTPUT_DIR" and "NNI_CODE_DIR" refer to the directories inside the container? I entered the container and found that the contents of the tmp directory were empty.
root@d0f86fd82523:/# ll /tmp/
total 0
drwxrwxrwt 2 root root 6 Nov 8 21:44 ./
drwxr-xr-x 1 root root 18 Apr 1 01:35 ../
Hi @qtz93, /tmp/mount is a path inside the container, which kubernetes mounts from your NFS server; it is specified in https://github.com/microsoft/nni/blob/master/src/nni_manager/training_service/kubernetes/kubeflow/kubeflowTrainingService.ts#L442.
NNI also mounts your NFS share on the local machine before submitting a job; the path is ~/nni/experiments/{experimentid}/trials-nfs-tmp, see https://github.com/microsoft/nni/blob/master/src/nni_manager/training_service/kubernetes/kubernetesTrainingService.ts#L294. Could you please check that folder?
Hi @SparkSnail, do you mean that the /tmp/mount directory is automatically generated by Kubernetes inside the container? Sorry, I don't fully understand. The trials-nfs-tmp directory exists on my local machine (this machine is dedicated to storing the NNI trial code and is independent of the 4 cluster nodes). Do I need to pull the NNI image on each node of the cluster? Currently, I have only pulled the NNI image on the machine where the NNI trial code is stored.
(base) [test@jaserver2 G0H6y5Zv]$ pwd
/home/test/nni/experiments/G0H6y5Zv
(base) [test@jaserver2 G0H6y5Zv]$ ll
total 4
drwxrwxr-x 2 test test 6 Mar 31 17:38 checkpoint
drwxrwxr-x 2 test test 24 Mar 31 18:26 db
drwxrwxr-x 2 test test 50 Apr 1 09:25 log
drwxrwxr-x 3 test test 19 Mar 31 17:38 trials-local
drwxrwxr-x 3 test test 4096 Mar 31 15:27 trials-nfs-tmp
(base) [test@jaserver2 G0H6y5Zv]$ ls trials-local/p1woH/
install_nni.sh parameter.cfg run_worker.sh
(base) [test@jaserver2 G0H6y5Zv]$ ls trials-nfs-tmp/nni/
atRt27Va FX57A6Hf G0H6y5Zv OPXubVP4 t3x98YhY TKBj44Vt ym0JA7Uz ZL9W76le
(base) [test@jaserver2 G0H6y5Zv]$ ls trials-nfs-tmp/nni/G0H6y5Zv/
p1woH
(base) [test@jaserver2 G0H6y5Zv]$ ls trials-nfs-tmp/nni/G0H6y5Zv/p1woH/
config_assessor.yml config_kubeflow_bak.yml config_pai.yml install_nni.sh parameter.cfg search_space.json
config_bak.yml config_kubeflow.yml config_windows.yml mnist_before.py run_worker.sh tmp
config_frameworkcontroller.yml config_paiYarn.yml config.yml mnist.py search_space_bak.json
Thank you very much for taking the time to answer my questions. ^_^
Hi @qtz93, /tmp/mount is created by kubernetes automatically, and you do not need to pull the NNI image on the cluster; kubernetes will do that work for you.
The trials-nfs-tmp folder is the folder that mounts your NFS server. I noticed that this folder's user is test, while the /tmp folder's user in the container is root, and the files in the trials-nfs-tmp folder seem to have failed to synchronize to the container. Generally, the two folders should use the same uid. Could you please switch to the root user on the local machine and then use NNI? The trials-nfs-tmp folder is supposed to be owned by the root user. Perhaps the all_squash config will also have some impact.
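One way to inspect and align the ownership might be (a sketch; the experiment id and export options are illustrative and should match your setup):
# on the NNI manager machine: check who owns the locally mounted NFS folder
ls -ld ~/nni/experiments/G0H6y5Zv/trials-nfs-tmp
# on the NFS server: check the export options; root_squash maps the container's root
# user to an anonymous uid, which could explain the 'Permission denied' on mkdir above
cat /etc/exports
exportfs -ra   # re-export after any change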
@SparkSnail Thank you for your answer, I will reconfigure the root user's Python environment and try again. Thank you very much!
@SparkSnail Hello, it seems that I have now gotten through the whole NNI kubeflow workflow. There are still some errors in the NNI experiments because the NVIDIA graphics driver is not installed on the machines. I am going to try to make a CPU version of the NNI image first. Thank you for your patience during this time!
[2020-04-02 07:20:31.050292] INFO trial_keeper_version is 1.3
[2020-04-02 07:20:31.050394] INFO nni_manager_version is 1.3
[2020-04-02 07:20:31.050426] INFO Version match!
Get exception HTTPConnectionPool(host='enp6s0f0', port=46002): Max retries exceeded with url: /api/v1/nni-pai/version/HlLqUj9q/ad4Me (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fae86776b70>: Failed to establish a new connection: [Errno -2] Name or service not known',)) when sending http post to url http://enp6s0f0:46002/api/v1/nni-pai/version/HlLqUj9q/ad4Me
[2020-04-02 07:20:31.070135] INFO Trial keeper spawns a subprocess (pid 20) to run command: ['python3', 'mnist.py']
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
from tensorflow.python.pywrap_tensorflow_internal import *
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
_pywrap_tensorflow_internal = swig_import_helper()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
_mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
File "/usr/lib/python3.5/imp.py", line 242, in load_module
return load_dynamic(name, filename, file)
File "/usr/lib/python3.5/imp.py", line 342, in load_dynamic
return _load(spec)
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "mnist.py", line 9, in <module>
import tensorflow as tf
File "/usr/local/lib/python3.5/dist-packages/tensorflow/__init__.py", line 22, in <module>
from tensorflow.python import pywrap_tensorflow # pylint: disable=unused-import
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/__init__.py", line 49, in <module>
from tensorflow.python import pywrap_tensorflow
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 74, in <module>
raise ImportError(msg)
ImportError: Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
from tensorflow.python.pywrap_tensorflow_internal import *
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
_pywrap_tensorflow_internal = swig_import_helper()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
_mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
File "/usr/lib/python3.5/imp.py", line 242, in load_module
return load_dynamic(name, filename, file)
File "/usr/lib/python3.5/imp.py", line 342, in load_dynamic
return _load(spec)
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
Failed to load the native TensorFlow runtime.
See https://www.tensorflow.org/install/install_sources#common_installation_problems
for some common reasons and solutions. Include the entire stack trace
above this error message when asking for help.
[2020-04-02 07:20:41.083757] INFO subprocess terminated. Exit code is 1. Quit
[2020-04-02 07:20:41.095274] INFO NNI trial keeper exit with code 1
Hi @qtz93, glad to hear that you could go through the NNI process in kubeflow mode. Some advice:
1. The msranni/nni image is based on nvidia-docker; if you didn't install the k8s nvidia plugin for kubernetes, this image does not work. You can use any CPU-based docker image and do not need to build a new one: NNI will install the required environment for you automatically when the job is started.
2. Please use an nniManagerIp which can be accessed from the docker container, since NNI will send metrics back to the nniManager process through this IP.
3. You can set versionCheck:false in your config file, but we suggest using the latest NNI, which has new features and bug fixes; the latest version is v1.4.
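Regarding the nniManagerIp point, one quick way to confirm the address is reachable from inside the cluster might be (a sketch; <nniManagerIp> and <port> are placeholders, e.g. the nnimanager_port value seen in run_worker.sh above):
kubectl run nni-ip-check --rm -it --image=busybox --restart=Never -- wget -O- http://<nniManagerIp>:<port>/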
@SparkSnail Hmm, thank you for your suggestions. I will try version 1.4 soon; I have been following several versions of NNI. I am looking forward to NNI adding more new features and use cases, such as model compression and automatic feature engineering.
Closing the issue given the original problem is resolved. Thanks @qtz93 for supporting NNI. BTW, we just released v1.5, give it a try; we look forward to your feedback =).
Short summary about the issue/question: After installing and deploying version 0.6 of kubeflow on a private Kubernetes cluster, I was unable to successfully run the mnist-tfv1 example in kubeflow mode. The training status always showed WAITING and NO_MORE_TRIAL.
Brief what process you are following: The contents of the config_kubeflow.yml file are as follows:
nnimanager.log:
How to reproduce it:
nni Environment:
need to update document(yes/no): yes
Anything else we need to know: My Kubernetes cluster version is 1.14.1, the kubeflow version is 0.6, and the nvidia driver is not installed in the cluster. I can successfully run TensorFlow programs on the Kubernetes cluster, where the tf-operator version I use is v1. The actual cause of the error cannot be determined from the log; if anyone can tell me the prerequisites or relevant details for running the mnist example in kubeflow mode, it would be greatly appreciated.