Hi @qtz93, it seems you only set one trial in the configuration, and once that trial has been submitted the experiment status is set to NO_MORE_TRIAL. Does this trial stay in 'running' status? Could you use kubectl get pods
to check the pod status of the trial?
@SparkSnail --ooh, Thanks!
(base) [root@test VCkc9]# kubectl get pods
NAME READY STATUS RESTARTS AGE
mnist-distributed-cpu-worker-0 0/1 Completed 0 24h
nni-exp-gpji8ybs-trial-nfuge-worker-0 0/1 Pending 0 21h
nni-exp-t3x98yhy-trial-llnfg-worker-0 0/1 Pending 0 18h
nni-exp-tkbj44vt-trial-z22wj-worker-0 0/1 Pending 0 19h
Hints from the Kubernetes dashboard:
nni-exp-t3x98yhy-trial-llnfg-worker-0 Pending 0 18 hours
0/4 nodes are available: 4 Insufficient memory. error
nni-exp-tkbj44vt-trial-z22wj-worker-0 Pending 0 19 hours
0/4 nodes are available: 4 Insufficient memory. error
nni-exp-gpji8ybs-trial-nfuge-worker-0 Pending 0 21 hours
0/4 nodes are available: 4 Insufficient memory.
mnist-distributed-cpu-worker-0 node2 Finished: Completed 0 1 day
(base) [root@master nni_image]# free -m
total used free shared buff/cache available
Mem: 7805 1937 594 29 5273 5258
Swap: 0 0 0
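For reference, the Kubernetes scheduler compares each pod's memory request (presumably derived from the memoryMB value in the NNI config) against the nodes' allocatable memory, not against the free memory reported by free -m on the master, so the output above is not what the scheduler checks. A quick way to inspect both sides (a sketch using the node and pod names shown above):
kubectl describe node node2 | grep -A 6 Allocatable
kubectl get pod nni-exp-gpji8ybs-trial-nfuge-worker-0 -o jsonpath='{.spec.containers[*].resources.requests}'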
@SparkSnail Hi, can you tell me how to kill the Pending pods related to NNI? I force-deleted the Pending pods, but it doesn't work.
(base) [root@master ~]# kubectl delete pod nni-exp-gpji8ybs-trial-nfuge-worker-0 --force --grace-period=0
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "nni-exp-gpji8ybs-trial-nfuge-worker-0" force deleted
(base) [root@master ~]# kubectl delete pod nni-exp-t3x98yhy-trial-llnfg-worker-0 --force --grace-period=0
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "nni-exp-t3x98yhy-trial-llnfg-worker-0" force deleted
(base) [root@master ~]# kubectl delete pod nni-exp-tkbj44vt-trial-z22wj-worker-0 --force --grace-period=0
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "nni-exp-tkbj44vt-trial-z22wj-worker-0" force deleted
(base) [root@master ~]# kubectl get po | grep Pending
nni-exp-gpji8ybs-trial-nfuge-worker-0 0/1 Pending 0 29s
nni-exp-t3x98yhy-trial-llnfg-worker-0 0/1 Pending 0 15s
nni-exp-tkbj44vt-trial-z22wj-worker-0 0/1 Pending 0 3s
I freed the machine's cache and then restarted the NNI experiment, but the insufficient-memory errors were still there. --__--
(base) [root@master ~]# free -m
total used free shared buff/cache available
Mem: 7805 1663 4526 12 1614 5725
Swap: 0 0 0
@SparkSnail Hi, I changed the value of memoryMB to 4096 and then restarted the NNI experiment. The pod state changed from Pending to ContainerCreating, but it seems there is now something wrong with the NFS configuration.
(base) [root@master ~]# kubectl get pods
NAME READY STATUS RESTARTS AGE
mnist-distributed-cpu-worker-0 0/1 Completed 0 27h
nni-exp-gpji8ybs-trial-nfuge-worker-0 0/1 Pending 0 36m
nni-exp-opxubvp4-trial-akxgz-worker-0 0/1 Pending 0 23m
nni-exp-t3x98yhy-trial-llnfg-worker-0 0/1 Pending 0 35m
nni-exp-tkbj44vt-trial-z22wj-worker-0 0/1 Pending 0 35m
nni-exp-ym0ja7uz-trial-hz9mz-worker-0 0/1 ContainerCreating 0 12m
(base) [root@master ~]# kubectl describe pods nni-exp-ym0ja7uz-trial-hz9mz-worker-0
...
Output: Running scope as unit run-55020.scope.
mount.nfs: access denied by server while mounting 172.16.xx.xx:/opt/data2/nfs_nni_share
Warning FailedMount 58s (x5 over 10m) kubelet, node2 Unable to mount volumes for pod "nni-exp-ym0ja7uz-trial-hz9mz-worker-0_default(6935b3ad-7316-11ea-99d8-000c295dd097)": timeout expired waiting for volumes to attach or mount for pod "default"/"nni-exp-ym0ja7uz-trial-hz9mz-worker-0". list of unmounted volumes=[nni-vol]. list of unattached volumes=[nni-vol default-token-l98vx]
...
Kubernetes dashboard:
MountVolume.SetUp failed for volume "nni-vol" : mount failed: exit status 32 Mounting command: systemd-run Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/6935b3ad-7316-11ea-99d8-000c295dd097/volumes/kubernetes.io~nfs/nni-vol --scope -- mount -t nfs 172.16.xx.xx:/opt/data2/nfs_nni_share /var/lib/kubelet/pods/6935b3ad-7316-11ea-99d8-000c295dd097/volumes/kubernetes.io~nfs/nni-vol Output: Running scope as unit run-11533.scope. mount.nfs: access denied by server while mounting 172.16.xx.xx:/opt/data2/nfs_nni_share
The exports file is configured as follows. Is there any error?
(base) [root@jaserver2 data2]# cat /etc/exports
/opt/data2/nfs_nni_share *(rw,sync,root_squash)
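If /etc/exports was changed after the NFS service started, it may also be worth re-exporting and checking what the server actually exposes (a sketch, run on the NFS server):
exportfs -ra                 # re-read /etc/exports
exportfs -v                  # show the active exports and their options
showmount -e localhost       # list the exported paths as a client would see them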
Hi @qtz93, you could use kubectl get tfjobs
to list all of the tfjobs, and then use kubectl delete tfjob {name}
to delete a job; the pods related to that job will be deleted automatically.
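For example (a sketch; substitute the actual tfjob names returned by the first command):
kubectl get tfjobs
kubectl delete tfjob <tfjob-name>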
NNI sets your NFS server and NFS path in the volume field of the kubeflow config, https://github.com/microsoft/nni/blob/master/src/nni_manager/training_service/kubernetes/kubeflow/kubeflowTrainingService.ts#L424. It seems there is some kind of permission error while Kubeflow mounts your NFS server: mount.nfs: access denied by server while mounting 172.16.xx.xx:/opt/data2/nfs_nni_share. Have you ever tried mounting the NFS server on your local Linux machine? Does it work?
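A manual mount test from one of the worker nodes might look like this (a sketch; the mount point is illustrative, the server path is the one from the error message):
mkdir -p /mnt/nfs_test
mount -t nfs 172.16.xx.xx:/opt/data2/nfs_nni_share /mnt/nfs_test
ls /mnt/nfs_test
umount /mnt/nfs_test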
@SparkSnail Hi, thank you very much for your prompt answer. I have solved the NFS storage problem, and the "WAITING" and "NO_MORE_TRIAL" error messages are gone now. Previously I manually created a yml file and then used "kubectl delete -f tf_job_mnist.yml" to delete the invalid pods.
Now, when I start the NNI experiment, I can see ContainerCreating. After a few minutes the NNI interface shows Running, but a new error appears on Kubernetes.
logs-from-tensorflow-in-nni-exp-g0h6y5zv-trial-p1woh-worker-0.txt:
mkdir: cannot create directory '/tmp/mount/nni/G0H6y5Zv/p1woH/output': Permission denied
Collecting nni
Downloading https://files.pythonhosted.org/packages/ed/73/14ecec1bd9be983bf1fc310f66b540b17d8acabd651ede211bf85d57fffb/nni-1.4-py3-none-manylinux1_x86_64.whl (33.8MB)
Requirement already satisfied, skipping upgrade: astor in /usr/local/lib/python3.5/dist-packages (from nni) (0.8.1)
Requirement already satisfied, skipping upgrade: coverage in /usr/local/lib/python3.5/dist-packages (from nni) (5.0.1)
Requirement already satisfied, skipping upgrade: json-tricks in /usr/local/lib/python3.5/dist-packages (from nni) (3.13.5)
Requirement already satisfied, skipping upgrade: PythonWebHDFS in /usr/local/lib/python3.5/dist-packages (from nni) (0.2.3)
Requirement already satisfied, skipping upgrade: scipy in /usr/local/lib/python3.5/dist-packages (from nni) (1.1.0)
Requirement already satisfied, skipping upgrade: ruamel.yaml in /usr/local/lib/python3.5/dist-packages (from nni) (0.16.5)
Requirement already satisfied, skipping upgrade: hyperopt==0.1.2 in /usr/local/lib/python3.5/dist-packages (from nni) (0.1.2)
Requirement already satisfied, skipping upgrade: schema in /usr/local/lib/python3.5/dist-packages (from nni) (0.7.1)
Requirement already satisfied, skipping upgrade: requests in /usr/lib/python3/dist-packages (from nni) (2.9.1)
Requirement already satisfied, skipping upgrade: psutil in /usr/local/lib/python3.5/dist-packages (from nni) (5.6.7)
Requirement already satisfied, skipping upgrade: colorama in /usr/local/lib/python3.5/dist-packages (from nni) (0.4.3)
Requirement already satisfied, skipping upgrade: numpy in /usr/local/lib/python3.5/dist-packages (from nni) (1.14.3)
Requirement already satisfied, skipping upgrade: scikit-learn<0.22,>=0.20 in /usr/local/lib/python3.5/dist-packages (from nni) (0.20.0)
Requirement already satisfied, skipping upgrade: simplejson in /usr/local/lib/python3.5/dist-packages (from PythonWebHDFS->nni) (3.17.0)
Requirement already satisfied, skipping upgrade: ruamel.yaml.clib>=0.1.2; platform_python_implementation == "CPython" and python_version < "3.8" in /usr/local/lib/python3.5/dist-packages (from ruamel.yaml->nni) (0.2.0)
Requirement already satisfied, skipping upgrade: pymongo in /usr/local/lib/python3.5/dist-packages (from hyperopt==0.1.2->nni) (3.10.0)
Requirement already satisfied, skipping upgrade: six in /usr/lib/python3/dist-packages (from hyperopt==0.1.2->nni) (1.10.0)
Requirement already satisfied, skipping upgrade: tqdm in /usr/local/lib/python3.5/dist-packages (from hyperopt==0.1.2->nni) (4.41.0)
Requirement already satisfied, skipping upgrade: networkx in /usr/local/lib/python3.5/dist-packages (from hyperopt==0.1.2->nni) (2.4)
Requirement already satisfied, skipping upgrade: future in /usr/local/lib/python3.5/dist-packages (from hyperopt==0.1.2->nni) (0.18.2)
Requirement already satisfied, skipping upgrade: contextlib2==0.5.5 in /usr/local/lib/python3.5/dist-packages (from schema->nni) (0.5.5)
Requirement already satisfied, skipping upgrade: decorator>=4.3.0 in /usr/local/lib/python3.5/dist-packages (from networkx->hyperopt==0.1.2->nni) (4.4.1)
Installing collected packages: nni
Successfully installed nni-1.4
WARNING: You are using pip version 19.3.1; however, version 20.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
/tmp/mount/nni/G0H6y5Zv/p1woH/run_worker.sh: 16: /tmp/mount/nni/G0H6y5Zv/p1woH/run_worker.sh: cannot create /tmp/mount/nni/G0H6y5Zv/p1woH/output/worker_output/trialkeeper_stdout: Directory nonexistent
Why is the latest version of NNI downloaded automatically here? Is there a way to choose version 1.3 instead? By the way, is the "/tmp/mount/nni/" directory created inside the container or on NFS?
src/nni_manager/training_service/kubernetes/kubernetesTrainingService.ts:
constructor() {
    this.log = getLogger();
    this.metricsEmitter = new EventEmitter();
    this.trialJobsMap = new Map<string, KubernetesTrialJobDetail>();
    this.trialLocalNFSTempFolder = path.join(getExperimentRootDir(), 'trials-nfs-tmp');
    this.experimentId = getExperimentId();
    this.CONTAINER_MOUNT_PATH = '/tmp/mount';
    this.genericK8sClient = new GeneralK8sClient();
    this.logCollection = 'none';
}
I don’t understand this place. --__--
For reference, /opt/data2/nfs_nni_remote/nni/G0H6y5Zv/p1woH/run_worker.sh contains:
1 #!/bin/bash
2 export NNI_PLATFORM=kubeflow
3 export NNI_SYS_DIR=$PWD/nni/p1woH
4 export NNI_OUTPUT_DIR=/tmp/mount/nni/G0H6y5Zv/p1woH/output/worker_output
5 export MULTI_PHASE=false
6 export NNI_TRIAL_JOB_ID=p1woH
7 export NNI_EXP_ID=G0H6y5Zv
8 export NNI_CODE_DIR=/tmp/mount/nni/G0H6y5Zv/p1woH
9 export NNI_TRIAL_SEQ_ID=0
10 export CUDA_VISIBLE_DEVICES=
11 mkdir -p $NNI_SYS_DIR
12 mkdir -p $NNI_OUTPUT_DIR
13 cp -rT $NNI_CODE_DIR $NNI_SYS_DIR
14 cd $NNI_SYS_DIR
15 sh install_nni.sh
16 python3 -m nni_trial_tool.trial_keeper --trial_command 'python3 mnist.py' --nnimanager_ip enp6s0f0 --nnimanager_port 46002 --nni_manager_version 'v1.3' --log_collection 'none' 1>$NNI_OUTPUT_DIR/trialkeeper_stdout 2>$NNI_OUTPUT_DIR/trialkeeper_stderr
Do the directories "NNI_OUTPUT_DIR" and "NNI_CODE_DIR" refer to the directories inside the container? I entered the container and found that the contents of the tmp directory were empty.
root@d0f86fd82523:/# ll /tmp/
total 0
drwxrwxrwt 2 root root 6 Nov 8 21:44 ./
drwxr-xr-x 1 root root 18 Apr 1 01:35 ../
Hi @qtz93, /tmp/mount is a path inside the container, which kubernetes mounts from your NFS server; it is specified in https://github.com/microsoft/nni/blob/master/src/nni_manager/training_service/kubernetes/kubeflow/kubeflowTrainingService.ts#L442.
NNI also mounts your NFS share on the local machine before submitting a job; the path is ~/nni/experiments/{experimentid}/trials-nfs-tmp, see https://github.com/microsoft/nni/blob/master/src/nni_manager/training_service/kubernetes/kubernetesTrainingService.ts#L294. Could you please check that folder?
Hi @SparkSnail, do you mean that the /tmp/mount directory is automatically generated by Kubernetes inside the container? Sorry, I don't fully understand. The trials-nfs-tmp directory exists on my local machine (this machine is dedicated to storing the NNI trial code and is independent of the 4 cluster nodes). Do I need to pull the NNI image on each node of the cluster? Currently, I have only pulled the NNI image on the machine where the NNI trial code is stored.
(base) [test@jaserver2 G0H6y5Zv]$ pwd
/home/test/nni/experiments/G0H6y5Zv
(base) [test@jaserver2 G0H6y5Zv]$ ll
total 4
drwxrwxr-x 2 test test 6 Mar 31 17:38 checkpoint
drwxrwxr-x 2 test test 24 Mar 31 18:26 db
drwxrwxr-x 2 test test 50 Apr 1 09:25 log
drwxrwxr-x 3 test test 19 Mar 31 17:38 trials-local
drwxrwxr-x 3 test test 4096 Mar 31 15:27 trials-nfs-tmp
(base) [test@jaserver2 G0H6y5Zv]$ ls trials-local/p1woH/
install_nni.sh parameter.cfg run_worker.sh
(base) [test@jaserver2 G0H6y5Zv]$ ls trials-nfs-tmp/nni/
atRt27Va FX57A6Hf G0H6y5Zv OPXubVP4 t3x98YhY TKBj44Vt ym0JA7Uz ZL9W76le
(base) [test@jaserver2 G0H6y5Zv]$ ls trials-nfs-tmp/nni/G0H6y5Zv/
p1woH
(base) [test@jaserver2 G0H6y5Zv]$ ls trials-nfs-tmp/nni/G0H6y5Zv/p1woH/
config_assessor.yml config_kubeflow_bak.yml config_pai.yml install_nni.sh parameter.cfg search_space.json
config_bak.yml config_kubeflow.yml config_windows.yml mnist_before.py run_worker.sh tmp
config_frameworkcontroller.yml config_paiYarn.yml config.yml mnist.py search_space_bak.json
Thank you very much for taking the time to answer my questions. ^_^
Hi @qtz93, /tmp/mount is created by kubernetes automatically, and you do not need to pull the NNI image on the cluster; kubernetes will do that work for you.
The trials-nfs-tmp folder is the folder that mounts your NFS server. I noticed that this folder's user is test, while the /tmp folder's user in the container is root, and the files in the trials-nfs-tmp folder seem to have failed to synchronize to the container. Generally, the two folders should use the same uid. Could you please switch to the root user on the local machine and then use NNI? The trials-nfs-tmp folder is supposed to be owned by the root user. Perhaps the all_squash config will also have some impact.
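One way to inspect and align the ownership might be (a sketch; the experiment id and export options are illustrative and should match your setup):
# on the NNI manager machine: check who owns the locally mounted NFS folder
ls -ld ~/nni/experiments/G0H6y5Zv/trials-nfs-tmp
# on the NFS server: check the export options; root_squash maps the container's root
# user to an anonymous uid, which could explain the 'Permission denied' on mkdir above
cat /etc/exports
exportfs -ra   # re-export after any change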
@SparkSnail Thank you for your answer, I will reconfigure the root user's Python environment and try again. Thank you very much!
@SparkSnail Hello, it seems that I have now gotten through the whole NNI kubeflow workflow. There are still some errors in the NNI experiments because the NVIDIA graphics driver is not installed on the machines. I am going to try to make a CPU version of the NNI image first. Thank you for your patience during this time!
[2020-04-02 07:20:31.050292] INFO trial_keeper_version is 1.3
[2020-04-02 07:20:31.050394] INFO nni_manager_version is 1.3
[2020-04-02 07:20:31.050426] INFO Version match!
Get exception HTTPConnectionPool(host='enp6s0f0', port=46002): Max retries exceeded with url: /api/v1/nni-pai/version/HlLqUj9q/ad4Me (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fae86776b70>: Failed to establish a new connection: [Errno -2] Name or service not known',)) when sending http post to url http://enp6s0f0:46002/api/v1/nni-pai/version/HlLqUj9q/ad4Me
[2020-04-02 07:20:31.070135] INFO Trial keeper spawns a subprocess (pid 20) to run command: ['python3', 'mnist.py']
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
from tensorflow.python.pywrap_tensorflow_internal import *
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
_pywrap_tensorflow_internal = swig_import_helper()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
_mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
File "/usr/lib/python3.5/imp.py", line 242, in load_module
return load_dynamic(name, filename, file)
File "/usr/lib/python3.5/imp.py", line 342, in load_dynamic
return _load(spec)
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "mnist.py", line 9, in <module>
import tensorflow as tf
File "/usr/local/lib/python3.5/dist-packages/tensorflow/__init__.py", line 22, in <module>
from tensorflow.python import pywrap_tensorflow # pylint: disable=unused-import
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/__init__.py", line 49, in <module>
from tensorflow.python import pywrap_tensorflow
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 74, in <module>
raise ImportError(msg)
ImportError: Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
from tensorflow.python.pywrap_tensorflow_internal import *
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
_pywrap_tensorflow_internal = swig_import_helper()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
_mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
File "/usr/lib/python3.5/imp.py", line 242, in load_module
return load_dynamic(name, filename, file)
File "/usr/lib/python3.5/imp.py", line 342, in load_dynamic
return _load(spec)
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
Failed to load the native TensorFlow runtime.
See https://www.tensorflow.org/install/install_sources#common_installation_problems
for some common reasons and solutions. Include the entire stack trace
above this error message when asking for help.
[2020-04-02 07:20:41.083757] INFO subprocess terminated. Exit code is 1. Quit
[2020-04-02 07:20:41.095274] INFO NNI trial keeper exit with code 1
Hi @qtz93, glad to hear that you could go through the NNI process in kubeflow mode. Some advice:
1. The msranni/nni image is based on nvidia-docker; if you didn't install the k8s nvidia plugin for kubernetes, this image does not work. You can use any CPU-based docker image and do not need to build a new one: NNI will install the required environment for you automatically when the job is started.
2. Please use an nniManagerIp which can be accessed from the docker container, since NNI will send metrics back to the nniManager process through this IP.
3. You can set versionCheck:false in your config file, but we suggest using the latest NNI, which has new features and bug fixes; the latest version is v1.4.
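Regarding the nniManagerIp point, one quick way to confirm the address is reachable from inside the cluster might be (a sketch; <nniManagerIp> and <port> are placeholders, e.g. the nnimanager_port value seen in run_worker.sh above):
kubectl run nni-ip-check --rm -it --image=busybox --restart=Never -- wget -O- http://<nniManagerIp>:<port>/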
@SparkSnail Hmm, thank you for your suggestions. I will try version 1.4 soon; I have been following several versions of NNI. I am looking forward to NNI adding more new features and use cases, such as model compression and automatic feature engineering.
Closing the issue given the original problem is resolved. Thanks @qtz93 for supporting NNI. BTW, we just released v1.5, give it a try; we look forward to your feedback =).
Short summary about the issue/question: After installing and deploying version 0.6 of kubeflow on a private Kubernetes cluster, I was unable to successfully run the mnist-tfv1 example in kubeflow mode. The training status always showed WAITING and NO_MORE_TRIAL.
Brief what process you are following: The contents of the config_kubeflow.yml file are as follows:
nnimanager.log:
How to reproduce it:
nni Environment:
need to update document(yes/no): yes
Anything else we need to know: My Kubernetes cluster version is 1.14.1, the kubeflow version is 0.6, and the nvidia driver is not installed in the cluster. I can successfully run TensorFlow programs on the Kubernetes cluster, where the tf-operator version I use is v1. The actual cause of the error cannot be determined from the log; if anyone can tell me the prerequisites or relevant details for running the mnist example in kubeflow mode, it would be greatly appreciated.