Closed: xieydd closed this issue 6 years ago
Did you deploy the MPI operator? Please check step 7 of https://github.com/AliyunContainerService/arena/tree/master/docs/installation.
kubectl create -f arena/kubernetes-artifacts/mpi-operator/mpi-operator.yaml
@cheyang Thanks a lot.
When I run an MPIJob:
arena submit mpi --name=mpi-dist \
--gpus=0 \
--workers=2 \
--image=horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
--syncMode=git \
--syncSource=https://github.com/tensorflow/benchmarks.git \
"mpirun python code/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training_logs --summary_verbosity=3 --save_summaries_steps=10"
arena list shows the job, but arena get returns nothing:
$ ./arena list
NAME STATUS TRAINER AGE NODE
mpi-dist RUNNING MPIJOB 55s
$ ./arena get mpi-dist
NAME STATUS TRAINER AGE INSTANCE NODE
@cheyang Can you help me? Thanks a lot.
What is the output of the following commands?
kubectl get po | grep mpi-dist
kubectl get mpijob -o=yaml
Please also upload the mpi-operator log:
# kubectl get po -n arena-system -o=name| grep mpi
pod/mpi-operator-b589fbf6b-8fjw7
# kubectl logs -n arena-system mpi-operator-b589fbf6b-8fjw7 &> /tmp/mpi-operator.log
The problem is that I can create the job, but no pods are created. @cheyang
I think it's an mpi-operator issue, but I'm not able to reproduce it on my machine. Can you provide the mpi-operator log so I can investigate? Thanks.
Can you do the following steps to collect the logs?
# kubectl get mpijob -o=yaml
# kubectl get po -n arena-system -o=name| grep mpi
pod/mpi-operator-b589fbf6b-8fjw7
# kubectl logs -n arena-system mpi-operator-b589fbf6b-8fjw7 &> /tmp/mpi-operator.log
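The collection steps above can be wrapped in a small helper so the pod name does not have to be copied by hand. This is only a sketch: the function name and the default output directory are hypothetical, and it assumes the operator runs in the arena-system namespace with "mpi-operator" in its pod name, as in the output shown here.

```shell
#!/bin/sh
# Sketch: gather the MPIJob definitions and the operator log in one directory.
# Assumption: the operator pod lives in arena-system and its name contains
# "mpi-operator". The default output directory is a made-up example.
collect_mpi_operator_logs() {
    out_dir="${1:-/tmp/mpi-operator-debug}"
    mkdir -p "$out_dir" || return 1

    # Dump every MPIJob in the current namespace.
    kubectl get mpijob -o=yaml > "$out_dir/mpijobs.yaml" 2>&1

    # Resolve the operator pod (printed as pod/<name>) instead of hardcoding it.
    pod=$(kubectl get po -n arena-system -o=name | grep mpi-operator | head -n 1)
    if [ -z "$pod" ]; then
        echo "no mpi-operator pod found in arena-system" >&2
        return 1
    fi

    # Capture both stdout and stderr of the operator.
    kubectl logs -n arena-system "$pod" > "$out_dir/mpi-operator.log" 2>&1
    echo "$out_dir"
}
```

Calling collect_mpi_operator_logs /tmp/debug would leave mpijobs.yaml and mpi-operator.log in /tmp/debug, ready to attach to the issue.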
@cheyang All right, I will provide the log tomorrow morning. Thanks a lot.
@cheyang Here is my log. It looks like an Unauthorized error.
$ kubectl get mpijob -o=yaml
apiVersion: v1
items:
- apiVersion: kubeflow.org/v1alpha1
  kind: MPIJob
  metadata:
    clusterName: ""
    creationTimestamp: 2018-08-24T09:58:33Z
    generation: 1
    labels:
      app: mpijob
      chart: mpijob-0.2.0
      createdBy: MPIJob
      heritage: Tiller
      release: mpi-dist
    name: mpi-dist-mpijob
    namespace: default
    resourceVersion: "2717151"
    selfLink: /apis/kubeflow.org/v1alpha1/namespaces/default/mpijobs/mpi-dist-mpijob
    uid: 42db2871-a784-11e8-b49c-002590c0f788
  spec:
    BackoffLimit: 0
    launcherOnMaster: true
    replicas: 2
    template:
      metadata:
        labels:
          app: mpijob
          chart: mpijob-0.2.0
          createdBy: MPIJob
          heritage: Tiller
          release: mpi-dist
        name: mpi-dist-mpijob
      spec:
        containers:
        - command:
          - sh
          - -c
          - mpirun python code/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
            --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training_logs
            --summary_verbosity=3 --save_summaries_steps=10
          env:
          - name: gpus
            value: "0"
          - name: workers
            value: "2"
          image: bootstrapper:5000/sextant/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5
          imagePullPolicy: null
          name: mpi
          resources:
            limits: null
            requests: null
          volumeMounts:
          - mountPath: /root/code
            name: code-sync
          - mountPath: /dev/shm
            name: dshm
          workingDir: /root
        dnsPolicy: ClusterFirstWithHostNet
        hostIPC: true
        hostNetwork: true
        hostPID: true
        initContainers:
        - env:
          - name: gpus
            value: "0"
          - name: workers
            value: "2"
          - name: GIT_SYNC_REPO
            value: https://github.com/tensorflow/benchmarks.git
          - name: GIT_SYNC_DEST
            value: benchmarks
          - name: GIT_SYNC_ROOT
            value: /code
          - name: GIT_SYNC_ONE_TIME
            value: "true"
          image: registry.cn-zhangjiakou.aliyuncs.com/tensorflow-samples/git-sync:v2.0.6
          imagePullPolicy: null
          name: init-code
          volumeMounts:
          - mountPath: /code
            name: code-sync
        restartPolicy: Never
        volumes:
        - emptyDir: {}
          name: code-sync
        - emptyDir:
            medium: Memory
            sizeLimit: 2Gi
          name: dshm
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
$ kubectl get po -n arena-system -o=name| grep mpi
pod/mpi-operator-65d474df56-ctgqh
$ cat /tmp/mpi-operator.log
W0827 07:49:24.265932 1 client_config.go:529] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0827 07:49:24.268296 1 mpi_job_controller.go:150] Creating event broadcaster
I0827 07:49:24.268407 1 mpi_job_controller.go:179] Setting up event handlers
I0827 07:49:24.268459 1 mpi_job_controller.go:297] Starting MPIJob controller
I0827 07:49:24.268467 1 mpi_job_controller.go:300] Waiting for informer caches to sync
I0827 07:49:24.268836 1 reflector.go:202] Starting reflector *v1alpha1.MPIJob (30s) from github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62
I0827 07:49:24.268851 1 reflector.go:202] Starting reflector *v1.Role (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0827 07:49:24.268865 1 reflector.go:240] Listing and watching *v1alpha1.MPIJob from github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62
I0827 07:49:24.268869 1 reflector.go:202] Starting reflector *v1.ServiceAccount (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0827 07:49:24.268881 1 reflector.go:240] Listing and watching *v1.ServiceAccount from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0827 07:49:24.268891 1 reflector.go:202] Starting reflector *v1.Job (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0827 07:49:24.268905 1 reflector.go:240] Listing and watching *v1.Job from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0827 07:49:24.268956 1 reflector.go:202] Starting reflector *v1.StatefulSet (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0827 07:49:24.268956 1 reflector.go:202] Starting reflector *v1.RoleBinding (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0827 07:49:24.268973 1 reflector.go:240] Listing and watching *v1.StatefulSet from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0827 07:49:24.268977 1 reflector.go:240] Listing and watching *v1.RoleBinding from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0827 07:49:24.268993 1 reflector.go:202] Starting reflector *v1.ConfigMap (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0827 07:49:24.269017 1 reflector.go:240] Listing and watching *v1.ConfigMap from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0827 07:49:24.268870 1 reflector.go:240] Listing and watching *v1.Role from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
E0827 07:49:24.275305 1 reflector.go:205] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.Job: Unauthorized
E0827 07:49:24.276204 1 reflector.go:205] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.StatefulSet: Unauthorized
E0827 07:49:24.276886 1 reflector.go:205] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.ServiceAccount: Unauthorized
E0827 07:49:24.279245 1 reflector.go:205] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Failed to list *v1alpha1.MPIJob: Unauthorized
E0827 07:49:24.279351 1 reflector.go:205] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.RoleBinding: Unauthorized
E0827 07:49:24.279385 1 reflector.go:205] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.Role: Unauthorized
E0827 07:49:24.279643 1 reflector.go:205] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.ConfigMap: Unauthorized
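When the operator log is long, it can help to reduce it to one line per resource type that failed. The following is a rough sketch, assuming the log has been saved to a file; the function name is hypothetical and the pattern only targets the "Failed to list ...: <reason>" lines seen in this log.

```shell
#!/bin/sh
# Sketch: count "Failed to list" errors per resource type and reason.
# Assumption: log lines look like
#   ... Failed to list *v1.Job: Unauthorized
summarize_list_failures() {
    log_file="$1"
    grep 'Failed to list' "$log_file" \
        | sed -n 's/.*Failed to list \(\*[^:]*\): \(.*\)$/\2 \1/p' \
        | sort | uniq -c
}
```

For example, summarize_list_failures /tmp/mpi-operator.log prints a count, the reason, and the type on each line. If every informer fails with the same reason, as here, the problem is more likely to lie with the operator's credentials as a whole than with one missing rule.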
$ kubectl describe clusterrole mpi-operator
Name: mpi-operator
Labels: <none>
Annotations: kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"rbac.authorization.k8s.io/v1","kind":"ClusterRole","metadata":{"annotations":{},"name":"mpi-operator","namespace":""},"rules":[{"apiGrou...
PolicyRule:
Resources Non-Resource URLs Resource Names Verbs
--------- ----------------- -------------- -----
configmaps [] [] [create list watch]
events [] [] [create patch]
pods [] [] [get]
pods/exec [] [] [create]
serviceaccounts [] [] [create list watch]
customresourcedefinitions.apiextensions.k8s.io [] [] [create get]
statefulsets.apps [] [] [create list update watch]
jobs.batch [] [] [create list update watch]
mpijobs.kubeflow.org [] [] [*]
rolebindings.rbac.authorization.k8s.io [] [] [create list watch]
roles.rbac.authorization.k8s.io [] [] [create list watch]
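One point worth separating out: the Kubernetes API server reports "Unauthorized" (HTTP 401) when it cannot authenticate the caller, while a request denied by RBAC is reported as "Forbidden" (HTTP 403). The ClusterRole above already grants list on the failing resources, so it may be worth confirming that the rules are effective for the operator's service account before changing them. A hedged sketch, assuming the service account is mpi-operator in the arena-system namespace (the function name is hypothetical):

```shell
#!/bin/sh
# Sketch: ask the API server whether the operator's service account may
# list each resource type its informers watch.
# Assumption: the service account is arena-system/mpi-operator.
check_operator_rbac() {
    sa="system:serviceaccount:${1:-arena-system}:${2:-mpi-operator}"
    for res in configmaps serviceaccounts jobs.batch statefulsets.apps \
               roles.rbac.authorization.k8s.io \
               rolebindings.rbac.authorization.k8s.io \
               mpijobs.kubeflow.org; do
        printf '%-45s ' "$res"
        # "kubectl auth can-i" prints yes or no; it exits non-zero on "no".
        kubectl auth can-i list "$res" --as="$sa" --all-namespaces || true
    done
}
```

If every line prints yes while the operator still logs Unauthorized, the RBAC rules are probably not the cause, and the token the operator pod presents is the next thing to inspect.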
@cheyang Can you help me? Thanks a lot.
@cheyang I synchronized with the upstream code, and that fixed it.
$ kubectl logs mpi-operator-65d474df56-4456c -n arena-system
W0828 03:09:47.434902 1 client_config.go:529] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0828 03:09:47.437611 1 mpi_job_controller.go:150] Creating event broadcaster
I0828 03:09:47.437765 1 mpi_job_controller.go:179] Setting up event handlers
I0828 03:09:47.437868 1 mpi_job_controller.go:297] Starting MPIJob controller
I0828 03:09:47.437880 1 mpi_job_controller.go:300] Waiting for informer caches to sync
I0828 03:09:47.438212 1 reflector.go:202] Starting reflector *v1.Role (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0828 03:09:47.438239 1 reflector.go:240] Listing and watching *v1.Role from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0828 03:09:47.438291 1 reflector.go:202] Starting reflector *v1.Job (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0828 03:09:47.438313 1 reflector.go:240] Listing and watching *v1.Job from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0828 03:09:47.438346 1 reflector.go:202] Starting reflector *v1.ConfigMap (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0828 03:09:47.438357 1 reflector.go:202] Starting reflector *v1.RoleBinding (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0828 03:09:47.438382 1 reflector.go:240] Listing and watching *v1.RoleBinding from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0828 03:09:47.438410 1 reflector.go:202] Starting reflector *v1.ServiceAccount (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0828 03:09:47.438431 1 reflector.go:240] Listing and watching *v1.ServiceAccount from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0828 03:09:47.438467 1 reflector.go:202] Starting reflector *v1alpha1.MPIJob (30s) from github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62
I0828 03:09:47.438491 1 reflector.go:240] Listing and watching *v1alpha1.MPIJob from github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62
I0828 03:09:47.438363 1 reflector.go:240] Listing and watching *v1.ConfigMap from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0828 03:09:47.439106 1 reflector.go:202] Starting reflector *v1.StatefulSet (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0828 03:09:47.439126 1 reflector.go:240] Listing and watching *v1.StatefulSet from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0828 03:09:47.597929 1 mpi_job_controller.go:726] Processing object: system::leader-locking-kube-scheduler
I0828 03:09:47.597967 1 mpi_job_controller.go:726] Processing object: system:controller:bootstrap-signer
I0828 03:09:47.598003 1 mpi_job_controller.go:726] Processing object: system:controller:cloud-provider
I0828 03:09:47.598031 1 mpi_job_controller.go:726] Processing object: system:controller:token-cleaner
I0828 03:09:47.598039 1 mpi_job_controller.go:726] Processing object: system:controller:token-cleaner
I0828 03:09:47.598066 1 mpi_job_controller.go:726] Processing object: system::leader-locking-kube-controller-manager
I0828 03:09:47.598093 1 mpi_job_controller.go:726] Processing object: system::leader-locking-kube-scheduler
I0828 03:09:47.598105 1 mpi_job_controller.go:726] Processing object: system:controller:bootstrap-signer
I0828 03:09:47.598114 1 mpi_job_controller.go:726] Processing object: kubernetes-dashboard-minimal
I0828 03:09:47.598045 1 mpi_job_controller.go:726] Processing object: system:controller:bootstrap-signer
I0828 03:09:47.598140 1 mpi_job_controller.go:726] Processing object: kubernetes-dashboard-minimal
I0828 03:09:47.598163 1 mpi_job_controller.go:726] Processing object: system::leader-locking-kube-controller-manager
I0828 03:09:47.598126 1 mpi_job_controller.go:726] Processing object: extension-apiserver-authentication-reader
I0828 03:09:47.598181 1 mpi_job_controller.go:726] Processing object: system:controller:bootstrap-signer
I0828 03:09:47.598196 1 mpi_job_controller.go:726] Processing object: system:controller:cloud-provider
I0828 03:09:47.607497 1 mpi_job_controller.go:726] Processing object: coredns
I0828 03:09:47.607527 1 mpi_job_controller.go:726] Processing object: kubernetes-dashboard-settings
I0828 03:09:47.607548 1 mpi_job_controller.go:726] Processing object: tf-job-operator-config
I0828 03:09:47.607566 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1
I0828 03:09:47.607581 1 mpi_job_controller.go:726] Processing object: extension-apiserver-authentication
I0828 03:09:47.608017 1 mpi_job_controller.go:726] Processing object: default
I0828 03:09:47.609484 1 mpi_job_controller.go:726] Processing object: kubernetes-dashboard
I0828 03:09:47.609639 1 mpi_job_controller.go:726] Processing object: heapster
I0828 03:09:47.609667 1 mpi_job_controller.go:726] Processing object: default
I0828 03:09:47.609685 1 mpi_job_controller.go:726] Processing object: tf-job-dashboard
I0828 03:09:47.609707 1 mpi_job_controller.go:726] Processing object: tf-job-operator
I0828 03:09:47.609718 1 mpi_job_controller.go:726] Processing object: mpi-operator
I0828 03:09:47.609726 1 mpi_job_controller.go:726] Processing object: default
I0828 03:09:47.609736 1 mpi_job_controller.go:726] Processing object: default
I0828 03:09:47.609745 1 mpi_job_controller.go:726] Processing object: coredns
I0828 03:09:47.609760 1 mpi_job_controller.go:726] Processing object: tiller
I0828 03:09:47.609774 1 mpi_job_controller.go:726] Processing object: jobmon
I0828 03:09:47.638061 1 shared_informer.go:122] caches populated
I0828 03:09:47.638081 1 mpi_job_controller.go:305] Starting workers
I0828 03:09:47.638095 1 mpi_job_controller.go:311] Started workers
Sorry, I didn't get a chance to take a look at it. Glad to hear that you fixed it! Thank you.
@cheyang I have a question: all of the job's pods use the host IP. Why not use a virtual IP (VIP)? I tested with Calico, where the pod IP is not the host IP, and the MPIJob runs successfully.
This run only uses one GPU; I think it is an MPI error.
[xieyd@ec-0d-9a-20-99-00 templates]$ kubectl logs mpijob-test-launcher-h6kdr
+ POD_NAME=mpijob-test-worker-1
+ shift
+ /opt/kube/kubectl exec mpijob-test-worker-1 -- /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "3371302912" -mca ess_base_vpid 2 -mca ess_base_num_procs "3" -mca orte_node_regex "mpijob-test-launcher-vmtsv,mpijob-test-worker-0,mpijob-test-worker-1@0(3)" -mca orte_hnp_uri "3371302912.0;tcp://10.99.111.171:45950" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=mpijob-test-worker-0
+ shift
+ /opt/kube/kubectl exec mpijob-test-worker-0 -- /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "3371302912" -mca ess_base_vpid 1 -mca ess_base_num_procs "3" -mca orte_node_regex "mpijob-test-launcher-vmtsv,mpijob-test-worker-0,mpijob-test-worker-1@0(3)" -mca orte_hnp_uri "3371302912.0;tcp://10.99.111.171:45950" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca pmix "^s1,s2,cray,isolated"
error: You must be logged in to the server (Unauthorized)
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:
my node: mpijob-test-launcher-vmtsv
target node: mpijob-test-worker-0
This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
Local host: mpijob-test-worker-0
Remote host: mpijob-test-launcher-vmtsv
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
command terminated with exit code 1
W0829 09:02:27.764325 139621931185920 tf_logging.py:125] From /root/code/rev-221558d8f76d53c41daed424ab2702a7b79f56ff/scripts/tf_cnn_benchmarks/benchmark_cnn.py:1816: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-08-29 09:02:28.967068: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-08-29 09:02:31.530612: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:04:00.0
totalMemory: 11.90GiB freeMemory: 7.37GiB
2018-08-29 09:02:31.530692: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2018-08-29 09:02:32.409454: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-29 09:02:32.409514: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2018-08-29 09:02:32.409525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
2018-08-29 09:02:32.409957: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7102 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:04:00.0, compute capability: 6.1)
I0829 09:02:34.053395 139621931185920 tf_logging.py:115] Running local_init_op.
I0829 09:02:34.441133 139621931185920 tf_logging.py:115] Done running local_init_op.
I0829 09:02:39.621228 139621931185920 tf_logging.py:115] Starting standard services.
I0829 09:02:39.732469 139621931185920 tf_logging.py:115] Starting queue runners.
I0829 09:02:39.733679 139605135312640 tf_logging.py:159] global_step/sec: 0
mpijob-test-launcher-vmtsv:49883:50195 [0] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
mpijob-test-launcher-vmtsv:49883:50195 [0] INFO Using internal Network Socket
mpijob-test-launcher-vmtsv:49883:50195 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
mpijob-test-launcher-vmtsv:49883:50195 [0] INFO NET : Using interface eth0:10.99.111.171<0>
mpijob-test-launcher-vmtsv:49883:50195 [0] INFO NET/Socket : 1 interfaces found
NCCL version 2.2.13+cuda9.0
Unexpected end of /proc/mounts line `overlay / overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/FMIY6JLTROGPGHIFC7VHST3MHN:/var/lib/docker/overlay2/l/YIYFHRBGEOGXJWNIZ3OLLJOMBR:/var/lib/docker/overlay2/l/OZZABZU3E2ELGTNKNU3YLU5FCC:/var/lib/docker/overlay2/l/SNRAK6TOCQRYSYU5GG5MCPP4AF:/var/lib/docker/overlay2/l/7HCTRLVOEQNUXZ22TKCO6EYHSU:/var/lib/docker/overlay2/l/VX2SKSONWB27MNUXZMZTDDWPDE:/var/lib/docker/overlay2/l/5ASBPXQNZYC5U6M2QLCRAO6ULJ:/var/lib/docker/overlay2/l/JAPXZZR2AP4OUXKUL4KTW7UJMX:/var/lib/docker/overlay2/l/XYPXTWN44E5YS'
Unexpected end of /proc/mounts line `2UXTCC4BI65QH:/var/lib/docker/overlay2/l/JRYL3A7FJKRGSON7FYPQTKMJMJ:/var/lib/docker/overlay2/l/UNTKY2BMZ7Y7L2KDXXDFMS3XGE:/var/lib/docker/overlay2/l/PC272ZPTB45AJDSFZE7LFGQLVO:/var/lib/docker/overlay2/l/LRIVCU6DRR76TETA6ZFOR5DFPM:/var/lib/docker/overlay2/l/VC27DA6K7IB6MOKJWG6QQD4HKT:/var/lib/docker/overlay2/l/IJEOCPPNN4HTBYT5Q6RMON3G3F:/var/lib/docker/overlay2/l/RLKAMT7UCGR3TJJUAQ2MKK65DB:/var/lib/docker/overlay2/l/DLRJCUNYRCATROJAX37IPWGTV6:/var/lib/docker/overlay2/l/SMIZYNU6HOKWGVVAKBMCKKEYPJ:/var/lib/do'
Unexpected end of /proc/mounts line `cker/overlay2/l/NIV2T5HMKEZBMFAIH3WNW2ZBX6:/var/lib/docker/overlay2/l/HBEHBSZLV6KUCVECN3XCQBXEAX:/var/lib/docker/overlay2/l/Q65OIHDU75CDLEF5GJGISUWUZN:/var/lib/docker/overlay2/l/BF2JUOXLTHH7ME6FTU6PGJCX73:/var/lib/docker/overlay2/l/V5FSXPLW7S733S7GHZQOKYYWEI:/var/lib/docker/overlay2/l/XYY3HCTQXSF62ZPFQK7XS5TVQ4:/var/lib/docker/overlay2/l/ZYWSF4DVUF3ZWDGOZUXRUTCAKH,upperdir=/var/lib/docker/overlay2/738067e3a663a95b9434b2055dc7e282771403aab15ffe3fd9799905cd7c8096/diff,workdir=/var/lib/docker/overlay2/738067e'
mpijob-test-launcher-vmtsv:49883:50195 [0] INFO comm 0x7efb582f30a0 rank 0 nranks 1
mpijob-test-launcher-vmtsv:49883:50195 [0] INFO Using 256 threads
mpijob-test-launcher-vmtsv:49883:50195 [0] INFO Min Comp Cap 6
mpijob-test-launcher-vmtsv:49883:50195 [0] INFO NCCL_SINGLE_RING_THRESHOLD=131072
TensorFlow: 1.10
Model: resnet101
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 32 global
32.0 per device
Num batches: 100
Num epochs: 0.00
Devices: ['horovod/gpu:0']
Data format: NCHW
Optimizer: sgd
Variables: horovod
==========
Generating model
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 82.7 +/- 0.0 (jitter = 0.0) 9.034
10 images/sec: 46.6 +/- 6.1 (jitter = 2.5) 9.195
20 images/sec: 46.0 +/- 4.5 (jitter = 3.1) 9.146
30 images/sec: 45.9 +/- 3.5 (jitter = 1.9) 9.341
40 images/sec: 46.9 +/- 3.5 (jitter = 2.7) 9.398
50 images/sec: 49.4 +/- 4.4 (jitter = 3.9) 9.055
60 images/sec: 48.8 +/- 3.8 (jitter = 3.1) 9.127
70 images/sec: 48.4 +/- 3.4 (jitter = 3.0) 9.065
80 images/sec: 47.8 +/- 3.1 (jitter = 2.8) 9.034
90 images/sec: 47.8 +/- 2.9 (jitter = 3.1) 9.018
100 images/sec: 47.6 +/- 2.6 (jitter = 3.0) 9.126
----------------------------------------------------------------
total images/sec: 47.38
----------------------------------------------------------------
@cheyang Sorry about that. I set useHostNetwork: false and used the VIP, but the same error still appears in the log:
error: You must be logged in to the server (Unauthorized)
Can you check the output of kubectl get role mpijob-test-launcher -o=yaml?
@cheyang Here is my log:
[xieyd@ec-0d-9a-20-99-00 ~]$ kubectl get role mpijob-test-launcher -o=yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  creationTimestamp: 2018-08-29T09:24:39Z
  labels:
    app: mpijob-test
  name: mpijob-test-launcher
  namespace: default
  ownerReferences:
  - apiVersion: kubeflow.org/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: MPIJob
    name: mpijob-test
    uid: 5aa7eb9c-ab6d-11e8-b869-ac1f6b252044
  resourceVersion: "2210618"
  selfLink: /apis/rbac.authorization.k8s.io/v1/namespaces/default/roles/mpijob-test-launcher
  uid: 5aae7bbe-ab6d-11e8-b869-ac1f6b252044
rules:
- apiGroups:
  - ""
  resourceNames:
  - mpijob-test-worker-0
  - mpijob-test-worker-1
  resources:
  - pods
  verbs:
  - get
- apiGroups:
  - ""
  resourceNames:
  - mpijob-test-worker-0
  - mpijob-test-worker-1
  resources:
  - pods/exec
  verbs:
  - create
From the logs, the job launcher can launch mpijob-test-worker-1 successfully and is able to run training. But there are communication issues between the master and the node that runs mpijob-test-worker-0:
ORTE does not know how to route a message to the specified daemon
located on the indicated node:
my node: mpijob-test-launcher-vmtsv
target node: mpijob-test-worker-0
I suggest you run tail -f /dev/null (a command that runs forever) as the MPIJob command so the pods stay up, and then use kubectl exec to check the network connection between the master and the specified node.
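A connectivity check of this kind might be sketched as follows. It is an illustration only: the function name is hypothetical, it assumes the pods are kept alive (for example with tail -f /dev/null), and it assumes the ping binary exists in the container image, which may not be true for every image.

```shell
#!/bin/sh
# Sketch: test whether one pod can reach another pod's IP.
# Assumptions: both pods are running in the current namespace, and the
# source pod's image ships "ping".
check_pod_reachability() {
    from_pod="$1"
    to_pod="$2"

    # Look up the target pod's IP from its status.
    to_ip=$(kubectl get pod "$to_pod" -o jsonpath='{.status.podIP}')
    if [ -z "$to_ip" ]; then
        echo "could not resolve IP of $to_pod" >&2
        return 1
    fi

    # Run ping inside the source pod.
    if kubectl exec "$from_pod" -- ping -c 3 "$to_ip" >/dev/null 2>&1; then
        echo "$from_pod -> $to_pod ($to_ip): reachable"
    else
        echo "$from_pod -> $to_pod ($to_ip): NOT reachable"
        return 1
    fi
}
```

With the pod names from the log above, check_pod_reachability mpijob-test-worker-0 mpijob-test-launcher-vmtsv would test the direction that ORTE reported as failing. A failed ping can also mean ICMP is blocked, so the result is a hint rather than proof.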
I found the error: it is the MPI Unauthorized error again. I don't know why, because the ClusterRole, ClusterRoleBinding, and ServiceAccount were all created when I updated mpi-operator.yaml, yet the error still occurs.
@cheyang
[xieyd@ec-0d-9a-20-99-00 mpi-operator]$ kubectl delete -f mpi-operator.yaml
customresourcedefinition "mpijobs.kubeflow.org" deleted
clusterrole "mpi-operator" deleted
serviceaccount "mpi-operator" deleted
clusterrolebinding "mpi-operator" deleted
deployment "mpi-operator" deleted
[xieyd@ec-0d-9a-20-99-00 mpi-operator]$ kubectl create -f mpi-operator.yaml
customresourcedefinition "mpijobs.kubeflow.org" created
clusterrole "mpi-operator" created
serviceaccount "mpi-operator" created
clusterrolebinding "mpi-operator" created
[xieyd@ec-0d-9a-20-99-00 mpi-operator]$ kubectl logs mpi-operator-844cc74bd6-pxkzc -n arena-system
W0829 12:45:46.299827 1 client_config.go:529] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0829 12:45:46.302060 1 mpi_job_controller.go:150] Creating event broadcaster
I0829 12:45:46.302870 1 mpi_job_controller.go:179] Setting up event handlers
I0829 12:45:46.303054 1 mpi_job_controller.go:297] Starting MPIJob controller
I0829 12:45:46.303064 1 mpi_job_controller.go:300] Waiting for informer caches to sync
I0829 12:45:46.303489 1 reflector.go:202] Starting reflector *v1.Job (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0829 12:45:46.303518 1 reflector.go:240] Listing and watching *v1.Job from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0829 12:45:46.303548 1 reflector.go:202] Starting reflector *v1alpha1.MPIJob (30s) from github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62
I0829 12:45:46.303565 1 reflector.go:240] Listing and watching *v1alpha1.MPIJob from github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62
I0829 12:45:46.303575 1 reflector.go:202] Starting reflector *v1.ServiceAccount (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0829 12:45:46.303581 1 reflector.go:202] Starting reflector *v1.RoleBinding (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0829 12:45:46.303591 1 reflector.go:240] Listing and watching *v1.ServiceAccount from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0829 12:45:46.303582 1 reflector.go:202] Starting reflector *v1.ConfigMap (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0829 12:45:46.303614 1 reflector.go:240] Listing and watching *v1.ConfigMap from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0829 12:45:46.303597 1 reflector.go:240] Listing and watching *v1.RoleBinding from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0829 12:45:46.303689 1 reflector.go:202] Starting reflector *v1.Role (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0829 12:45:46.303716 1 reflector.go:240] Listing and watching *v1.Role from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0829 12:45:46.303744 1 reflector.go:202] Starting reflector *v1.StatefulSet (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0829 12:45:46.303771 1 reflector.go:240] Listing and watching *v1.StatefulSet from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
E0829 12:45:46.315683 1 reflector.go:205] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.RoleBinding: Unauthorized
E0829 12:45:46.315749 1 reflector.go:205] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.Job: Unauthorized
E0829 12:45:46.315917 1 reflector.go:205] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.ConfigMap: Unauthorized
E0829 12:45:46.315768 1 reflector.go:205] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.Role: Unauthorized
E0829 12:45:46.315824 1 reflector.go:205] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.StatefulSet: Unauthorized
E0829 12:45:46.315945 1 reflector.go:205] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.ServiceAccount: Unauthorized
E0829 12:45:46.316100 1 reflector.go:205] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Failed to list *v1alpha1.MPIJob: Unauthorized
I0829 12:45:47.316040 1 reflector.go:240] Listing and watching *v1.RoleBinding from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0829 12:45:47.317130 1 reflector.go:240] Listing and watching *v1.Job from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
@xieydd May I know your email address? I'd like to send you an email.
@cheyang xieydd@gmail.com Thanks a lot.
When I create an MPI job, the launcher pod doesn't start.
@cheyang
mj-mpijob-56dfffc49c-pvp6z 1/1 Running 0 4m 0/0 10.99.147.182 ec-0d-9a-d9-bf-52
mj-mpijob-worker-0 1/1 Running 0 4m 1/1 10.99.147.181 ec-0d-9a-d9-bf-52
mj-mpijob-worker-1 1/1 Running 0 4m 1/1 10.99.224.104 ec-0d-9a-d9-96-c2
@denverdino Could you help me?
@xieydd, I sent you an email. Please check.
@cheyang @xieydd When I run the example:
$ arena submit mpi --name=mpi-dist \
--gpus=1 \
--workers=2 \
--image=uber/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
--env=GIT_SYNC_BRANCH=cnn_tf_v1.9_compatible \
--syncMode=git \
--syncSource=https://github.com/tensorflow/benchmarks.git \
--tensorboard \
"mpirun python code/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training_logs --summary_verbosity=3 --save_summaries_steps=10"
I get the following state. Why can't 'mpi-dist-mpijob-worker-0' and 'mpi-dist-mpijob-worker-1' run? Please help me solve this. Thanks.
You can debug by using arena get mpi-dist -e. It will show pending events.
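If arena's output is not enough, roughly the same events can be pulled with plain kubectl. A minimal sketch, assuming the pod name and namespace are already known; pod_events and the KUBECTL override are hypothetical helpers for illustration, not part of arena.

```shell
# Hypothetical helper: list the events recorded for one pod, oldest first.
# KUBECTL can point at another binary (or a stub) for dry runs.
# Usage: pod_events mpi-dist-mpijob-worker-0 arena-system
pod_events() {
  local pod="$1" namespace="${2:-default}"
  "${KUBECTL:-kubectl}" get events \
    --namespace "$namespace" \
    --field-selector "involvedObject.name=${pod}" \
    --sort-by=.lastTimestamp
}
```

The namespace argument matters here: events are namespaced, so querying the wrong namespace returns nothing even when the pod is failing.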
@cheyang After I ran that command, I got the following output. (Note: I run arena on 192.168.110.25, which is the k8s master; 192.168.110.158 is a node.)
[root@k8s-master arena]# arena get mpi-dist -e --namespace arena-system
mpi-dist mpi-dist mpi-dist
NAME STATUS TRAINER AGE INSTANCE NODE
mpi-dist PENDING MPIJOB 1m mpi-dist-mpijob-worker-0 N/A
mpi-dist RUNNING MPIJOB 1m mpi-dist-mpijob-worker-1 192.168.110.158
Your tensorboard will be available on: 192.168.110.25:31268
Events:
INSTANCE TYPE AGE MESSAGE
mpi-dist-mpijob-worker-0 Normal 17m [Killing] Killing container with id docker://mpi:Need to kill Pod
mpi-dist-mpijob-worker-0 Normal 11m [Scheduled] Successfully assigned arena-system/mpi-dist-mpijob-worker-0 to 192.168.110.25
mpi-dist-mpijob-worker-0 Normal 11m [Pulled] Container image "registry.cn-zhangjiakou.aliyuncs.com/tensorflow-samples/git-sync:v2.0.6" already present on machine
mpi-dist-mpijob-worker-0 Normal 11m [Created] Created container
mpi-dist-mpijob-worker-0 Normal 11m [Started] Started container
mpi-dist-mpijob-worker-0 Warning 11m [BackOff] Back-off restarting failed container
mpi-dist-mpijob-worker-0 Normal 4m [Scheduled] Successfully assigned arena-system/mpi-dist-mpijob-worker-0 to 192.168.110.25
mpi-dist-mpijob-worker-0 Normal 4m [Pulled] Container image "registry.cn-zhangjiakou.aliyuncs.com/tensorflow-samples/git-sync:v2.0.6" already present on machine
mpi-dist-mpijob-worker-0 Normal 4m [Created] Created container
mpi-dist-mpijob-worker-0 Normal 4m [Started] Started container
mpi-dist-mpijob-worker-0 Warning 3m [BackOff] Back-off restarting failed container
mpi-dist-mpijob-worker-0 Normal 1m [Scheduled] Successfully assigned arena-system/mpi-dist-mpijob-worker-0 to 192.168.110.25
mpi-dist-mpijob-worker-0 Normal 1m [Pulled] Container image "registry.cn-zhangjiakou.aliyuncs.com/tensorflow-samples/git-sync:v2.0.6" already present on machine
mpi-dist-mpijob-worker-0 Normal 1m [Created] Created container
mpi-dist-mpijob-worker-0 Normal 1m [Started] Started container
mpi-dist-mpijob-worker-0 Warning 1m [BackOff] Back-off restarting failed container
mpi-dist-mpijob-worker-0 Normal 6m [Scheduled] Successfully assigned default/mpi-dist-mpijob-worker-0 to 192.168.110.25
mpi-dist-mpijob-worker-0 Normal 6m [Pulled] Container image "registry.cn-zhangjiakou.aliyuncs.com/tensorflow-samples/git-sync:v2.0.6" already present on machine
mpi-dist-mpijob-worker-0 Normal 6m [Created] Created container
mpi-dist-mpijob-worker-0 Normal 6m [Started] Started container
mpi-dist-mpijob-worker-0 Warning 6m [BackOff] Back-off restarting failed container
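The repeated BackOff warnings mean a container in worker-0 keeps exiting after it starts, so that container's own logs are usually the next thing to look at. A rough sketch with plain kubectl follows; the helper names and the KUBECTL override are hypothetical, and the container name has to be looked up first because it does not appear in the events above.

```shell
# Hypothetical helper: print init-container and container names of a pod.
# Usage: list_pod_containers mpi-dist-mpijob-worker-0 arena-system
list_pod_containers() {
  local pod="$1" namespace="$2"
  "${KUBECTL:-kubectl}" get pod "$pod" --namespace "$namespace" \
    -o jsonpath='{.spec.initContainers[*].name} {.spec.containers[*].name}'
}

# Hypothetical helper: show logs from the previous (crashed) run of a container.
# Usage: previous_container_logs POD NAMESPACE CONTAINER
previous_container_logs() {
  local pod="$1" namespace="$2" container="$3"
  "${KUBECTL:-kubectl}" logs "$pod" \
    --namespace "$namespace" \
    --container "$container" \
    --previous
}
```

Note that the events list the same pod name under both arena-system and default, so it may be worth running these against each namespace.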
@cheyang I ran kubectl logs -n arena-system mpi-operator-f49774cdc-bb2q8 &> /tmp/mpi-operator.log and got:
W0125 03:11:23.505801 1 client_config.go:529] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work. I0125 03:11:23.509887 1 mpi_job_controller.go:150] Creating event broadcaster I0125 03:11:23.510170 1 mpi_job_controller.go:179] Setting up event handlers I0125 03:11:23.510485 1 mpi_job_controller.go:297] Starting MPIJob controller I0125 03:11:23.510539 1 mpi_job_controller.go:300] Waiting for informer caches to sync I0125 03:11:23.510771 1 reflector.go:202] Starting reflector v1.RoleBinding (0s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:11:23.510843 1 reflector.go:240] Listing and watching v1.RoleBinding from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:11:23.511399 1 reflector.go:202] Starting reflector v1alpha1.MPIJob (0s) from github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62 I0125 03:11:23.511417 1 reflector.go:240] Listing and watching v1alpha1.MPIJob from github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62 I0125 03:11:23.511872 1 reflector.go:202] Starting reflector v1.StatefulSet (0s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:11:23.511895 1 reflector.go:240] Listing and watching v1.StatefulSet from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:11:23.512268 1 reflector.go:202] Starting reflector v1.Job (0s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:11:23.512286 1 reflector.go:240] Listing and watching v1.Job from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:11:23.512691 1 reflector.go:202] Starting reflector v1.ConfigMap (0s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:11:23.512708 1 reflector.go:240] Listing and 
watching v1.ConfigMap from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:11:23.513327 1 reflector.go:202] Starting reflector v1.ServiceAccount (0s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:11:23.513352 1 reflector.go:240] Listing and watching v1.ServiceAccount from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:11:23.514461 1 reflector.go:202] Starting reflector v1.Role (0s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:11:23.514497 1 reflector.go:240] Listing and watching v1.Role from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:11:23.535022 1 mpi_job_controller.go:726] Processing object: ffdl-lcm I0125 03:11:23.535049 1 mpi_job_controller.go:726] Processing object: default I0125 03:11:23.535072 1 mpi_job_controller.go:726] Processing object: kube-batchd I0125 03:11:23.535094 1 mpi_job_controller.go:726] Processing object: tille I0125 03:11:23.535130 1 mpi_job_controller.go:726] Processing object: default I0125 03:11:23.535162 1 mpi_job_controller.go:726] Processing object: default I0125 03:11:23.535184 1 mpi_job_controller.go:726] Processing object: kube-dns I0125 03:11:23.535207 1 mpi_job_controller.go:726] Processing object: tiller I0125 03:11:23.535238 1 mpi_job_controller.go:726] Processing object: tiller I0125 03:11:23.535274 1 mpi_job_controller.go:726] Processing object: default I0125 03:11:23.535292 1 mpi_job_controller.go:726] Processing object: tf-job-operator I0125 03:11:23.535310 1 mpi_job_controller.go:726] Processing object: dashboard I0125 03:11:23.535333 1 mpi_job_controller.go:726] Processing object: mpi-operator I0125 03:11:23.535364 1 mpi_job_controller.go:726] Processing object: tf-job-dashboard I0125 03:11:23.535382 1 mpi_job_controller.go:726] Processing object: jobmon I0125 03:11:23.539961 1 
mpi_job_controller.go:726] Processing object: dashboard-default I0125 03:11:23.539988 1 mpi_job_controller.go:726] Processing object: system:controller:bootstrap-signer I0125 03:11:23.540006 1 mpi_job_controller.go:726] Processing object: system:controller:cloud-provider I0125 03:11:23.540017 1 mpi_job_controller.go:726] Processing object: system:controller:token-cleaner I0125 03:11:23.540033 1 mpi_job_controller.go:726] Processing object: system::leader-locking-kube-controller-manager I0125 03:11:23.540051 1 mpi_job_controller.go:726] Processing object: system::leader-locking-kube-scheduler I0125 03:11:23.540078 1 mpi_job_controller.go:726] Processing object: system:controller:bootstrap-signer I0125 03:11:23.540101 1 mpi_job_controller.go:726] Processing object: tiller-binding I0125 03:11:23.543963 1 mpi_job_controller.go:726] Processing object: mongo I0125 03:11:23.543988 1 mpi_job_controller.go:726] Processing object: storage I0125 03:11:23.544250 1 mpi_job_controller.go:726] Processing object: extension-apiserver-authentication-reader I0125 03:11:23.544282 1 mpi_job_controller.go:726] Processing object: system::leader-locking-kube-controller-manager I0125 03:11:23.544319 1 mpi_job_controller.go:726] Processing object: system:controller:token-cleaner I0125 03:11:23.544359 1 mpi_job_controller.go:726] Processing object: system:controller:bootstrap-signer I0125 03:11:23.544427 1 mpi_job_controller.go:726] Processing object: system::leader-locking-kube-scheduler I0125 03:11:23.544454 1 mpi_job_controller.go:726] Processing object: system:controller:bootstrap-signer I0125 03:11:23.544495 1 mpi_job_controller.go:726] Processing object: system:controller:cloud-provider I0125 03:11:23.544522 1 mpi_job_controller.go:726] Processing object: tiller-manager I0125 03:11:23.873705 1 mpi_job_controller.go:726] Processing object: kube-system.v1 I0125 03:11:23.873750 1 mpi_job_controller.go:726] Processing object: vck.v1 I0125 03:11:23.873795 1 mpi_job_controller.go:726] 
Processing object: viable-donkey.v1 I0125 03:11:23.873829 1 mpi_job_controller.go:726] Processing object: inky-turkey.v1 I0125 03:11:23.873856 1 mpi_job_controller.go:726] Processing object: eager-toucan.v1 I0125 03:11:23.873879 1 mpi_job_controller.go:726] Processing object: honest-cheetah.v1 I0125 03:11:23.873910 1 mpi_job_controller.go:726] Processing object: laughing-seastar.v1 I0125 03:11:23.873928 1 mpi_job_controller.go:726] Processing object: maudlin-ibis.v1 I0125 03:11:23.873955 1 mpi_job_controller.go:726] Processing object: ornery-sasquatch.v1 I0125 03:11:23.874000 1 mpi_job_controller.go:726] Processing object: static-volumes-v2 I0125 03:11:23.874032 1 mpi_job_controller.go:726] Processing object: ulterior-ferret.v1 I0125 03:11:23.874054 1 mpi_job_controller.go:726] Processing object: statsd-exporter-configmap I0125 03:11:23.874104 1 mpi_job_controller.go:726] Processing object: kissing-emu.v1 I0125 03:11:23.874149 1 mpi_job_controller.go:726] Processing object: kube-dns I0125 03:11:23.874167 1 mpi_job_controller.go:726] Processing object: tf-job-operator-config I0125 03:11:23.874194 1 mpi_job_controller.go:726] Processing object: factual-buffoon.v1 I0125 03:11:23.874230 1 mpi_job_controller.go:726] Processing object: lumbering-horse.v1 I0125 03:11:23.874257 1 mpi_job_controller.go:726] Processing object: quarrelsome-elephant.v1 I0125 03:11:23.874297 1 mpi_job_controller.go:726] Processing object: learner-config I0125 03:11:23.874347 1 mpi_job_controller.go:726] Processing object: elevated-bee.v1 I0125 03:11:23.874383 1 mpi_job_controller.go:726] Processing object: nuanced-platypus.v1 I0125 03:11:23.874410 1 mpi_job_controller.go:726] Processing object: zooming-quokka.v1 I0125 03:11:23.874446 1 mpi_job_controller.go:726] Processing object: prometheus-alertrules I0125 03:11:23.874468 1 mpi_job_controller.go:726] Processing object: learner-entrypoint-files I0125 03:11:23.874491 1 mpi_job_controller.go:726] Processing object: torpid-pronghorn.v1 I0125 
03:11:23.874509 1 mpi_job_controller.go:726] Processing object: unsung-garfish.v1 I0125 03:11:23.874550 1 mpi_job_controller.go:726] Processing object: prometheus I0125 03:11:23.874576 1 mpi_job_controller.go:726] Processing object: quaffing-sparrow.v1 I0125 03:11:23.874604 1 mpi_job_controller.go:726] Processing object: terrific-jackal.v1 I0125 03:11:23.874631 1 mpi_job_controller.go:726] Processing object: static-volumes-v2 I0125 03:11:23.874653 1 mpi_job_controller.go:726] Processing object: invisible-dragon.v1 I0125 03:11:23.874711 1 mpi_job_controller.go:726] Processing object: jazzy-tuatara.v1 I0125 03:11:23.874748 1 mpi_job_controller.go:726] Processing object: juiced-beetle.v1 I0125 03:11:23.874779 1 mpi_job_controller.go:726] Processing object: garish-guppy.v1 I0125 03:11:23.874874 1 mpi_job_controller.go:726] Processing object: goodly-newt.v1 I0125 03:11:23.874941 1 mpi_job_controller.go:726] Processing object: lumpy-lionfish.v1 I0125 03:11:23.874986 1 mpi_job_controller.go:726] Processing object: extension-apiserver-authentication I0125 03:11:23.875018 1 mpi_job_controller.go:726] Processing object: cantankerous-meerkat.v1 I0125 03:11:23.875049 1 mpi_job_controller.go:726] Processing object: flailing-raccoon.v1 I0125 03:11:23.875090 1 mpi_job_controller.go:726] Processing object: hopping-mongoose.v1 I0125 03:11:23.875130 1 mpi_job_controller.go:726] Processing object: nonexistent-goat.v1 I0125 03:11:23.875153 1 mpi_job_controller.go:726] Processing object: turbulent-hummingbird.v1 I0125 03:11:23.875184 1 mpi_job_controller.go:726] Processing object: coy-mastiff.v1 I0125 03:11:23.875216 1 mpi_job_controller.go:726] Processing object: undercooked-mite.v1 I0125 03:11:23.875238 1 mpi_job_controller.go:726] Processing object: clunky-goat.v1 I0125 03:11:23.875292 1 mpi_job_controller.go:726] Processing object: static-volumes I0125 03:11:23.875315 1 mpi_job_controller.go:726] Processing object: eponymous-umbrellabird.v1 I0125 03:11:23.875378 1 
mpi_job_controller.go:726] Processing object: prometheus-alertmanager I0125 03:11:23.875405 1 mpi_job_controller.go:726] Processing object: brown-wildebeest.v1 I0125 03:11:23.875427 1 mpi_job_controller.go:726] Processing object: falling-kudu.v1 I0125 03:11:23.875445 1 mpi_job_controller.go:726] Processing object: static-volumes I0125 03:11:23.875463 1 mpi_job_controller.go:726] Processing object: bailing-orangutan.v1 I0125 03:11:23.875490 1 mpi_job_controller.go:726] Processing object: littering-woodpecker.v1 I0125 03:11:23.875513 1 mpi_job_controller.go:726] Processing object: lolling-manta.v1 I0125 03:11:23.910895 1 shared_informer.go:122] caches populated I0125 03:11:23.910949 1 mpi_job_controller.go:305] Starting workers I0125 03:11:23.911008 1 mpi_job_controller.go:311] Started workers E0125 03:11:57.278512 1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=27, ErrCode=NO_ERROR, debug="" I0125 03:11:57.279565 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received E0125 03:11:57.280649 1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=27, ErrCode=NO_ERROR, debug="" I0125 03:11:57.280874 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received E0125 03:11:57.280923 1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=27, ErrCode=NO_ERROR, debug="" I0125 03:11:57.281040 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received E0125 03:11:57.281830 1 streamwatcher.go:109] Unable to decode an event from the watch stream: 
http2: server sent GOAWAY and closed the connection; LastStreamID=27, ErrCode=NO_ERROR, debug="" E0125 03:11:57.281965 1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=27, ErrCode=NO_ERROR, debug="" I0125 03:11:57.282055 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received I0125 03:11:57.282154 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received E0125 03:11:57.282909 1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=27, ErrCode=NO_ERROR, debug="" I0125 03:11:57.283063 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received E0125 03:11:57.283278 1 reflector.go:322] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Failed to watch v1alpha1.MPIJob: Get https://10.254.0.1:443/apis/kubeflow.org/v1alpha1/mpijobs?resourceVersion=888614&timeoutSeconds=454&watch=true: dial tcp 10.254.0.1:443: connect: connection refused E0125 03:11:57.283674 1 reflector.go:322] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to watch v1.Job: Get https://10.254.0.1:443/apis/batch/v1/jobs?resourceVersion=890102&timeoutSeconds=329&watch=true: dial tcp 10.254.0.1:443: connect: connection refused E0125 03:11:57.281843 1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=27, ErrCode=NO_ERROR, debug="" I0125 03:11:57.283872 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items 
received E0125 03:11:57.283980 1 reflector.go:322] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to watch v1.StatefulSet: Get https://10.254.0.1:443/apis/apps/v1/statefulsets?resourceVersion=891251&timeoutSeconds=346&watch=true: dial tcp 10.254.0.1:443: connect: connection refused E0125 03:11:57.284151 1 reflector.go:322] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to watch v1.ServiceAccount: Get https://10.254.0.1:443/api/v1/serviceaccounts?resourceVersion=888619&timeoutSeconds=390&watch=true: dial tcp 10.254.0.1:443: connect: connection refused E0125 03:11:57.284394 1 reflector.go:322] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to watch v1.RoleBinding: Get https://10.254.0.1:443/apis/rbac.authorization.k8s.io/v1/rolebindings?resourceVersion=888621&timeoutSeconds=364&watch=true: dial tcp 10.254.0.1:443: connect: connection refused E0125 03:11:57.284649 1 reflector.go:322] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to watch v1.ConfigMap: Get https://10.254.0.1:443/api/v1/configmaps?resourceVersion=890126&timeoutSeconds=544&watch=true: dial tcp 10.254.0.1:443: connect: connection refused E0125 03:11:57.285082 1 reflector.go:322] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to watch v1.Role: Get https://10.254.0.1:443/apis/rbac.authorization.k8s.io/v1/roles?resourceVersion=888620&timeoutSeconds=414&watch=true: dial tcp 10.254.0.1:443: connect: connection refused I0125 03:11:58.283996 1 reflector.go:240] Listing and watching v1alpha1.MPIJob from github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62 I0125 03:11:58.285132 1 reflector.go:240] Listing and watching v1.Job from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:11:58.286245 1 reflector.go:240] Listing and watching 
v1.StatefulSet from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:11:58.287296 1 reflector.go:240] Listing and watching v1.ServiceAccount from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:11:58.288441 1 reflector.go:240] Listing and watching v1.RoleBinding from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:11:58.289654 1 reflector.go:240] Listing and watching v1.ConfigMap from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:11:58.290875 1 reflector.go:240] Listing and watching v1.Role from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:13:16.945226 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1 I0125 03:13:17.030641 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1 I0125 03:13:17.115647 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-config I0125 03:13:17.152655 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 03:13:17.225291 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 03:13:17.304548 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 03:13:17.347139 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 03:13:17.374148 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 03:13:17.385180 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 03:13:17.385535 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"28c68bdc-204f-11e9-93ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"891601", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 03:13:17.392194 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 
03:13:17.392271 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"28c68bdc-204f-11e9-93ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"891622", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 03:13:17.397979 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 03:13:17.413323 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 03:13:17.413467 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"28c68bdc-204f-11e9-93ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"891622", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 03:13:51.138803 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 03:13:51.159267 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 03:13:51.159434 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"28c68bdc-204f-11e9-93ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"891622", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 03:13:51.170584 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 03:13:51.170638 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"28c68bdc-204f-11e9-93ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"891734", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 03:18:08.118511 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 3 items received I0125 03:18:12.832449 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close 
- v1alpha1.MPIJob total 3 items received I0125 03:18:31.776041 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 2 items received I0125 03:18:34.785574 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 1 items received I0125 03:18:42.766330 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 1 items received I0125 03:19:27.777500 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received I0125 03:20:30.788935 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 3 items received I0125 03:24:56.121071 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received I0125 03:25:01.778951 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received I0125 03:25:11.766910 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received I0125 03:26:03.823851 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received I0125 03:26:31.784810 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received I0125 03:27:49.774632 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received I0125 03:28:12.776145 1 
reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received I0125 03:31:11.841702 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received I0125 03:32:18.810700 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received I0125 03:32:48.791880 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received I0125 03:33:30.796087 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received I0125 03:34:15.145099 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received I0125 03:34:30.515231 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 03:34:30.534176 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 03:34:30.534324 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"28c68bdc-204f-11e9-93ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"891734", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 03:35:51.801910 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received I0125 03:36:34.860422 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received I0125 03:38:04.806577 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: 
Watch close - v1.Job total 0 items received I0125 03:38:05.805344 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received I0125 03:40:16.833541 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 1 items received I0125 03:40:45.167883 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received I0125 03:41:42.827339 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received I0125 03:41:57.823059 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received I0125 03:44:16.884442 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received I0125 03:44:28.817942 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received I0125 03:45:47.821219 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received I0125 03:47:22.836303 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received I0125 03:47:58.823310 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received I0125 03:48:21.818448 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received I0125 03:48:24.167674 1 
reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received I0125 03:51:16.815840 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received I0125 03:53:12.881663 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received I0125 03:53:50.828659 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received I0125 03:53:51.847324 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received I0125 03:55:11.825456 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received I0125 03:56:39.014595 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received I0125 03:57:26.833592 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received I0125 03:58:17.169242 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received I0125 03:59:18.935041 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received I0125 04:00:03.858067 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received I0125 04:01:44.889267 1 reflector.go:428] 
github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received I0125 04:02:14.896443 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received I0125 04:05:57.882297 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received I0125 04:06:08.920277 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received I0125 04:06:18.096011 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received I0125 04:07:17.258668 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received I0125 04:07:58.015796 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received I0125 04:08:52.912618 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received I0125 04:11:43.920551 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received I0125 04:12:44.272603 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received I0125 04:14:21.900919 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received I0125 04:15:25.036262 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: 
Watch close - v1alpha1.MPIJob total 0 items received I0125 04:15:54.107214 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received I0125 04:16:01.939574 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received I0125 04:18:30.930689 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received I0125 04:19:28.286933 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received I0125 04:21:29.936994 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received I0125 04:22:48.913339 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received I0125 04:23:43.118461 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received I0125 04:23:58.047598 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received I0125 04:24:15.942874 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received I0125 04:26:15.940009 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received I0125 04:28:14.292401 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received I0125 04:28:27.918065 1 
reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received I0125 04:28:30.941472 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received I0125 04:30:51.943274 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received I0125 04:33:26.038633 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received I0125 04:33:38.111279 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received I0125 04:33:52.915156 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received I0125 04:34:51.934198 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received I0125 04:36:27.291059 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received I0125 04:36:50.933273 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received I0125 04:38:57.947799 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received I0125 04:39:49.124212 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received I0125 04:40:16.052476 1 reflector.go:428] 
github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received I0125 04:40:47.946090 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received I0125 04:41:32.930535 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received I0125 04:42:38.301045 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received I0125 04:44:35.947271 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received I0125 04:44:58.934857 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received I0125 04:46:13.141053 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received I0125 04:47:19.068080 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received I0125 04:47:57.963649 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received I0125 04:49:39.946606 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received I0125 04:50:23.322448 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received I0125 04:52:19.078264 1 reflector.go:428] 
github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received I0125 04:52:42.972563 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received I0125 04:53:36.955935 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received I0125 04:55:22.148321 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received I0125 04:56:37.967168 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received I0125 04:56:38.949878 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received I0125 04:57:52.477410 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received I0125 04:58:44.968228 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received I0125 04:59:44.981480 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received I0125 05:00:20.085829 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received I0125 05:00:22.161406 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received I0125 05:01:37.978507 1 reflector.go:428] 
github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received I0125 05:05:48.474407 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received I0125 05:06:12.941281 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received I0125 05:06:31.966665 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received I0125 05:07:39.151202 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received I0125 05:08:48.974960 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received I0125 05:09:37.971271 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received I0125 05:09:43.078155 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received I0125 05:10:55.481294 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received I0125 05:12:45.978465 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received I0125 05:15:02.992260 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received I0125 05:15:25.955314 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: 
Watch close - v1.StatefulSet total 0 items received I0125 05:15:29.989218 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received I0125 05:15:51.165673 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received I0125 05:17:40.089633 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received I0125 05:19:59.501152 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received I0125 05:20:12.002284 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received I0125 05:21:13.989105 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received I0125 05:23:06.963989 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received I0125 05:24:08.096488 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received I0125 05:24:36.170896 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received I0125 05:25:21.987115 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received I0125 05:25:57.007163 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received I0125 
05:27:59.994137 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received I0125 05:28:44.505906 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received I0125 05:30:17.104170 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received I0125 05:31:23.017841 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received I0125 05:32:15.973257 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received I0125 05:32:44.183583 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received I0125 05:32:51.000239 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received I0125 05:33:07.010491 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received I0125 05:35:41.522278 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received I0125 05:38:13.118047 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received I0125 05:39:54.002493 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received I0125 05:40:06.983619 1 reflector.go:428] 
github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received I0125 05:40:40.193032 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received I0125 05:40:52.021573 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received I0125 05:41:01.029681 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received I0125 05:43:08.530287 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received I0125 05:45:38.991403 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received I0125 05:45:44.551289 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1 I0125 05:45:44.639232 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 05:45:44.639291 1 mpi_job_controller.go:736] ignoring orphaned object '/apis/rbac.authorization.k8s.io/v1/namespaces/arena-system/roles/mpi-dist-mpijob-launcher' of mpi job 'mpi-dist-mpijob' I0125 05:45:44.639805 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 05:45:44.639860 1 mpi_job_controller.go:736] ignoring orphaned object '/api/v1/namespaces/arena-system/serviceaccounts/mpi-dist-mpijob-launcher' of mpi job 'mpi-dist-mpijob' I0125 05:45:44.643977 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 05:45:44.644022 1 mpi_job_controller.go:736] ignoring orphaned object '/apis/apps/v1/namespaces/arena-system/statefulsets/mpi-dist-mpijob-worker' of mpi job 'mpi-dist-mpijob' I0125 05:45:44.652551 1 mpi_job_controller.go:726] Processing 
object: mpi-dist.v1
I0125 05:45:44.652596 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-config
I0125 05:45:44.652613 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher
I0125 05:45:44.652658 1 mpi_job_controller.go:736] ignoring orphaned object '/api/v1/namespaces/arena-system/configmaps/mpi-dist-mpijob-config' of mpi job 'mpi-dist-mpijob'
I0125 05:45:44.652672 1 mpi_job_controller.go:736] ignoring orphaned object '/apis/rbac.authorization.k8s.io/v1/namespaces/arena-system/rolebindings/mpi-dist-mpijob-launcher' of mpi job 'mpi-dist-mpijob'
I0125 05:46:40.029599 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 1 items received
I0125 05:47:49.028205 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 1 items received
I0125 05:48:00.123929 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 1 items received
I0125 05:48:53.008160 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received
I0125 05:49:05.637989 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 3 items received
I0125 05:49:34.198679 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 1 items received
E0125 05:49:53.881711 1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=353, ErrCode=NO_ERROR, debug=""
I0125 05:49:53.881972 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received
E0125 05:49:53.881931 1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=353, ErrCode=NO_ERROR, debug=""
I0125 05:49:53.882130 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received
E0125 05:49:53.882940 1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=353, ErrCode=NO_ERROR, debug=""
I0125 05:49:53.883093 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received
E0125 05:49:53.883025 1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=353, ErrCode=NO_ERROR, debug=""
I0125 05:49:53.883197 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 1 items received
E0125 05:49:53.883669 1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=353, ErrCode=NO_ERROR, debug=""
I0125 05:49:53.883750 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received
E0125 05:49:53.883876 1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=353, ErrCode=NO_ERROR, debug=""
I0125 05:49:53.883970 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received
E0125 05:49:53.884614 1 reflector.go:322] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to watch v1.RoleBinding: Get https://10.254.0.1:443/apis/rbac.authorization.k8s.io/v1/rolebindings?resourceVersion=905832&timeoutSeconds=341&watch=true: dial tcp 10.254.0.1:443: connect: connection refused
E0125 05:49:53.884864 1 reflector.go:322] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to watch v1.ServiceAccount: Get https://10.254.0.1:443/api/v1/serviceaccounts?resourceVersion=905828&timeoutSeconds=394&watch=true: dial tcp 10.254.0.1:443: connect: connection refused
E0125 05:49:53.885228 1 reflector.go:322] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Failed to watch v1alpha1.MPIJob: Get https://10.254.0.1:443/apis/kubeflow.org/v1alpha1/mpijobs?resourceVersion=905825&timeoutSeconds=461&watch=true: dial tcp 10.254.0.1:443: connect: connection refused
E0125 05:49:53.885255 1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=353, ErrCode=NO_ERROR, debug=""
I0125 05:49:53.885411 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received
E0125 05:49:53.885578 1 reflector.go:322] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to watch v1.Role: Get https://10.254.0.1:443/apis/rbac.authorization.k8s.io/v1/roles?resourceVersion=905827&timeoutSeconds=471&watch=true: dial tcp 10.254.0.1:443: connect: connection refused
E0125 05:49:53.886032 1 reflector.go:322] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to watch v1.StatefulSet: Get https://10.254.0.1:443/apis/apps/v1/statefulsets?resourceVersion=905829&timeoutSeconds=396&watch=true: dial tcp 10.254.0.1:443: connect: connection refused
E0125 05:49:53.886513 1 reflector.go:322] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to watch v1.Job: Get https://10.254.0.1:443/apis/batch/v1/jobs?resourceVersion=891449&timeoutSeconds=347&watch=true: dial tcp 10.254.0.1:443: connect: connection refused
E0125 05:49:53.886950 1 reflector.go:322] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to watch v1.ConfigMap: Get https://10.254.0.1:443/api/v1/configmaps?resourceVersion=905831&timeoutSeconds=453&watch=true: dial tcp 10.254.0.1:443: connect: connection refused
I0125 05:49:54.885269 1 reflector.go:240] Listing and watching v1.RoleBinding from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0125 05:49:54.886443 1 reflector.go:240] Listing and watching v1.ServiceAccount from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0125 05:49:54.887917 1 reflector.go:240] Listing and watching v1alpha1.MPIJob from github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62
I0125 05:49:54.889366 1 reflector.go:240] Listing and watching v1.Role from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0125 05:49:54.890906 1 reflector.go:240] Listing and watching v1.StatefulSet from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0125 05:49:54.892150 1 reflector.go:240] Listing and watching v1.Job from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0125 05:49:54.893328 1 reflector.go:240] Listing and watching v1.ConfigMap from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0125 05:51:12.537696 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1
I0125 05:51:12.626163 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1
I0125 05:51:12.639944 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-config
I0125 05:51:12.646946 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher
I0125 05:51:12.666341 1 mpi_job_controller.go:726]
Processing object: mpi-dist-mpijob-launcher I0125 05:51:12.666628 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 05:51:12.686556 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 05:51:12.703506 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 05:51:12.710869 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 05:51:12.710928 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"38a91898-2065-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"906382", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 05:51:12.718900 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 05:51:12.719003 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"38a91898-2065-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"906402", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 05:51:12.727406 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 05:51:12.727483 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"38a91898-2065-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"906402", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 05:51:12.727735 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 05:51:12.743644 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 05:51:12.743892 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"38a91898-2065-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"906402", FieldPath:""}): type: 'Normal' reason: 
'Synced' MPIJob synced successfully I0125 05:51:44.134862 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 05:51:44.154651 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 05:51:44.154836 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"38a91898-2065-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"906402", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 05:51:44.165880 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 05:51:44.166032 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"38a91898-2065-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"906507", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 05:55:07.720074 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 3 items received I0125 05:55:34.997246 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1 I0125 05:55:35.088945 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-config I0125 05:55:35.088990 1 mpi_job_controller.go:736] ignoring orphaned object '/api/v1/namespaces/arena-system/configmaps/mpi-dist-mpijob-config' of mpi job 'mpi-dist-mpijob' I0125 05:55:35.096724 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1 I0125 05:55:35.099468 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 05:55:35.099504 1 mpi_job_controller.go:736] ignoring orphaned object '/apis/rbac.authorization.k8s.io/v1/namespaces/arena-system/roles/mpi-dist-mpijob-launcher' of mpi job 'mpi-dist-mpijob' I0125 05:55:35.105041 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 05:55:35.105091 1 mpi_job_controller.go:736] ignoring orphaned object 
'/apis/apps/v1/namespaces/arena-system/statefulsets/mpi-dist-mpijob-worker' of mpi job 'mpi-dist-mpijob' I0125 05:55:35.106206 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 05:55:35.106256 1 mpi_job_controller.go:736] ignoring orphaned object '/api/v1/namespaces/arena-system/serviceaccounts/mpi-dist-mpijob-launcher' of mpi job 'mpi-dist-mpijob' I0125 05:55:35.106506 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 05:55:35.106549 1 mpi_job_controller.go:736] ignoring orphaned object '/apis/rbac.authorization.k8s.io/v1/namespaces/arena-system/rolebindings/mpi-dist-mpijob-launcher' of mpi job 'mpi-dist-mpijob' I0125 05:56:24.952804 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1 I0125 05:56:25.021298 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1 I0125 05:56:25.034200 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-config I0125 05:56:25.041586 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 05:56:25.060926 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 05:56:25.063399 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 05:56:25.080468 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 05:56:25.090011 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 05:56:25.097658 1 mpi_job_controller.go:367] Successfully synced 'default/mpi-dist-mpijob' I0125 05:56:25.097893 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"default", Name:"mpi-dist-mpijob", UID:"f2dd4c5f-2065-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"907049", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully E0125 05:56:25.103378 1 mpi_job_controller.go:372] error syncing 'default/mpi-dist-mpijob': Operation cannot be fulfilled on mpijobs.kubeflow.org "mpi-dist-mpijob": the object has been modified; 
please apply your changes to the latest version and try again I0125 05:56:25.110479 1 mpi_job_controller.go:367] Successfully synced 'default/mpi-dist-mpijob' I0125 05:56:25.110546 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"default", Name:"mpi-dist-mpijob", UID:"f2dd4c5f-2065-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"907070", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 05:56:25.118472 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 05:56:25.130080 1 mpi_job_controller.go:367] Successfully synced 'default/mpi-dist-mpijob' I0125 05:56:25.130211 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"default", Name:"mpi-dist-mpijob", UID:"f2dd4c5f-2065-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"907070", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 05:57:00.839807 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 05:57:00.862215 1 mpi_job_controller.go:367] Successfully synced 'default/mpi-dist-mpijob' I0125 05:57:00.862305 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"default", Name:"mpi-dist-mpijob", UID:"f2dd4c5f-2065-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"907070", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 05:57:00.874169 1 mpi_job_controller.go:367] Successfully synced 'default/mpi-dist-mpijob' I0125 05:57:00.874261 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"default", Name:"mpi-dist-mpijob", UID:"f2dd4c5f-2065-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"907183", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 05:57:41.376102 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 7 items received I0125 
05:58:01.724872 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1 I0125 05:58:01.808136 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 05:58:01.808190 1 mpi_job_controller.go:736] ignoring orphaned object '/apis/rbac.authorization.k8s.io/v1/namespaces/default/roles/mpi-dist-mpijob-launcher' of mpi job 'mpi-dist-mpijob' I0125 05:58:01.810033 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1 I0125 05:58:01.810080 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-config I0125 05:58:01.810107 1 mpi_job_controller.go:736] ignoring orphaned object '/api/v1/namespaces/default/configmaps/mpi-dist-mpijob-config' of mpi job 'mpi-dist-mpijob' I0125 05:58:01.818343 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 05:58:01.818388 1 mpi_job_controller.go:736] ignoring orphaned object '/apis/rbac.authorization.k8s.io/v1/namespaces/default/rolebindings/mpi-dist-mpijob-launcher' of mpi job 'mpi-dist-mpijob' I0125 05:58:01.819158 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 05:58:01.819203 1 mpi_job_controller.go:736] ignoring orphaned object '/api/v1/namespaces/default/serviceaccounts/mpi-dist-mpijob-launcher' of mpi job 'mpi-dist-mpijob' I0125 05:58:01.824695 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 05:58:01.824740 1 mpi_job_controller.go:736] ignoring orphaned object '/apis/apps/v1/namespaces/default/statefulsets/mpi-dist-mpijob-worker' of mpi job 'mpi-dist-mpijob' I0125 05:58:14.440547 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 8 items received I0125 05:58:19.375052 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 6 items received I0125 05:58:20.377577 1 reflector.go:428] 
github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received I0125 05:58:29.370781 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 4 items received I0125 05:58:38.381310 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 4 items received I0125 05:58:48.037555 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1 I0125 05:58:48.104704 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1 I0125 05:58:48.115972 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-config I0125 05:58:48.129883 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 05:58:48.149367 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 05:58:48.150597 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 05:58:48.172131 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 05:58:48.178713 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 05:58:48.185522 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 05:58:48.185698 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"48260341-2066-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"907409", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully E0125 05:58:48.192252 1 mpi_job_controller.go:372] error syncing 'arena-system/mpi-dist-mpijob': Operation cannot be fulfilled on mpijobs.kubeflow.org "mpi-dist-mpijob": the object has been modified; please apply your changes to the latest version and try again I0125 05:58:48.197605 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 05:58:48.197668 1 
event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"48260341-2066-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"907431", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 05:58:48.201800 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 05:58:48.212278 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 05:58:48.212368 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"48260341-2066-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"907431", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 06:00:04.344930 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1 I0125 06:00:04.437639 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 06:00:04.437684 1 mpi_job_controller.go:736] ignoring orphaned object '/apis/rbac.authorization.k8s.io/v1/namespaces/arena-system/roles/mpi-dist-mpijob-launcher' of mpi job 'mpi-dist-mpijob' I0125 06:00:04.437998 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1 I0125 06:00:04.438043 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-config I0125 06:00:04.438088 1 mpi_job_controller.go:736] ignoring orphaned object '/api/v1/namespaces/arena-system/configmaps/mpi-dist-mpijob-config' of mpi job 'mpi-dist-mpijob' I0125 06:00:04.447608 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 06:00:04.447653 1 mpi_job_controller.go:736] ignoring orphaned object '/apis/apps/v1/namespaces/arena-system/statefulsets/mpi-dist-mpijob-worker' of mpi job 'mpi-dist-mpijob' I0125 06:00:04.447689 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 06:00:04.447725 1 mpi_job_controller.go:736] ignoring orphaned object 
'/apis/rbac.authorization.k8s.io/v1/namespaces/arena-system/rolebindings/mpi-dist-mpijob-launcher' of mpi job 'mpi-dist-mpijob' I0125 06:00:04.448721 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 06:00:04.448757 1 mpi_job_controller.go:736] ignoring orphaned object '/api/v1/namespaces/arena-system/serviceaccounts/mpi-dist-mpijob-launcher' of mpi job 'mpi-dist-mpijob' I0125 06:00:16.728595 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 15 items received I0125 06:01:16.049382 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1 I0125 06:01:16.110685 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1 I0125 06:01:16.126099 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-config I0125 06:01:16.132674 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 06:01:16.151933 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 06:01:16.153260 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 06:01:16.169771 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 06:01:16.181272 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 06:01:16.188885 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 06:01:16.189078 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"a05e4c33-2066-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"907753", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 06:01:16.194969 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 06:01:16.195185 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"a05e4c33-2066-11e9-88ed-ac1f6b84c124", 
APIVersion:"kubeflow.org", ResourceVersion:"907774", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 06:01:16.201039 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 06:01:16.201097 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"a05e4c33-2066-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"907774", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 06:01:16.208279 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 06:01:16.219381 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 06:01:16.219517 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"a05e4c33-2066-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"907774", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 06:02:18.482720 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 06:02:18.504567 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 06:02:18.504664 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"a05e4c33-2066-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"907774", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 06:02:18.516475 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 06:02:18.516556 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"a05e4c33-2066-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"907934", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 06:03:10.386193 1 reflector.go:428] 
github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 7 items received I0125 06:04:53.383088 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 3 items received I0125 06:05:04.379249 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received I0125 06:05:04.450061 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 6 items received I0125 06:05:12.380160 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 3 items received I0125 06:06:20.737881 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 3 items received I0125 06:07:26.376869 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 5 items received I0125 06:10:54.451619 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received I0125 06:10:56.395679 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received I0125 06:11:51.381933 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received I0125 06:11:53.392881 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received I0125 06:12:35.388371 1 reflector.go:428] 
github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received I0125 06:15:28.746866 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received I0125 06:15:56.386145 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received I0125 06:16:11.461189 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received I0125 06:17:26.394824 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received I0125 06:18:54.391599 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received I0125 06:20:55.397436 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received I0125 06:20:55.756612 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received I0125 06:21:12.395438 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received I0125 06:21:29.397316 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received I0125 06:23:09.490450 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1 I0125 06:23:09.583878 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1 I0125 06:23:09.585278 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-config I0125 06:23:09.585349 1 
mpi_job_controller.go:736] ignoring orphaned object '/api/v1/namespaces/arena-system/configmaps/mpi-dist-mpijob-config' of mpi job 'mpi-dist-mpijob' I0125 06:23:09.585571 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 06:23:09.585647 1 mpi_job_controller.go:736] ignoring orphaned object '/apis/rbac.authorization.k8s.io/v1/namespaces/arena-system/roles/mpi-dist-mpijob-launcher' of mpi job 'mpi-dist-mpijob' I0125 06:23:09.591902 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 06:23:09.591956 1 mpi_job_controller.go:736] ignoring orphaned object '/apis/rbac.authorization.k8s.io/v1/namespaces/arena-system/rolebindings/mpi-dist-mpijob-launcher' of mpi job 'mpi-dist-mpijob' I0125 06:23:09.593396 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 06:23:09.593437 1 mpi_job_controller.go:736] ignoring orphaned object '/api/v1/namespaces/arena-system/serviceaccounts/mpi-dist-mpijob-launcher' of mpi job 'mpi-dist-mpijob' I0125 06:23:09.595269 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 06:23:09.595321 1 mpi_job_controller.go:736] ignoring orphaned object '/apis/apps/v1/namespaces/arena-system/statefulsets/mpi-dist-mpijob-worker' of mpi job 'mpi-dist-mpijob' I0125 06:23:41.403978 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 1 items received I0125 06:23:53.485698 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1 I0125 06:23:53.557122 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1 I0125 06:23:53.568328 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-config I0125 06:23:53.575053 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 06:23:53.594692 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 06:23:53.594724 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher 
I0125 06:23:53.615210 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 06:23:53.624980 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 06:23:53.631835 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 06:23:53.632069 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"c97809f3-2069-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"910120", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 06:23:53.638975 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 06:23:53.639187 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"c97809f3-2069-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"910141", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 06:23:53.646287 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 06:23:53.646359 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"c97809f3-2069-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"910141", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 06:23:53.654713 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 06:23:53.665721 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 06:23:53.665856 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"c97809f3-2069-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"910141", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 06:24:45.471173 1 reflector.go:428] 
github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 3 items received I0125 06:25:42.513340 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 06:25:42.534343 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 06:25:42.534451 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"c97809f3-2069-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"910141", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 06:25:42.546436 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 06:25:42.546647 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"c97809f3-2069-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"910388", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 06:26:18.404809 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 3 items received I0125 06:26:58.758335 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 6 items received I0125 06:28:08.400954 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 2 items received I0125 06:29:08.413920 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - *v1.Role total 1 items received
I searched the log for the keyword "error" and found "E0125 05:58:48.192252 1 mpi_job_controller.go:372] error syncing 'arena-system/mpi-dist-mpijob': Operation cannot be fulfilled on mpijobs.kubeflow.org "mpi-dist-mpijob": the object has been modified; please apply your changes to the latest version and try again" and "E0125 05:56:25.103378 1 mpi_job_controller.go:372] error syncing 'default/mpi-dist-mpijob': Operation cannot be fulfilled on mpijobs.kubeflow.org "mpi-dist-mpijob": the object has been modified; please apply your changes to the latest version and try again". What could cause these errors? Thank you.
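For anyone repeating this search: the operator log uses the glog line format, where the first character is the severity (I, W, E, or F) followed by the MMDD date. A minimal sketch of a filter that keeps only error and fatal entries is shown here. The function name glog_errors is made up for illustration, and the sketch assumes the log file has one entry per line (unlike the flattened paste above).

```shell
# Print only error- and fatal-level glog entries (lines starting with E or F
# followed by the MMDD date) from the given files, or from stdin.
# glog_errors is a hypothetical helper name, not part of arena or mpi-operator.
glog_errors() {
  grep -E '^[EF][0-9]{4} ' "$@"
}

# Example with two entries in the format shown in this thread.
printf '%s\n' \
  "I0125 05:58:48.185522 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob'" \
  "E0125 05:58:48.192252 1 mpi_job_controller.go:372] error syncing 'arena-system/mpi-dist-mpijob'" \
  | glog_errors
```

Against the file collected earlier in the thread, this would be called as glog_errors /tmp/mpi-operator.log. Note that in the log pasted above, each "object has been modified" error is followed within milliseconds by a "Successfully synced" entry for the same job.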
I think the reason is that the init container (git) failed. For details, you can try: kubectl logs -c init-code mpi-dist-mpijob-worker-0
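Since the job has more than one worker, it may help to check the init container on every worker pod, not only worker-0. The following is a rough sketch, assuming the pods are named <job>-worker-<index> (as in mpi-dist-mpijob-worker-0 above) and that the init container is called init-code. The function name worker_init_logs and its argument order are hypothetical.

```shell
# Print the logs of one init container for each worker pod of an MPIJob.
# Arguments: <job-name> <worker-count> [namespace] [init-container-name]
# Assumes worker pods are named <job-name>-worker-<index>, starting at 0.
worker_init_logs() {
  job=$1
  workers=$2
  ns=${3:-default}
  container=${4:-init-code}
  i=0
  while [ "$i" -lt "$workers" ]; do
    pod="${job}-worker-${i}"
    echo "== ${pod} (${container}) =="
    # A failing pod should not stop the loop, so the exit status is ignored.
    kubectl logs -n "$ns" -c "$container" "$pod" || true
    i=$((i + 1))
  done
}
```

For the job in this thread, the call would be something like worker_init_logs mpi-dist-mpijob 2 arena-system, with the namespace adjusted to wherever the job was submitted.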
@cheyang Yes, git is at fault: 'mpi-dist-mpijob-worker-0' (192.168.110.25, master) has a git problem, but 'mpi-dist-mpijob-worker-1' (192.168.110.158, node) is OK. How can I solve this problem? Thanks.
As you know, network access to github.com from China is not stable, which is why you are seeing this issue. I think you can build a Docker image the way I did in https://github.com/cheyang/tensorflow-sample-code/tree/master/mpijob/docker so that the job does not rely on Internet access.
@cheyang Thank you for your kind reply. The issue above has disappeared, but the status of mpi-dist-mpijob-launcher is now Pending. Running 'kubectl describe po mpi-dist-mpijob-launcher-wl9nt --namespace arena-system' gives the following info:
Events:
Type Reason Age From Message
Warning FailedScheduling 5m36s (x27 over 15m) default-scheduler 0/2 nodes are available: 2 node(s) didn't match node selector.
I don't know why. Do you have any hints? Thanks.
It indicates that your nodes' labels don't match what the pod mpi-dist-mpijob-launcher-wl9nt requires. You can check by using kubectl get po mpi-dist-mpijob-launcher-wl9nt --namespace arena-system -o=yaml and kubectl get no -o=yaml
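The full YAML output is long, so a narrower comparison may be easier to read. This is a small sketch, not the only way to do it: one helper prints just the pod's nodeSelector, the other lists every node with its labels. The function names pod_node_selector and node_labels are invented for this example.

```shell
# Print only the nodeSelector of a pod.
# Arguments: <pod-name> [namespace]
pod_node_selector() {
  kubectl get pod "$1" -n "${2:-default}" -o jsonpath='{.spec.nodeSelector}'
}

# List all nodes together with their labels, to compare against the selector.
node_labels() {
  kubectl get nodes --show-labels
}
```

For the pod above, that would be pod_node_selector mpi-dist-mpijob-launcher-wl9nt arena-system followed by node_labels. If a key/value pair in the selector is missing on every node, a label can be added with kubectl label nodes <node-name> <key>=<value>.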
@cheyang Thank you. I added a label to each node, and it runs now.
When I run the demo, I run into a problem.
Should I deploy Kubeflow in advance? @cheyang