kubeflow / arena

A CLI for Kubeflow.
Apache License 2.0

MPIjob Error #32

Closed xieydd closed 6 years ago

xieydd commented 6 years ago

When I run the demo, I hit a problem:

Error: apiVersion "kubeflow.org/v1alpha1" in mpijob/templates/mpijob.yaml is not available

Should I deploy Kubeflow in advance? @cheyang

cheyang commented 6 years ago

Did you deploy the MPI operator? Please check step 7 of https://github.com/AliyunContainerService/arena/tree/master/docs/installation.

kubectl create -f arena/kubernetes-artifacts/mpi-operator/mpi-operator.yaml
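
To confirm that the operator's CRD is registered before submitting a job, you can check whether the API group/version from the error message is served. Below is a minimal sketch, assuming kubectl points at the same cluster; has_api_version is a hypothetical helper name, and it only filters the output of kubectl api-versions, which prints one group/version per line.

```shell
#!/bin/sh
# has_api_version: read `kubectl api-versions` output on stdin and succeed
# only if the given group/version appears as a whole line.
# (Hypothetical helper name; not part of arena or kubectl.)
has_api_version() {
    grep -qxF "$1"
}

# Usage against a live cluster (requires kubectl access):
#   kubectl api-versions | has_api_version kubeflow.org/v1alpha1 \
#       && echo "MPIJob API is served" \
#       || echo "MPIJob API missing: deploy mpi-operator first"
```

If the check fails, the "apiVersion ... is not available" error is expected until the mpi-operator manifest has been created.
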
xieydd commented 6 years ago

@cheyang Thanks a lot.

xieydd commented 6 years ago

When I run an MPIJob:

arena submit mpi --name=mpi-dist              \
              --gpus=0              \
              --workers=2              \
              --image=horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5  \
              --syncMode=git \
              --syncSource=https://github.com/tensorflow/benchmarks.git \
              "mpirun python code/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64     --variable_update horovod --train_dir=/training_logs --summary_verbosity=3 --save_summaries_steps=10"

arena list shows the job, but arena get returns nothing:

$ ./arena list                            
NAME      STATUS   TRAINER  AGE  NODE
mpi-dist  RUNNING  MPIJOB   55s  

$ ./arena get mpi-dist                    
NAME  STATUS  TRAINER  AGE  INSTANCE  NODE

@cheyang Can you help me? Thanks a lot.

cheyang commented 6 years ago
  1. Show me the output of kubectl get po | grep mpi-dist.
  2. Show me the output of kubectl get mpijob -o=yaml.
  3. Upload the mpi-operator log:

     # kubectl get po -n arena-system -o=name| grep mpi
      pod/mpi-operator-b589fbf6b-8fjw7
     # kubectl logs -n arena-system mpi-operator-b589fbf6b-8fjw7 &> /tmp/mpi-operator.log
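
The pod name changes with every rollout, so the lookup and the log capture can be chained. This is only a sketch under assumptions: first_operator_pod is a hypothetical helper, and the pod/mpi-operator- prefix is taken from the pod name shown above.

```shell
#!/bin/sh
# first_operator_pod: read `kubectl get po -o=name` output on stdin and print
# the first entry whose name starts with "pod/mpi-operator-".
# (Hypothetical helper; the prefix is an assumption based on the pod above.)
first_operator_pod() {
    awk '/^pod\/mpi-operator-/ { print; exit }'
}

# Usage against a live cluster (requires kubectl access):
#   pod=$(kubectl get po -n arena-system -o=name | first_operator_pod)
#   kubectl logs -n arena-system "$pod" > /tmp/mpi-operator.log 2>&1
```
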
    
xieydd commented 6 years ago

The problem is that I can create the job, but no pods are created. @cheyang

cheyang commented 6 years ago

I think it's an mpi-operator issue, but I'm not able to reproduce it on my machine. Can you provide the mpi-operator log so I can investigate? Thanks.

Can you do the following steps to collect the logs?

# kubectl get mpijob -o=yaml
# kubectl get po -n arena-system -o=name | grep mpi
pod/mpi-operator-b589fbf6b-8fjw7
# kubectl logs -n arena-system mpi-operator-b589fbf6b-8fjw7 &> /tmp/mpi-operator.log
xieydd commented 6 years ago

@cheyang All right, I will provide the log tomorrow morning. Thanks a lot.

xieydd commented 6 years ago

@cheyang Here is my log; it looks like an Unauthorized error.

$  kubectl get mpijob -o=yaml
apiVersion: v1
items:
- apiVersion: kubeflow.org/v1alpha1
  kind: MPIJob
  metadata:
    clusterName: ""
    creationTimestamp: 2018-08-24T09:58:33Z
    generation: 1
    labels:
      app: mpijob
      chart: mpijob-0.2.0
      createdBy: MPIJob
      heritage: Tiller
      release: mpi-dist
    name: mpi-dist-mpijob
    namespace: default
    resourceVersion: "2717151"
    selfLink: /apis/kubeflow.org/v1alpha1/namespaces/default/mpijobs/mpi-dist-mpijob
    uid: 42db2871-a784-11e8-b49c-002590c0f788
  spec:
    BackoffLimit: 0
    launcherOnMaster: true
    replicas: 2
    template:
      metadata:
        labels:
          app: mpijob
          chart: mpijob-0.2.0
          createdBy: MPIJob
          heritage: Tiller
          release: mpi-dist
        name: mpi-dist-mpijob
      spec:
        containers:
        - command:
          - sh
          - -c
          - mpirun python code/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
            --model resnet101 --batch_size 64     --variable_update horovod --train_dir=/training_logs
            --summary_verbosity=3 --save_summaries_steps=10
          env:
          - name: gpus
            value: "0"
          - name: workers
            value: "2"
          image: bootstrapper:5000/sextant/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5
          imagePullPolicy: null
          name: mpi
          resources:
            limits: null
            requests: null
          volumeMounts:
          - mountPath: /root/code
            name: code-sync
          - mountPath: /dev/shm
            name: dshm
          workingDir: /root
        dnsPolicy: ClusterFirstWithHostNet
        hostIPC: true
        hostNetwork: true
        hostPID: true
        initContainers:
        - env:
          - name: gpus
            value: "0"
          - name: workers
            value: "2"
          - name: GIT_SYNC_REPO
            value: https://github.com/tensorflow/benchmarks.git
          - name: GIT_SYNC_DEST
            value: benchmarks
          - name: GIT_SYNC_ROOT
            value: /code
          - name: GIT_SYNC_ONE_TIME
            value: "true"
          image: registry.cn-zhangjiakou.aliyuncs.com/tensorflow-samples/git-sync:v2.0.6
          imagePullPolicy: null
          name: init-code
          volumeMounts:
          - mountPath: /code
            name: code-sync
        restartPolicy: Never
        volumes:
        - emptyDir: {}
          name: code-sync
        - emptyDir:
            medium: Memory
            sizeLimit: 2Gi
          name: dshm
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
$  kubectl get po -n arena-system -o=name| grep mpi
pod/mpi-operator-65d474df56-ctgqh
$ cat /tmp/mpi-operator.log 
W0827 07:49:24.265932       1 client_config.go:529] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0827 07:49:24.268296       1 mpi_job_controller.go:150] Creating event broadcaster
I0827 07:49:24.268407       1 mpi_job_controller.go:179] Setting up event handlers
I0827 07:49:24.268459       1 mpi_job_controller.go:297] Starting MPIJob controller
I0827 07:49:24.268467       1 mpi_job_controller.go:300] Waiting for informer caches to sync
I0827 07:49:24.268836       1 reflector.go:202] Starting reflector *v1alpha1.MPIJob (30s) from github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62
I0827 07:49:24.268851       1 reflector.go:202] Starting reflector *v1.Role (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0827 07:49:24.268865       1 reflector.go:240] Listing and watching *v1alpha1.MPIJob from github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62
I0827 07:49:24.268869       1 reflector.go:202] Starting reflector *v1.ServiceAccount (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0827 07:49:24.268881       1 reflector.go:240] Listing and watching *v1.ServiceAccount from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0827 07:49:24.268891       1 reflector.go:202] Starting reflector *v1.Job (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0827 07:49:24.268905       1 reflector.go:240] Listing and watching *v1.Job from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0827 07:49:24.268956       1 reflector.go:202] Starting reflector *v1.StatefulSet (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0827 07:49:24.268956       1 reflector.go:202] Starting reflector *v1.RoleBinding (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0827 07:49:24.268973       1 reflector.go:240] Listing and watching *v1.StatefulSet from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0827 07:49:24.268977       1 reflector.go:240] Listing and watching *v1.RoleBinding from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0827 07:49:24.268993       1 reflector.go:202] Starting reflector *v1.ConfigMap (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0827 07:49:24.269017       1 reflector.go:240] Listing and watching *v1.ConfigMap from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0827 07:49:24.268870       1 reflector.go:240] Listing and watching *v1.Role from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
E0827 07:49:24.275305       1 reflector.go:205] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.Job: Unauthorized
E0827 07:49:24.276204       1 reflector.go:205] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.StatefulSet: Unauthorized
E0827 07:49:24.276886       1 reflector.go:205] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.ServiceAccount: Unauthorized
E0827 07:49:24.279245       1 reflector.go:205] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Failed to list *v1alpha1.MPIJob: Unauthorized
E0827 07:49:24.279351       1 reflector.go:205] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.RoleBinding: Unauthorized
E0827 07:49:24.279385       1 reflector.go:205] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.Role: Unauthorized
E0827 07:49:24.279643       1 reflector.go:205] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.ConfigMap: Unauthorized
$ kubectl describe clusterrole mpi-operator    
Name:         mpi-operator
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"rbac.authorization.k8s.io/v1","kind":"ClusterRole","metadata":{"annotations":{},"name":"mpi-operator","namespace":""},"rules":[{"apiGrou...
PolicyRule:
  Resources                                       Non-Resource URLs  Resource Names  Verbs
  ---------                                       -----------------  --------------  -----
  configmaps                                      []                 []              [create list watch]
  events                                          []                 []              [create patch]
  pods                                            []                 []              [get]
  pods/exec                                       []                 []              [create]
  serviceaccounts                                 []                 []              [create list watch]
  customresourcedefinitions.apiextensions.k8s.io  []                 []              [create get]
  statefulsets.apps                               []                 []              [create list update watch]
  jobs.batch                                      []                 []              [create list update watch]
  mpijobs.kubeflow.org                            []                 []              [*]
  rolebindings.rbac.authorization.k8s.io          []                 []              [create list watch]
  roles.rbac.authorization.k8s.io                 []                 []              [create list watch]
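
A note on reading the log above: in Kubernetes, Unauthorized corresponds to HTTP 401, meaning the API server rejected the request's credentials, whereas a missing RBAC permission is normally reported as Forbidden (HTTP 403). Since the ClusterRole already grants list on these resources, the operator's service account token may be worth checking before the rules. The sketch below summarizes which resource kinds failed and with what reason; failed_lists is a hypothetical helper that relies only on the log format shown above.

```shell
#!/bin/sh
# failed_lists: read mpi-operator log lines on stdin and print one
# "<kind> <reason>" pair per distinct reflector list failure, sorted.
# (Hypothetical helper; it assumes the "Failed to list *<kind>: <reason>"
# format seen in the log above.)
failed_lists() {
    sed -n 's/.*Failed to list \*\([^:]*\): \(.*\)$/\1 \2/p' | LC_ALL=C sort -u
}

# Usage:
#   failed_lists < /tmp/mpi-operator.log
```

To test the RBAC side separately, kubectl auth can-i list jobs.batch --as=system:serviceaccount:arena-system:mpi-operator can be run by a cluster admin; the arena-system namespace for the service account is an assumption here.
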
xieydd commented 6 years ago

@cheyang Can you help me? Thanks a lot.

xieydd commented 6 years ago

@cheyang I synchronized with the upstream code, and that fixed it.

$ kubectl logs mpi-operator-65d474df56-4456c -n arena-system
W0828 03:09:47.434902       1 client_config.go:529] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0828 03:09:47.437611       1 mpi_job_controller.go:150] Creating event broadcaster
I0828 03:09:47.437765       1 mpi_job_controller.go:179] Setting up event handlers
I0828 03:09:47.437868       1 mpi_job_controller.go:297] Starting MPIJob controller
I0828 03:09:47.437880       1 mpi_job_controller.go:300] Waiting for informer caches to sync
I0828 03:09:47.438212       1 reflector.go:202] Starting reflector *v1.Role (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0828 03:09:47.438239       1 reflector.go:240] Listing and watching *v1.Role from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0828 03:09:47.438291       1 reflector.go:202] Starting reflector *v1.Job (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0828 03:09:47.438313       1 reflector.go:240] Listing and watching *v1.Job from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0828 03:09:47.438346       1 reflector.go:202] Starting reflector *v1.ConfigMap (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0828 03:09:47.438357       1 reflector.go:202] Starting reflector *v1.RoleBinding (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0828 03:09:47.438382       1 reflector.go:240] Listing and watching *v1.RoleBinding from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0828 03:09:47.438410       1 reflector.go:202] Starting reflector *v1.ServiceAccount (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0828 03:09:47.438431       1 reflector.go:240] Listing and watching *v1.ServiceAccount from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0828 03:09:47.438467       1 reflector.go:202] Starting reflector *v1alpha1.MPIJob (30s) from github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62
I0828 03:09:47.438491       1 reflector.go:240] Listing and watching *v1alpha1.MPIJob from github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62
I0828 03:09:47.438363       1 reflector.go:240] Listing and watching *v1.ConfigMap from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0828 03:09:47.439106       1 reflector.go:202] Starting reflector *v1.StatefulSet (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0828 03:09:47.439126       1 reflector.go:240] Listing and watching *v1.StatefulSet from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0828 03:09:47.597929       1 mpi_job_controller.go:726] Processing object: system::leader-locking-kube-scheduler
I0828 03:09:47.597967       1 mpi_job_controller.go:726] Processing object: system:controller:bootstrap-signer
I0828 03:09:47.598003       1 mpi_job_controller.go:726] Processing object: system:controller:cloud-provider
I0828 03:09:47.598031       1 mpi_job_controller.go:726] Processing object: system:controller:token-cleaner
I0828 03:09:47.598039       1 mpi_job_controller.go:726] Processing object: system:controller:token-cleaner
I0828 03:09:47.598066       1 mpi_job_controller.go:726] Processing object: system::leader-locking-kube-controller-manager
I0828 03:09:47.598093       1 mpi_job_controller.go:726] Processing object: system::leader-locking-kube-scheduler
I0828 03:09:47.598105       1 mpi_job_controller.go:726] Processing object: system:controller:bootstrap-signer
I0828 03:09:47.598114       1 mpi_job_controller.go:726] Processing object: kubernetes-dashboard-minimal
I0828 03:09:47.598045       1 mpi_job_controller.go:726] Processing object: system:controller:bootstrap-signer
I0828 03:09:47.598140       1 mpi_job_controller.go:726] Processing object: kubernetes-dashboard-minimal
I0828 03:09:47.598163       1 mpi_job_controller.go:726] Processing object: system::leader-locking-kube-controller-manager
I0828 03:09:47.598126       1 mpi_job_controller.go:726] Processing object: extension-apiserver-authentication-reader
I0828 03:09:47.598181       1 mpi_job_controller.go:726] Processing object: system:controller:bootstrap-signer
I0828 03:09:47.598196       1 mpi_job_controller.go:726] Processing object: system:controller:cloud-provider
I0828 03:09:47.607497       1 mpi_job_controller.go:726] Processing object: coredns
I0828 03:09:47.607527       1 mpi_job_controller.go:726] Processing object: kubernetes-dashboard-settings
I0828 03:09:47.607548       1 mpi_job_controller.go:726] Processing object: tf-job-operator-config
I0828 03:09:47.607566       1 mpi_job_controller.go:726] Processing object: mpi-dist.v1
I0828 03:09:47.607581       1 mpi_job_controller.go:726] Processing object: extension-apiserver-authentication
I0828 03:09:47.608017       1 mpi_job_controller.go:726] Processing object: default
I0828 03:09:47.609484       1 mpi_job_controller.go:726] Processing object: kubernetes-dashboard
I0828 03:09:47.609639       1 mpi_job_controller.go:726] Processing object: heapster
I0828 03:09:47.609667       1 mpi_job_controller.go:726] Processing object: default
I0828 03:09:47.609685       1 mpi_job_controller.go:726] Processing object: tf-job-dashboard
I0828 03:09:47.609707       1 mpi_job_controller.go:726] Processing object: tf-job-operator
I0828 03:09:47.609718       1 mpi_job_controller.go:726] Processing object: mpi-operator
I0828 03:09:47.609726       1 mpi_job_controller.go:726] Processing object: default
I0828 03:09:47.609736       1 mpi_job_controller.go:726] Processing object: default
I0828 03:09:47.609745       1 mpi_job_controller.go:726] Processing object: coredns
I0828 03:09:47.609760       1 mpi_job_controller.go:726] Processing object: tiller
I0828 03:09:47.609774       1 mpi_job_controller.go:726] Processing object: jobmon
I0828 03:09:47.638061       1 shared_informer.go:122] caches populated
I0828 03:09:47.638081       1 mpi_job_controller.go:305] Starting workers
I0828 03:09:47.638095       1 mpi_job_controller.go:311] Started workers
cheyang commented 6 years ago

Sorry, I didn't get a chance to take a look at it. Glad to hear that you fixed it! Thank you.

xieydd commented 6 years ago

@cheyang I have a question: all of the job's pods use the host IP. Why not use a virtual IP? I tested with Calico, where the pod IP is not the host IP, and the MPIJob runs successfully.

This run only uses one GPU; I think it is an MPI error:

[xieyd@ec-0d-9a-20-99-00 templates]$ kubectl logs mpijob-test-launcher-h6kdr
+ POD_NAME=mpijob-test-worker-1
+ shift
+ /opt/kube/kubectl exec mpijob-test-worker-1 -- /bin/sh -c     PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "3371302912" -mca ess_base_vpid 2 -mca ess_base_num_procs "3" -mca orte_node_regex "mpijob-test-launcher-vmtsv,mpijob-test-worker-0,mpijob-test-worker-1@0(3)" -mca orte_hnp_uri "3371302912.0;tcp://10.99.111.171:45950" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=mpijob-test-worker-0
+ shift
+ /opt/kube/kubectl exec mpijob-test-worker-0 -- /bin/sh -c     PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "3371302912" -mca ess_base_vpid 1 -mca ess_base_num_procs "3" -mca orte_node_regex "mpijob-test-launcher-vmtsv,mpijob-test-worker-0,mpijob-test-worker-1@0(3)" -mca orte_hnp_uri "3371302912.0;tcp://10.99.111.171:45950" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca pmix "^s1,s2,cray,isolated"
error: You must be logged in to the server (Unauthorized)
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   mpijob-test-launcher-vmtsv
  target node:  mpijob-test-worker-0

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    mpijob-test-worker-0
  Remote host:   mpijob-test-launcher-vmtsv
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
command terminated with exit code 1
W0829 09:02:27.764325 139621931185920 tf_logging.py:125] From /root/code/rev-221558d8f76d53c41daed424ab2702a7b79f56ff/scripts/tf_cnn_benchmarks/benchmark_cnn.py:1816: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-08-29 09:02:28.967068: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-08-29 09:02:31.530612: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties: 
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:04:00.0
totalMemory: 11.90GiB freeMemory: 7.37GiB
2018-08-29 09:02:31.530692: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2018-08-29 09:02:32.409454: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-29 09:02:32.409514: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 
2018-08-29 09:02:32.409525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N 
2018-08-29 09:02:32.409957: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7102 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:04:00.0, compute capability: 6.1)
I0829 09:02:34.053395 139621931185920 tf_logging.py:115] Running local_init_op.
I0829 09:02:34.441133 139621931185920 tf_logging.py:115] Done running local_init_op.
I0829 09:02:39.621228 139621931185920 tf_logging.py:115] Starting standard services.
I0829 09:02:39.732469 139621931185920 tf_logging.py:115] Starting queue runners.
I0829 09:02:39.733679 139605135312640 tf_logging.py:159] global_step/sec: 0

mpijob-test-launcher-vmtsv:49883:50195 [0] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
mpijob-test-launcher-vmtsv:49883:50195 [0] INFO Using internal Network Socket
mpijob-test-launcher-vmtsv:49883:50195 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
mpijob-test-launcher-vmtsv:49883:50195 [0] INFO NET : Using interface eth0:10.99.111.171<0>
mpijob-test-launcher-vmtsv:49883:50195 [0] INFO NET/Socket : 1 interfaces found
NCCL version 2.2.13+cuda9.0
Unexpected end of /proc/mounts line `overlay / overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/FMIY6JLTROGPGHIFC7VHST3MHN:/var/lib/docker/overlay2/l/YIYFHRBGEOGXJWNIZ3OLLJOMBR:/var/lib/docker/overlay2/l/OZZABZU3E2ELGTNKNU3YLU5FCC:/var/lib/docker/overlay2/l/SNRAK6TOCQRYSYU5GG5MCPP4AF:/var/lib/docker/overlay2/l/7HCTRLVOEQNUXZ22TKCO6EYHSU:/var/lib/docker/overlay2/l/VX2SKSONWB27MNUXZMZTDDWPDE:/var/lib/docker/overlay2/l/5ASBPXQNZYC5U6M2QLCRAO6ULJ:/var/lib/docker/overlay2/l/JAPXZZR2AP4OUXKUL4KTW7UJMX:/var/lib/docker/overlay2/l/XYPXTWN44E5YS'
Unexpected end of /proc/mounts line `2UXTCC4BI65QH:/var/lib/docker/overlay2/l/JRYL3A7FJKRGSON7FYPQTKMJMJ:/var/lib/docker/overlay2/l/UNTKY2BMZ7Y7L2KDXXDFMS3XGE:/var/lib/docker/overlay2/l/PC272ZPTB45AJDSFZE7LFGQLVO:/var/lib/docker/overlay2/l/LRIVCU6DRR76TETA6ZFOR5DFPM:/var/lib/docker/overlay2/l/VC27DA6K7IB6MOKJWG6QQD4HKT:/var/lib/docker/overlay2/l/IJEOCPPNN4HTBYT5Q6RMON3G3F:/var/lib/docker/overlay2/l/RLKAMT7UCGR3TJJUAQ2MKK65DB:/var/lib/docker/overlay2/l/DLRJCUNYRCATROJAX37IPWGTV6:/var/lib/docker/overlay2/l/SMIZYNU6HOKWGVVAKBMCKKEYPJ:/var/lib/do'
Unexpected end of /proc/mounts line `cker/overlay2/l/NIV2T5HMKEZBMFAIH3WNW2ZBX6:/var/lib/docker/overlay2/l/HBEHBSZLV6KUCVECN3XCQBXEAX:/var/lib/docker/overlay2/l/Q65OIHDU75CDLEF5GJGISUWUZN:/var/lib/docker/overlay2/l/BF2JUOXLTHH7ME6FTU6PGJCX73:/var/lib/docker/overlay2/l/V5FSXPLW7S733S7GHZQOKYYWEI:/var/lib/docker/overlay2/l/XYY3HCTQXSF62ZPFQK7XS5TVQ4:/var/lib/docker/overlay2/l/ZYWSF4DVUF3ZWDGOZUXRUTCAKH,upperdir=/var/lib/docker/overlay2/738067e3a663a95b9434b2055dc7e282771403aab15ffe3fd9799905cd7c8096/diff,workdir=/var/lib/docker/overlay2/738067e'
mpijob-test-launcher-vmtsv:49883:50195 [0] INFO comm 0x7efb582f30a0 rank 0 nranks 1
mpijob-test-launcher-vmtsv:49883:50195 [0] INFO Using 256 threads
mpijob-test-launcher-vmtsv:49883:50195 [0] INFO Min Comp Cap 6
mpijob-test-launcher-vmtsv:49883:50195 [0] INFO NCCL_SINGLE_RING_THRESHOLD=131072
TensorFlow:  1.10
Model:       resnet101
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  32 global
             32.0 per device
Num batches: 100
Num epochs:  0.00
Devices:     ['horovod/gpu:0']
Data format: NCHW
Optimizer:   sgd
Variables:   horovod
==========
Generating model
Running warm up
Done warm up
Step    Img/sec total_loss
1   images/sec: 82.7 +/- 0.0 (jitter = 0.0) 9.034
10  images/sec: 46.6 +/- 6.1 (jitter = 2.5) 9.195
20  images/sec: 46.0 +/- 4.5 (jitter = 3.1) 9.146
30  images/sec: 45.9 +/- 3.5 (jitter = 1.9) 9.341
40  images/sec: 46.9 +/- 3.5 (jitter = 2.7) 9.398
50  images/sec: 49.4 +/- 4.4 (jitter = 3.9) 9.055
60  images/sec: 48.8 +/- 3.8 (jitter = 3.1) 9.127
70  images/sec: 48.4 +/- 3.4 (jitter = 3.0) 9.065
80  images/sec: 47.8 +/- 3.1 (jitter = 2.8) 9.034
90  images/sec: 47.8 +/- 2.9 (jitter = 3.1) 9.018
100 images/sec: 47.6 +/- 2.6 (jitter = 3.0) 9.126
----------------------------------------------------------------
total images/sec: 47.38
----------------------------------------------------------------
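
One quick signal in a log like this is the nranks value on the NCCL "comm ... rank N nranks M" line: a value of 1 means the communicator was created with a single rank, which matches the single device listed under Devices. A minimal sketch for pulling that value out of a saved log follows; nccl_nranks is a hypothetical helper name, and it assumes the NCCL INFO line format shown above.

```shell
#!/bin/sh
# nccl_nranks: read training log lines on stdin and print the distinct
# "nranks" values found on NCCL "rank N nranks M" lines.
# (Hypothetical helper; it assumes the INFO line format seen above.)
nccl_nranks() {
    sed -n 's/.* rank [0-9][0-9]* nranks \([0-9][0-9]*\).*/\1/p' | LC_ALL=C sort -u
}

# Usage against a live cluster (requires kubectl access):
#   kubectl logs mpijob-test-launcher-h6kdr | nccl_nranks
```
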
xieydd commented 6 years ago

@cheyang Sorry about that; I have now set useHostNetwork: false so the pods use a virtual IP.

However, the same error still appears in the log.

cheyang commented 6 years ago
  1. Host networking adds no network overhead, and in deep learning scenarios port conflicts are not a concern. In our experience, there are always only one or two pods on the same node.
  2. I suspect it is caused by the permissions of the MPI launcher, because I see this error:
    error: You must be logged in to the server (Unauthorized)

    Can you check the output of kubectl get role mpijob-test-launcher -o=yaml?

xieydd commented 6 years ago

@cheyang Here is the output:

[xieyd@ec-0d-9a-20-99-00 ~]$ kubectl get role mpijob-test-launcher -o=yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  creationTimestamp: 2018-08-29T09:24:39Z
  labels:
    app: mpijob-test
  name: mpijob-test-launcher
  namespace: default
  ownerReferences:
  - apiVersion: kubeflow.org/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: MPIJob
    name: mpijob-test
    uid: 5aa7eb9c-ab6d-11e8-b869-ac1f6b252044
  resourceVersion: "2210618"
  selfLink: /apis/rbac.authorization.k8s.io/v1/namespaces/default/roles/mpijob-test-launcher
  uid: 5aae7bbe-ab6d-11e8-b869-ac1f6b252044
rules:
- apiGroups:
  - ""
  resourceNames:
  - mpijob-test-worker-0
  - mpijob-test-worker-1
  resources:
  - pods
  verbs:
  - get
- apiGroups:
  - ""
  resourceNames:
  - mpijob-test-worker-0
  - mpijob-test-worker-1
  resources:
  - pods/exec
  verbs:
  - create
cheyang commented 6 years ago

From the logs, the job launcher can start mpijob-test-worker-1 successfully and is able to run training, but there are communication issues between the master and the node that runs mpijob-test-worker-0:

ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   mpijob-test-launcher-vmtsv
  target node:  mpijob-test-worker-0

I suggest running tail -f /dev/null (a command that keeps the pods alive indefinitely) as the MPIJob command, and then using kubectl exec to check the network connection between the master and the specified node.
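
With the pods kept alive this way, the kubectl exec step that the launcher performs can be reproduced by hand, independent of MPI. The following is a sketch only: check_exec is a hypothetical helper, and the /opt/kube/kubectl path and worker names in the usage comment are taken from the launcher log earlier in this thread.

```shell
#!/bin/sh
# check_exec KUBECTL POD...: run `KUBECTL exec POD -- hostname` for each pod
# and print "ok POD" or "FAIL POD". Returns non-zero if any pod failed.
# (Hypothetical helper, intended to be run inside the launcher pod.)
check_exec() {
    kubectl_bin=$1
    shift
    failed=0
    for pod in "$@"; do
        if "$kubectl_bin" exec "$pod" -- hostname >/dev/null 2>&1; then
            echo "ok $pod"
        else
            echo "FAIL $pod"
            failed=1
        fi
    done
    return "$failed"
}

# Usage inside the launcher pod:
#   check_exec /opt/kube/kubectl mpijob-test-worker-0 mpijob-test-worker-1
```

A FAIL here for only one worker would point at the exec path (for example the Unauthorized error) rather than at MPI routing itself.
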

xieydd commented 6 years ago

I found the error: it is again the MPI Unauthorized error. I don't know why, because I created the ClusterRole, ClusterRoleBinding, and ServiceAccount when I updated mpi-operator.yaml, but the error is still there.

@cheyang

[xieyd@ec-0d-9a-20-99-00 mpi-operator]$ kubectl delete -f mpi-operator.yaml 
customresourcedefinition "mpijobs.kubeflow.org" deleted
clusterrole "mpi-operator" deleted
serviceaccount "mpi-operator" deleted
clusterrolebinding "mpi-operator" deleted
deployment "mpi-operator" deleted
[xieyd@ec-0d-9a-20-99-00 mpi-operator]$ kubectl create -f mpi-operator.yaml 
customresourcedefinition "mpijobs.kubeflow.org" created
clusterrole "mpi-operator" created
serviceaccount "mpi-operator" created
clusterrolebinding "mpi-operator" created
[xieyd@ec-0d-9a-20-99-00 mpi-operator]$ kubectl logs mpi-operator-844cc74bd6-pxkzc   -n arena-system
W0829 12:45:46.299827       1 client_config.go:529] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0829 12:45:46.302060       1 mpi_job_controller.go:150] Creating event broadcaster
I0829 12:45:46.302870       1 mpi_job_controller.go:179] Setting up event handlers
I0829 12:45:46.303054       1 mpi_job_controller.go:297] Starting MPIJob controller
I0829 12:45:46.303064       1 mpi_job_controller.go:300] Waiting for informer caches to sync
I0829 12:45:46.303489       1 reflector.go:202] Starting reflector *v1.Job (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0829 12:45:46.303518       1 reflector.go:240] Listing and watching *v1.Job from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0829 12:45:46.303548       1 reflector.go:202] Starting reflector *v1alpha1.MPIJob (30s) from github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62
I0829 12:45:46.303565       1 reflector.go:240] Listing and watching *v1alpha1.MPIJob from github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62
I0829 12:45:46.303575       1 reflector.go:202] Starting reflector *v1.ServiceAccount (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0829 12:45:46.303581       1 reflector.go:202] Starting reflector *v1.RoleBinding (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0829 12:45:46.303591       1 reflector.go:240] Listing and watching *v1.ServiceAccount from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0829 12:45:46.303582       1 reflector.go:202] Starting reflector *v1.ConfigMap (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0829 12:45:46.303614       1 reflector.go:240] Listing and watching *v1.ConfigMap from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0829 12:45:46.303597       1 reflector.go:240] Listing and watching *v1.RoleBinding from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0829 12:45:46.303689       1 reflector.go:202] Starting reflector *v1.Role (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0829 12:45:46.303716       1 reflector.go:240] Listing and watching *v1.Role from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0829 12:45:46.303744       1 reflector.go:202] Starting reflector *v1.StatefulSet (30s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0829 12:45:46.303771       1 reflector.go:240] Listing and watching *v1.StatefulSet from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
E0829 12:45:46.315683       1 reflector.go:205] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.RoleBinding: Unauthorized
E0829 12:45:46.315749       1 reflector.go:205] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.Job: Unauthorized
E0829 12:45:46.315917       1 reflector.go:205] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.ConfigMap: Unauthorized
E0829 12:45:46.315768       1 reflector.go:205] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.Role: Unauthorized
E0829 12:45:46.315824       1 reflector.go:205] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.StatefulSet: Unauthorized
E0829 12:45:46.315945       1 reflector.go:205] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.ServiceAccount: Unauthorized
E0829 12:45:46.316100       1 reflector.go:205] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Failed to list *v1alpha1.MPIJob: Unauthorized
I0829 12:45:47.316040       1 reflector.go:240] Listing and watching *v1.RoleBinding from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0829 12:45:47.317130       1 reflector.go:240] Listing and watching *v1.Job from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
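
In this log format the first character of each line is the severity (I, W, E), so the failing calls can be pulled out of a saved log quickly. The following is a minimal sketch, assuming the log has been written to a file; the function name `failed_lists` and the sample file are illustrative and not part of arena or mpi-operator:

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarise "Failed to list" errors in a saved operator log.
# Prints one line per (reason, resource kind) pair with a count in front.
failed_lists() {
  grep -E '^E[0-9]{4} ' "$1" \
    | sed -n 's/.*Failed to list \(.*\): \(.*\)$/\2 \1/p' \
    | sort | uniq -c
}

# Sample lines copied from the log above, written to a temporary file.
log_file="$(mktemp)"
cat > "$log_file" <<'EOF'
I0829 12:45:46.303771       1 reflector.go:240] Listing and watching *v1.StatefulSet from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
E0829 12:45:46.315683       1 reflector.go:205] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.RoleBinding: Unauthorized
E0829 12:45:46.315749       1 reflector.go:205] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.Job: Unauthorized
EOF

failed_lists "$log_file"
```

Every list call in the log above fails with the same reason, Unauthorized, which suggests the API server is rejecting the operator's credentials rather than one RBAC rule being missing. That is an inference from the log, not something confirmed in this thread.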
cheyang commented 6 years ago

@xieydd May I know your email address? I'd like to send you an email.

xieydd commented 6 years ago

@cheyang xieydd@gmail.com Thanks a lot.

xieydd commented 6 years ago

When I create an MPI job, the launcher pod doesn't start.

@cheyang

mj-mpijob-56dfffc49c-pvp6z          1/1       Running   0          4m        0/0        10.99.147.182   ec-0d-9a-d9-bf-52
mj-mpijob-worker-0                  1/1       Running   0          4m        1/1        10.99.147.181   ec-0d-9a-d9-bf-52
mj-mpijob-worker-1                  1/1       Running   0          4m        1/1        10.99.224.104   ec-0d-9a-d9-96-c2
xieydd commented 6 years ago

@denverdino Could you help me?

cheyang commented 6 years ago

@xieydd, I sent you an email. Please check.

Eric-Zhang1990 commented 5 years ago

@cheyang @xieydd When I run the example:

$ arena submit mpi --name=mpi-dist \
    --gpus=1 \
    --workers=2 \
    --image=uber/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
    --env=GIT_SYNC_BRANCH=cnn_tf_v1.9_compatible \
    --syncMode=git \
    --syncSource=https://github.com/tensorflow/benchmarks.git \
    --tensorboard \
    "mpirun python code/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training_logs --summary_verbosity=3 --save_summaries_steps=10"

I get the following state: [screenshot 20190125092643]

Why can't 'mpi-dist-mpijob-worker-0' and 'mpi-dist-mpijob-worker-1' run? Please help me solve this. Thanks.

[screenshots 20190125092943, 20190125093014, 20190125093033]

cheyang commented 5 years ago

You can debug by using arena get mpi-dist -e. It will show pending events.

Eric-Zhang1990 commented 5 years ago

@cheyang After I ran that command, I got the following info. (Note: I run arena on 192.168.110.25, which is the k8s master; 192.168.110.158 is a node.)

[root@k8s-master arena]# arena get mpi-dist -e --namespace arena-system
mpi-dist mpi-dist mpi-dist
NAME STATUS TRAINER AGE INSTANCE NODE
mpi-dist PENDING MPIJOB 1m mpi-dist-mpijob-worker-0 N/A
mpi-dist RUNNING MPIJOB 1m mpi-dist-mpijob-worker-1 192.168.110.158

Your tensorboard will be available on: 192.168.110.25:31268

Events:
INSTANCE TYPE AGE MESSAGE
mpi-dist-mpijob-worker-0 Normal 17m [Killing] Killing container with id docker://mpi:Need to kill Pod
mpi-dist-mpijob-worker-0 Normal 11m [Scheduled] Successfully assigned arena-system/mpi-dist-mpijob-worker-0 to 192.168.110.25
mpi-dist-mpijob-worker-0 Normal 11m [Pulled] Container image "registry.cn-zhangjiakou.aliyuncs.com/tensorflow-samples/git-sync:v2.0.6" already present on machine
mpi-dist-mpijob-worker-0 Normal 11m [Created] Created container
mpi-dist-mpijob-worker-0 Normal 11m [Started] Started container
mpi-dist-mpijob-worker-0 Warning 11m [BackOff] Back-off restarting failed container
mpi-dist-mpijob-worker-0 Normal 4m [Scheduled] Successfully assigned arena-system/mpi-dist-mpijob-worker-0 to 192.168.110.25
mpi-dist-mpijob-worker-0 Normal 4m [Pulled] Container image "registry.cn-zhangjiakou.aliyuncs.com/tensorflow-samples/git-sync:v2.0.6" already present on machine
mpi-dist-mpijob-worker-0 Normal 4m [Created] Created container
mpi-dist-mpijob-worker-0 Normal 4m [Started] Started container
mpi-dist-mpijob-worker-0 Warning 3m [BackOff] Back-off restarting failed container
mpi-dist-mpijob-worker-0 Normal 1m [Scheduled] Successfully assigned arena-system/mpi-dist-mpijob-worker-0 to 192.168.110.25
mpi-dist-mpijob-worker-0 Normal 1m [Pulled] Container image "registry.cn-zhangjiakou.aliyuncs.com/tensorflow-samples/git-sync:v2.0.6" already present on machine
mpi-dist-mpijob-worker-0 Normal 1m [Created] Created container
mpi-dist-mpijob-worker-0 Normal 1m [Started] Started container
mpi-dist-mpijob-worker-0 Warning 1m [BackOff] Back-off restarting failed container
mpi-dist-mpijob-worker-0 Normal 6m [Scheduled] Successfully assigned default/mpi-dist-mpijob-worker-0 to 192.168.110.25
mpi-dist-mpijob-worker-0 Normal 6m [Pulled] Container image "registry.cn-zhangjiakou.aliyuncs.com/tensorflow-samples/git-sync:v2.0.6" already present on machine
mpi-dist-mpijob-worker-0 Normal 6m [Created] Created container
mpi-dist-mpijob-worker-0 Normal 6m [Started] Started container
mpi-dist-mpijob-worker-0 Warning 6m [BackOff] Back-off restarting failed container
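
The repeated BackOff warnings above mean that a container in the pod keeps exiting and being restarted; the events alone do not say why. The following is a minimal sketch for isolating the warnings, assuming the event table has been saved to a text file in the INSTANCE TYPE AGE MESSAGE layout shown above. The function name `warning_events` and the sample file are illustrative and not part of arena:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print only the Warning rows of an event table
# whose columns are INSTANCE TYPE AGE MESSAGE.
warning_events() {
  awk '$2 == "Warning"' "$1"
}

# Sample rows copied from the events above, written to a temporary file.
events_file="$(mktemp)"
cat > "$events_file" <<'EOF'
mpi-dist-mpijob-worker-0 Normal 1m [Created] Created container
mpi-dist-mpijob-worker-0 Normal 1m [Started] Started container
mpi-dist-mpijob-worker-0 Warning 1m [BackOff] Back-off restarting failed container
EOF

warning_events "$events_file"
```

Once the failing pod is known, `kubectl describe pod mpi-dist-mpijob-worker-0 -n arena-system` and `kubectl logs mpi-dist-mpijob-worker-0 -n arena-system -c <container> --previous` are the usual next steps for finding the exit reason. The container name does not appear in this thread, so it is left as a placeholder.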

[screenshot 20190125140618]

Eric-Zhang1990 commented 5 years ago

@cheyang I ran kubectl logs -n arena-system mpi-operator-f49774cdc-bb2q8 &> /tmp/mpi-operator.log and got:

W0125 03:11:23.505801 1 client_config.go:529] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work. I0125 03:11:23.509887 1 mpi_job_controller.go:150] Creating event broadcaster I0125 03:11:23.510170 1 mpi_job_controller.go:179] Setting up event handlers I0125 03:11:23.510485 1 mpi_job_controller.go:297] Starting MPIJob controller I0125 03:11:23.510539 1 mpi_job_controller.go:300] Waiting for informer caches to sync I0125 03:11:23.510771 1 reflector.go:202] Starting reflector v1.RoleBinding (0s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:11:23.510843 1 reflector.go:240] Listing and watching v1.RoleBinding from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:11:23.511399 1 reflector.go:202] Starting reflector v1alpha1.MPIJob (0s) from github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62 I0125 03:11:23.511417 1 reflector.go:240] Listing and watching v1alpha1.MPIJob from github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62 I0125 03:11:23.511872 1 reflector.go:202] Starting reflector v1.StatefulSet (0s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:11:23.511895 1 reflector.go:240] Listing and watching v1.StatefulSet from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:11:23.512268 1 reflector.go:202] Starting reflector v1.Job (0s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:11:23.512286 1 reflector.go:240] Listing and watching v1.Job from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:11:23.512691 1 reflector.go:202] Starting reflector v1.ConfigMap (0s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:11:23.512708 1 reflector.go:240] Listing and 
watching v1.ConfigMap from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:11:23.513327 1 reflector.go:202] Starting reflector v1.ServiceAccount (0s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:11:23.513352 1 reflector.go:240] Listing and watching v1.ServiceAccount from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:11:23.514461 1 reflector.go:202] Starting reflector v1.Role (0s) from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:11:23.514497 1 reflector.go:240] Listing and watching v1.Role from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:11:23.535022 1 mpi_job_controller.go:726] Processing object: ffdl-lcm I0125 03:11:23.535049 1 mpi_job_controller.go:726] Processing object: default I0125 03:11:23.535072 1 mpi_job_controller.go:726] Processing object: kube-batchd I0125 03:11:23.535094 1 mpi_job_controller.go:726] Processing object: tille I0125 03:11:23.535130 1 mpi_job_controller.go:726] Processing object: default I0125 03:11:23.535162 1 mpi_job_controller.go:726] Processing object: default I0125 03:11:23.535184 1 mpi_job_controller.go:726] Processing object: kube-dns I0125 03:11:23.535207 1 mpi_job_controller.go:726] Processing object: tiller I0125 03:11:23.535238 1 mpi_job_controller.go:726] Processing object: tiller I0125 03:11:23.535274 1 mpi_job_controller.go:726] Processing object: default I0125 03:11:23.535292 1 mpi_job_controller.go:726] Processing object: tf-job-operator I0125 03:11:23.535310 1 mpi_job_controller.go:726] Processing object: dashboard I0125 03:11:23.535333 1 mpi_job_controller.go:726] Processing object: mpi-operator I0125 03:11:23.535364 1 mpi_job_controller.go:726] Processing object: tf-job-dashboard I0125 03:11:23.535382 1 mpi_job_controller.go:726] Processing object: jobmon I0125 03:11:23.539961 1 
mpi_job_controller.go:726] Processing object: dashboard-default I0125 03:11:23.539988 1 mpi_job_controller.go:726] Processing object: system:controller:bootstrap-signer I0125 03:11:23.540006 1 mpi_job_controller.go:726] Processing object: system:controller:cloud-provider I0125 03:11:23.540017 1 mpi_job_controller.go:726] Processing object: system:controller:token-cleaner I0125 03:11:23.540033 1 mpi_job_controller.go:726] Processing object: system::leader-locking-kube-controller-manager I0125 03:11:23.540051 1 mpi_job_controller.go:726] Processing object: system::leader-locking-kube-scheduler I0125 03:11:23.540078 1 mpi_job_controller.go:726] Processing object: system:controller:bootstrap-signer I0125 03:11:23.540101 1 mpi_job_controller.go:726] Processing object: tiller-binding I0125 03:11:23.543963 1 mpi_job_controller.go:726] Processing object: mongo I0125 03:11:23.543988 1 mpi_job_controller.go:726] Processing object: storage I0125 03:11:23.544250 1 mpi_job_controller.go:726] Processing object: extension-apiserver-authentication-reader I0125 03:11:23.544282 1 mpi_job_controller.go:726] Processing object: system::leader-locking-kube-controller-manager I0125 03:11:23.544319 1 mpi_job_controller.go:726] Processing object: system:controller:token-cleaner I0125 03:11:23.544359 1 mpi_job_controller.go:726] Processing object: system:controller:bootstrap-signer I0125 03:11:23.544427 1 mpi_job_controller.go:726] Processing object: system::leader-locking-kube-scheduler I0125 03:11:23.544454 1 mpi_job_controller.go:726] Processing object: system:controller:bootstrap-signer I0125 03:11:23.544495 1 mpi_job_controller.go:726] Processing object: system:controller:cloud-provider I0125 03:11:23.544522 1 mpi_job_controller.go:726] Processing object: tiller-manager I0125 03:11:23.873705 1 mpi_job_controller.go:726] Processing object: kube-system.v1 I0125 03:11:23.873750 1 mpi_job_controller.go:726] Processing object: vck.v1 I0125 03:11:23.873795 1 mpi_job_controller.go:726] 
Processing object: viable-donkey.v1 I0125 03:11:23.873829 1 mpi_job_controller.go:726] Processing object: inky-turkey.v1 I0125 03:11:23.873856 1 mpi_job_controller.go:726] Processing object: eager-toucan.v1 I0125 03:11:23.873879 1 mpi_job_controller.go:726] Processing object: honest-cheetah.v1 I0125 03:11:23.873910 1 mpi_job_controller.go:726] Processing object: laughing-seastar.v1 I0125 03:11:23.873928 1 mpi_job_controller.go:726] Processing object: maudlin-ibis.v1 I0125 03:11:23.873955 1 mpi_job_controller.go:726] Processing object: ornery-sasquatch.v1 I0125 03:11:23.874000 1 mpi_job_controller.go:726] Processing object: static-volumes-v2 I0125 03:11:23.874032 1 mpi_job_controller.go:726] Processing object: ulterior-ferret.v1 I0125 03:11:23.874054 1 mpi_job_controller.go:726] Processing object: statsd-exporter-configmap I0125 03:11:23.874104 1 mpi_job_controller.go:726] Processing object: kissing-emu.v1 I0125 03:11:23.874149 1 mpi_job_controller.go:726] Processing object: kube-dns I0125 03:11:23.874167 1 mpi_job_controller.go:726] Processing object: tf-job-operator-config I0125 03:11:23.874194 1 mpi_job_controller.go:726] Processing object: factual-buffoon.v1 I0125 03:11:23.874230 1 mpi_job_controller.go:726] Processing object: lumbering-horse.v1 I0125 03:11:23.874257 1 mpi_job_controller.go:726] Processing object: quarrelsome-elephant.v1 I0125 03:11:23.874297 1 mpi_job_controller.go:726] Processing object: learner-config I0125 03:11:23.874347 1 mpi_job_controller.go:726] Processing object: elevated-bee.v1 I0125 03:11:23.874383 1 mpi_job_controller.go:726] Processing object: nuanced-platypus.v1 I0125 03:11:23.874410 1 mpi_job_controller.go:726] Processing object: zooming-quokka.v1 I0125 03:11:23.874446 1 mpi_job_controller.go:726] Processing object: prometheus-alertrules I0125 03:11:23.874468 1 mpi_job_controller.go:726] Processing object: learner-entrypoint-files I0125 03:11:23.874491 1 mpi_job_controller.go:726] Processing object: torpid-pronghorn.v1 I0125 
03:11:23.874509 1 mpi_job_controller.go:726] Processing object: unsung-garfish.v1 I0125 03:11:23.874550 1 mpi_job_controller.go:726] Processing object: prometheus I0125 03:11:23.874576 1 mpi_job_controller.go:726] Processing object: quaffing-sparrow.v1 I0125 03:11:23.874604 1 mpi_job_controller.go:726] Processing object: terrific-jackal.v1 I0125 03:11:23.874631 1 mpi_job_controller.go:726] Processing object: static-volumes-v2 I0125 03:11:23.874653 1 mpi_job_controller.go:726] Processing object: invisible-dragon.v1 I0125 03:11:23.874711 1 mpi_job_controller.go:726] Processing object: jazzy-tuatara.v1 I0125 03:11:23.874748 1 mpi_job_controller.go:726] Processing object: juiced-beetle.v1 I0125 03:11:23.874779 1 mpi_job_controller.go:726] Processing object: garish-guppy.v1 I0125 03:11:23.874874 1 mpi_job_controller.go:726] Processing object: goodly-newt.v1 I0125 03:11:23.874941 1 mpi_job_controller.go:726] Processing object: lumpy-lionfish.v1 I0125 03:11:23.874986 1 mpi_job_controller.go:726] Processing object: extension-apiserver-authentication I0125 03:11:23.875018 1 mpi_job_controller.go:726] Processing object: cantankerous-meerkat.v1 I0125 03:11:23.875049 1 mpi_job_controller.go:726] Processing object: flailing-raccoon.v1 I0125 03:11:23.875090 1 mpi_job_controller.go:726] Processing object: hopping-mongoose.v1 I0125 03:11:23.875130 1 mpi_job_controller.go:726] Processing object: nonexistent-goat.v1 I0125 03:11:23.875153 1 mpi_job_controller.go:726] Processing object: turbulent-hummingbird.v1 I0125 03:11:23.875184 1 mpi_job_controller.go:726] Processing object: coy-mastiff.v1 I0125 03:11:23.875216 1 mpi_job_controller.go:726] Processing object: undercooked-mite.v1 I0125 03:11:23.875238 1 mpi_job_controller.go:726] Processing object: clunky-goat.v1 I0125 03:11:23.875292 1 mpi_job_controller.go:726] Processing object: static-volumes I0125 03:11:23.875315 1 mpi_job_controller.go:726] Processing object: eponymous-umbrellabird.v1 I0125 03:11:23.875378 1 
mpi_job_controller.go:726] Processing object: prometheus-alertmanager I0125 03:11:23.875405 1 mpi_job_controller.go:726] Processing object: brown-wildebeest.v1 I0125 03:11:23.875427 1 mpi_job_controller.go:726] Processing object: falling-kudu.v1 I0125 03:11:23.875445 1 mpi_job_controller.go:726] Processing object: static-volumes I0125 03:11:23.875463 1 mpi_job_controller.go:726] Processing object: bailing-orangutan.v1 I0125 03:11:23.875490 1 mpi_job_controller.go:726] Processing object: littering-woodpecker.v1 I0125 03:11:23.875513 1 mpi_job_controller.go:726] Processing object: lolling-manta.v1 I0125 03:11:23.910895 1 shared_informer.go:122] caches populated I0125 03:11:23.910949 1 mpi_job_controller.go:305] Starting workers I0125 03:11:23.911008 1 mpi_job_controller.go:311] Started workers E0125 03:11:57.278512 1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=27, ErrCode=NO_ERROR, debug="" I0125 03:11:57.279565 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received E0125 03:11:57.280649 1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=27, ErrCode=NO_ERROR, debug="" I0125 03:11:57.280874 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received E0125 03:11:57.280923 1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=27, ErrCode=NO_ERROR, debug="" I0125 03:11:57.281040 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received E0125 03:11:57.281830 1 streamwatcher.go:109] Unable to decode an event from the watch stream: 
http2: server sent GOAWAY and closed the connection; LastStreamID=27, ErrCode=NO_ERROR, debug="" E0125 03:11:57.281965 1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=27, ErrCode=NO_ERROR, debug="" I0125 03:11:57.282055 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received I0125 03:11:57.282154 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received E0125 03:11:57.282909 1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=27, ErrCode=NO_ERROR, debug="" I0125 03:11:57.283063 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received E0125 03:11:57.283278 1 reflector.go:322] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Failed to watch v1alpha1.MPIJob: Get https://10.254.0.1:443/apis/kubeflow.org/v1alpha1/mpijobs?resourceVersion=888614&timeoutSeconds=454&watch=true: dial tcp 10.254.0.1:443: connect: connection refused E0125 03:11:57.283674 1 reflector.go:322] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to watch v1.Job: Get https://10.254.0.1:443/apis/batch/v1/jobs?resourceVersion=890102&timeoutSeconds=329&watch=true: dial tcp 10.254.0.1:443: connect: connection refused E0125 03:11:57.281843 1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=27, ErrCode=NO_ERROR, debug="" I0125 03:11:57.283872 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items 
received E0125 03:11:57.283980 1 reflector.go:322] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to watch v1.StatefulSet: Get https://10.254.0.1:443/apis/apps/v1/statefulsets?resourceVersion=891251&timeoutSeconds=346&watch=true: dial tcp 10.254.0.1:443: connect: connection refused E0125 03:11:57.284151 1 reflector.go:322] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to watch v1.ServiceAccount: Get https://10.254.0.1:443/api/v1/serviceaccounts?resourceVersion=888619&timeoutSeconds=390&watch=true: dial tcp 10.254.0.1:443: connect: connection refused E0125 03:11:57.284394 1 reflector.go:322] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to watch v1.RoleBinding: Get https://10.254.0.1:443/apis/rbac.authorization.k8s.io/v1/rolebindings?resourceVersion=888621&timeoutSeconds=364&watch=true: dial tcp 10.254.0.1:443: connect: connection refused E0125 03:11:57.284649 1 reflector.go:322] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to watch v1.ConfigMap: Get https://10.254.0.1:443/api/v1/configmaps?resourceVersion=890126&timeoutSeconds=544&watch=true: dial tcp 10.254.0.1:443: connect: connection refused E0125 03:11:57.285082 1 reflector.go:322] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to watch v1.Role: Get https://10.254.0.1:443/apis/rbac.authorization.k8s.io/v1/roles?resourceVersion=888620&timeoutSeconds=414&watch=true: dial tcp 10.254.0.1:443: connect: connection refused I0125 03:11:58.283996 1 reflector.go:240] Listing and watching v1alpha1.MPIJob from github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62 I0125 03:11:58.285132 1 reflector.go:240] Listing and watching v1.Job from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:11:58.286245 1 reflector.go:240] Listing and watching 
v1.StatefulSet from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:11:58.287296 1 reflector.go:240] Listing and watching v1.ServiceAccount from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:11:58.288441 1 reflector.go:240] Listing and watching v1.RoleBinding from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:11:58.289654 1 reflector.go:240] Listing and watching v1.ConfigMap from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:11:58.290875 1 reflector.go:240] Listing and watching v1.Role from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86 I0125 03:13:16.945226 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1 I0125 03:13:17.030641 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1 I0125 03:13:17.115647 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-config I0125 03:13:17.152655 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 03:13:17.225291 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 03:13:17.304548 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 03:13:17.347139 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 03:13:17.374148 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 03:13:17.385180 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 03:13:17.385535 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"28c68bdc-204f-11e9-93ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"891601", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 03:13:17.392194 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 
03:13:17.392271 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"28c68bdc-204f-11e9-93ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"891622", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 03:13:17.397979 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 03:13:17.413323 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 03:13:17.413467 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"28c68bdc-204f-11e9-93ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"891622", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 03:13:51.138803 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 03:13:51.159267 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 03:13:51.159434 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"28c68bdc-204f-11e9-93ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"891622", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 03:13:51.170584 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 03:13:51.170638 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"28c68bdc-204f-11e9-93ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"891734", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 03:18:08.118511 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 3 items received I0125 03:18:12.832449 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close 
- v1alpha1.MPIJob total 3 items received I0125 03:18:31.776041 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 2 items received I0125 03:18:34.785574 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 1 items received I0125 03:18:42.766330 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 1 items received I0125 03:19:27.777500 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received I0125 03:20:30.788935 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 3 items received I0125 03:24:56.121071 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received I0125 03:25:01.778951 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received I0125 03:25:11.766910 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received I0125 03:26:03.823851 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received I0125 03:26:31.784810 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received I0125 03:27:49.774632 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received I0125 03:28:12.776145 1 
reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received
I0125 03:31:11.841702 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received
I0125 03:32:18.810700 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received
I0125 03:32:48.791880 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received
I0125 03:33:30.796087 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received
I0125 03:34:15.145099 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received
I0125 03:34:30.515231 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker
I0125 03:34:30.534176 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob'
I0125 03:34:30.534324 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"28c68bdc-204f-11e9-93ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"891734", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully
I0125 03:35:51.801910 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received
I0125 03:36:34.860422 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received
I0125 03:38:04.806577 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received
I0125 03:38:05.805344 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received
I0125 03:40:16.833541 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 1 items received
I0125 03:40:45.167883 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received
I0125 03:41:42.827339 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received
I0125 03:41:57.823059 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received
I0125 03:44:16.884442 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received
I0125 03:44:28.817942 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received
I0125 03:45:47.821219 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received
I0125 03:47:22.836303 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received
I0125 03:47:58.823310 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received
I0125 03:48:21.818448 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received
I0125 03:48:24.167674 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received
I0125 03:51:16.815840 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received
I0125 03:53:12.881663 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received
I0125 03:53:50.828659 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received
I0125 03:53:51.847324 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received
I0125 03:55:11.825456 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received
I0125 03:56:39.014595 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received
I0125 03:57:26.833592 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received
I0125 03:58:17.169242 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received
I0125 03:59:18.935041 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received
I0125 04:00:03.858067 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received
I0125 04:01:44.889267 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received
I0125 04:02:14.896443 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received
I0125 04:05:57.882297 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received
I0125 04:06:08.920277 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received
I0125 04:06:18.096011 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received
I0125 04:07:17.258668 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received
I0125 04:07:58.015796 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received
I0125 04:08:52.912618 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received
I0125 04:11:43.920551 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received
I0125 04:12:44.272603 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received
I0125 04:14:21.900919 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received
I0125 04:15:25.036262 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received
I0125 04:15:54.107214 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received
I0125 04:16:01.939574 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received
I0125 04:18:30.930689 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received
I0125 04:19:28.286933 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received
I0125 04:21:29.936994 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received
I0125 04:22:48.913339 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received
I0125 04:23:43.118461 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received
I0125 04:23:58.047598 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received
I0125 04:24:15.942874 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received
I0125 04:26:15.940009 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received
I0125 04:28:14.292401 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received
I0125 04:28:27.918065 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received
I0125 04:28:30.941472 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received
I0125 04:30:51.943274 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received
I0125 04:33:26.038633 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received
I0125 04:33:38.111279 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received
I0125 04:33:52.915156 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received
I0125 04:34:51.934198 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received
I0125 04:36:27.291059 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received
I0125 04:36:50.933273 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received
I0125 04:38:57.947799 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received
I0125 04:39:49.124212 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received
I0125 04:40:16.052476 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received
I0125 04:40:47.946090 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received
I0125 04:41:32.930535 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received
I0125 04:42:38.301045 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received
I0125 04:44:35.947271 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received
I0125 04:44:58.934857 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received
I0125 04:46:13.141053 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received
I0125 04:47:19.068080 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received
I0125 04:47:57.963649 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received
I0125 04:49:39.946606 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received
I0125 04:50:23.322448 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received
I0125 04:52:19.078264 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received
I0125 04:52:42.972563 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received
I0125 04:53:36.955935 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received
I0125 04:55:22.148321 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received
I0125 04:56:37.967168 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received
I0125 04:56:38.949878 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received
I0125 04:57:52.477410 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received
I0125 04:58:44.968228 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received
I0125 04:59:44.981480 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received
I0125 05:00:20.085829 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received
I0125 05:00:22.161406 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received
I0125 05:01:37.978507 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received
I0125 05:05:48.474407 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received
I0125 05:06:12.941281 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received
I0125 05:06:31.966665 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received
I0125 05:07:39.151202 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received
I0125 05:08:48.974960 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received
I0125 05:09:37.971271 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received
I0125 05:09:43.078155 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received
I0125 05:10:55.481294 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received
I0125 05:12:45.978465 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received
I0125 05:15:02.992260 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received
I0125 05:15:25.955314 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received
I0125 05:15:29.989218 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received
I0125 05:15:51.165673 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received
I0125 05:17:40.089633 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received
I0125 05:19:59.501152 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received
I0125 05:20:12.002284 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received
I0125 05:21:13.989105 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received
I0125 05:23:06.963989 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received
I0125 05:24:08.096488 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received
I0125 05:24:36.170896 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received
I0125 05:25:21.987115 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received
I0125 05:25:57.007163 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received
I0125 05:27:59.994137 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received
I0125 05:28:44.505906 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received
I0125 05:30:17.104170 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received
I0125 05:31:23.017841 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received
I0125 05:32:15.973257 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received
I0125 05:32:44.183583 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received
I0125 05:32:51.000239 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received
I0125 05:33:07.010491 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received
I0125 05:35:41.522278 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received
I0125 05:38:13.118047 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received
I0125 05:39:54.002493 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received
I0125 05:40:06.983619 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received
I0125 05:40:40.193032 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received
I0125 05:40:52.021573 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received
I0125 05:41:01.029681 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received
I0125 05:43:08.530287 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received
I0125 05:45:38.991403 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received
I0125 05:45:44.551289 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1
I0125 05:45:44.639232 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher
I0125 05:45:44.639291 1 mpi_job_controller.go:736] ignoring orphaned object '/apis/rbac.authorization.k8s.io/v1/namespaces/arena-system/roles/mpi-dist-mpijob-launcher' of mpi job 'mpi-dist-mpijob'
I0125 05:45:44.639805 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher
I0125 05:45:44.639860 1 mpi_job_controller.go:736] ignoring orphaned object '/api/v1/namespaces/arena-system/serviceaccounts/mpi-dist-mpijob-launcher' of mpi job 'mpi-dist-mpijob'
I0125 05:45:44.643977 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker
I0125 05:45:44.644022 1 mpi_job_controller.go:736] ignoring orphaned object '/apis/apps/v1/namespaces/arena-system/statefulsets/mpi-dist-mpijob-worker' of mpi job 'mpi-dist-mpijob'
I0125 05:45:44.652551 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1
I0125 05:45:44.652596 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-config
I0125 05:45:44.652613 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher
I0125 05:45:44.652658 1 mpi_job_controller.go:736] ignoring orphaned object '/api/v1/namespaces/arena-system/configmaps/mpi-dist-mpijob-config' of mpi job 'mpi-dist-mpijob'
I0125 05:45:44.652672 1 mpi_job_controller.go:736] ignoring orphaned object '/apis/rbac.authorization.k8s.io/v1/namespaces/arena-system/rolebindings/mpi-dist-mpijob-launcher' of mpi job 'mpi-dist-mpijob'
I0125 05:46:40.029599 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 1 items received
I0125 05:47:49.028205 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 1 items received
I0125 05:48:00.123929 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 1 items received
I0125 05:48:53.008160 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received
I0125 05:49:05.637989 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 3 items received
I0125 05:49:34.198679 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 1 items received
E0125 05:49:53.881711 1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=353, ErrCode=NO_ERROR, debug=""
I0125 05:49:53.881972 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received
E0125 05:49:53.881931 1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=353, ErrCode=NO_ERROR, debug=""
I0125 05:49:53.882130 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received
E0125 05:49:53.882940 1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=353, ErrCode=NO_ERROR, debug=""
I0125 05:49:53.883093 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received
E0125 05:49:53.883025 1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=353, ErrCode=NO_ERROR, debug=""
I0125 05:49:53.883197 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 1 items received
E0125 05:49:53.883669 1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=353, ErrCode=NO_ERROR, debug=""
I0125 05:49:53.883750 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received
E0125 05:49:53.883876 1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=353, ErrCode=NO_ERROR, debug=""
I0125 05:49:53.883970 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received
E0125 05:49:53.884614 1 reflector.go:322] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to watch v1.RoleBinding: Get https://10.254.0.1:443/apis/rbac.authorization.k8s.io/v1/rolebindings?resourceVersion=905832&timeoutSeconds=341&watch=true: dial tcp 10.254.0.1:443: connect: connection refused
E0125 05:49:53.884864 1 reflector.go:322] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to watch v1.ServiceAccount: Get https://10.254.0.1:443/api/v1/serviceaccounts?resourceVersion=905828&timeoutSeconds=394&watch=true: dial tcp 10.254.0.1:443: connect: connection refused
E0125 05:49:53.885228 1 reflector.go:322] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Failed to watch v1alpha1.MPIJob: Get https://10.254.0.1:443/apis/kubeflow.org/v1alpha1/mpijobs?resourceVersion=905825&timeoutSeconds=461&watch=true: dial tcp 10.254.0.1:443: connect: connection refused
E0125 05:49:53.885255 1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=353, ErrCode=NO_ERROR, debug=""
I0125 05:49:53.885411 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received
E0125 05:49:53.885578 1 reflector.go:322] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to watch v1.Role: Get https://10.254.0.1:443/apis/rbac.authorization.k8s.io/v1/roles?resourceVersion=905827&timeoutSeconds=471&watch=true: dial tcp 10.254.0.1:443: connect: connection refused
E0125 05:49:53.886032 1 reflector.go:322] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to watch v1.StatefulSet: Get https://10.254.0.1:443/apis/apps/v1/statefulsets?resourceVersion=905829&timeoutSeconds=396&watch=true: dial tcp 10.254.0.1:443: connect: connection refused
E0125 05:49:53.886513 1 reflector.go:322] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to watch v1.Job: Get https://10.254.0.1:443/apis/batch/v1/jobs?resourceVersion=891449&timeoutSeconds=347&watch=true: dial tcp 10.254.0.1:443: connect: connection refused
E0125 05:49:53.886950 1 reflector.go:322] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to watch v1.ConfigMap: Get https://10.254.0.1:443/api/v1/configmaps?resourceVersion=905831&timeoutSeconds=453&watch=true: dial tcp 10.254.0.1:443: connect: connection refused
I0125 05:49:54.885269 1 reflector.go:240] Listing and watching v1.RoleBinding from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0125 05:49:54.886443 1 reflector.go:240] Listing and watching v1.ServiceAccount from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0125 05:49:54.887917 1 reflector.go:240] Listing and watching v1alpha1.MPIJob from github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62
I0125 05:49:54.889366 1 reflector.go:240] Listing and watching v1.Role from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0125 05:49:54.890906 1 reflector.go:240] Listing and watching v1.StatefulSet from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0125 05:49:54.892150 1 reflector.go:240] Listing and watching v1.Job from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0125 05:49:54.893328 1 reflector.go:240] Listing and watching v1.ConfigMap from github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86
I0125 05:51:12.537696 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1
I0125 05:51:12.626163 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1
I0125 05:51:12.639944 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-config
I0125 05:51:12.646946 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher
I0125 05:51:12.666341 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher
I0125 05:51:12.666628 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher
I0125 05:51:12.686556 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher
I0125 05:51:12.703506 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker
I0125 05:51:12.710869 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob'
I0125 05:51:12.710928 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"38a91898-2065-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"906382", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully
I0125 05:51:12.718900 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob'
I0125 05:51:12.719003 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"38a91898-2065-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"906402", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully
I0125 05:51:12.727406 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob'
I0125 05:51:12.727483 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"38a91898-2065-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"906402", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully
I0125 05:51:12.727735 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker
I0125 05:51:12.743644 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob'
I0125 05:51:12.743892 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"38a91898-2065-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"906402", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully
I0125 05:51:44.134862 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker
I0125 05:51:44.154651 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob'
I0125 05:51:44.154836 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"38a91898-2065-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"906402", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully
I0125 05:51:44.165880 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob'
I0125 05:51:44.166032 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"38a91898-2065-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"906507", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully
I0125 05:55:07.720074 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 3 items received
I0125 05:55:34.997246 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1
I0125 05:55:35.088945 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-config
I0125 05:55:35.088990 1 mpi_job_controller.go:736] ignoring orphaned object '/api/v1/namespaces/arena-system/configmaps/mpi-dist-mpijob-config' of mpi job 'mpi-dist-mpijob'
I0125 05:55:35.096724 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1
I0125 05:55:35.099468 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher
I0125 05:55:35.099504 1 mpi_job_controller.go:736] ignoring orphaned object '/apis/rbac.authorization.k8s.io/v1/namespaces/arena-system/roles/mpi-dist-mpijob-launcher' of mpi job 'mpi-dist-mpijob'
I0125 05:55:35.105041 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker
I0125 05:55:35.105091 1 mpi_job_controller.go:736] ignoring orphaned object '/apis/apps/v1/namespaces/arena-system/statefulsets/mpi-dist-mpijob-worker' of mpi job 'mpi-dist-mpijob'
I0125 05:55:35.106206 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher
I0125 05:55:35.106256 1 mpi_job_controller.go:736] ignoring orphaned object '/api/v1/namespaces/arena-system/serviceaccounts/mpi-dist-mpijob-launcher' of mpi job 'mpi-dist-mpijob'
I0125 05:55:35.106506 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher
I0125 05:55:35.106549 1 mpi_job_controller.go:736] ignoring orphaned object '/apis/rbac.authorization.k8s.io/v1/namespaces/arena-system/rolebindings/mpi-dist-mpijob-launcher' of mpi job 'mpi-dist-mpijob'
I0125 05:56:24.952804 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1
I0125 05:56:25.021298 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1
I0125 05:56:25.034200 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-config
I0125 05:56:25.041586 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher
I0125 05:56:25.060926 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher
I0125 05:56:25.063399 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher
I0125 05:56:25.080468 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher
I0125 05:56:25.090011 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker
I0125 05:56:25.097658 1 mpi_job_controller.go:367] Successfully synced 'default/mpi-dist-mpijob'
I0125 05:56:25.097893 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"default", Name:"mpi-dist-mpijob", UID:"f2dd4c5f-2065-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"907049", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully
E0125 05:56:25.103378 1 mpi_job_controller.go:372] error syncing 'default/mpi-dist-mpijob': Operation cannot be fulfilled on mpijobs.kubeflow.org "mpi-dist-mpijob": the object has been modified; please apply your changes to the latest version and try again
I0125 05:56:25.110479 1 mpi_job_controller.go:367] Successfully synced 'default/mpi-dist-mpijob'
I0125 05:56:25.110546 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"default", Name:"mpi-dist-mpijob", UID:"f2dd4c5f-2065-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"907070", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully
I0125 05:56:25.118472 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker
I0125 05:56:25.130080 1 mpi_job_controller.go:367] Successfully synced 'default/mpi-dist-mpijob'
I0125 05:56:25.130211 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"default", Name:"mpi-dist-mpijob", UID:"f2dd4c5f-2065-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"907070", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully
I0125 05:57:00.839807 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker
I0125 05:57:00.862215 1 mpi_job_controller.go:367] Successfully synced 'default/mpi-dist-mpijob'
I0125 05:57:00.862305 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"default", Name:"mpi-dist-mpijob", UID:"f2dd4c5f-2065-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"907070", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully
I0125 05:57:00.874169 1 mpi_job_controller.go:367] Successfully synced 'default/mpi-dist-mpijob'
I0125 05:57:00.874261 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"default", Name:"mpi-dist-mpijob", UID:"f2dd4c5f-2065-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"907183", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully
I0125 05:57:41.376102 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 7 items received
I0125 05:58:01.724872 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1
I0125 05:58:01.808136 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher
I0125 05:58:01.808190 1 mpi_job_controller.go:736] ignoring orphaned object '/apis/rbac.authorization.k8s.io/v1/namespaces/default/roles/mpi-dist-mpijob-launcher' of mpi job 'mpi-dist-mpijob'
I0125 05:58:01.810033 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1
I0125 05:58:01.810080 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-config
I0125 05:58:01.810107 1 mpi_job_controller.go:736] ignoring orphaned object '/api/v1/namespaces/default/configmaps/mpi-dist-mpijob-config' of mpi job 'mpi-dist-mpijob'
I0125 05:58:01.818343 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher
I0125 05:58:01.818388 1 mpi_job_controller.go:736] ignoring orphaned object '/apis/rbac.authorization.k8s.io/v1/namespaces/default/rolebindings/mpi-dist-mpijob-launcher' of mpi job 'mpi-dist-mpijob'
I0125 05:58:01.819158 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher
I0125 05:58:01.819203 1 mpi_job_controller.go:736] ignoring orphaned object '/api/v1/namespaces/default/serviceaccounts/mpi-dist-mpijob-launcher' of mpi job 'mpi-dist-mpijob'
I0125 05:58:01.824695 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker
I0125 05:58:01.824740 1 mpi_job_controller.go:736] ignoring orphaned object '/apis/apps/v1/namespaces/default/statefulsets/mpi-dist-mpijob-worker' of mpi job 'mpi-dist-mpijob'
I0125 05:58:14.440547 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 8 items received
I0125 05:58:19.375052 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 6 items received
I0125 05:58:20.377577 1 reflector.go:428] 
github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received I0125 05:58:29.370781 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 4 items received I0125 05:58:38.381310 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 4 items received I0125 05:58:48.037555 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1 I0125 05:58:48.104704 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1 I0125 05:58:48.115972 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-config I0125 05:58:48.129883 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 05:58:48.149367 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 05:58:48.150597 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 05:58:48.172131 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 05:58:48.178713 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 05:58:48.185522 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 05:58:48.185698 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"48260341-2066-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"907409", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully E0125 05:58:48.192252 1 mpi_job_controller.go:372] error syncing 'arena-system/mpi-dist-mpijob': Operation cannot be fulfilled on mpijobs.kubeflow.org "mpi-dist-mpijob": the object has been modified; please apply your changes to the latest version and try again I0125 05:58:48.197605 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 05:58:48.197668 1 
event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"48260341-2066-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"907431", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 05:58:48.201800 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 05:58:48.212278 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 05:58:48.212368 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"48260341-2066-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"907431", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 06:00:04.344930 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1 I0125 06:00:04.437639 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 06:00:04.437684 1 mpi_job_controller.go:736] ignoring orphaned object '/apis/rbac.authorization.k8s.io/v1/namespaces/arena-system/roles/mpi-dist-mpijob-launcher' of mpi job 'mpi-dist-mpijob' I0125 06:00:04.437998 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1 I0125 06:00:04.438043 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-config I0125 06:00:04.438088 1 mpi_job_controller.go:736] ignoring orphaned object '/api/v1/namespaces/arena-system/configmaps/mpi-dist-mpijob-config' of mpi job 'mpi-dist-mpijob' I0125 06:00:04.447608 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 06:00:04.447653 1 mpi_job_controller.go:736] ignoring orphaned object '/apis/apps/v1/namespaces/arena-system/statefulsets/mpi-dist-mpijob-worker' of mpi job 'mpi-dist-mpijob' I0125 06:00:04.447689 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 06:00:04.447725 1 mpi_job_controller.go:736] ignoring orphaned object 
'/apis/rbac.authorization.k8s.io/v1/namespaces/arena-system/rolebindings/mpi-dist-mpijob-launcher' of mpi job 'mpi-dist-mpijob' I0125 06:00:04.448721 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 06:00:04.448757 1 mpi_job_controller.go:736] ignoring orphaned object '/api/v1/namespaces/arena-system/serviceaccounts/mpi-dist-mpijob-launcher' of mpi job 'mpi-dist-mpijob' I0125 06:00:16.728595 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 15 items received I0125 06:01:16.049382 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1 I0125 06:01:16.110685 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1 I0125 06:01:16.126099 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-config I0125 06:01:16.132674 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 06:01:16.151933 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 06:01:16.153260 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 06:01:16.169771 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 06:01:16.181272 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 06:01:16.188885 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 06:01:16.189078 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"a05e4c33-2066-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"907753", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 06:01:16.194969 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 06:01:16.195185 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"a05e4c33-2066-11e9-88ed-ac1f6b84c124", 
APIVersion:"kubeflow.org", ResourceVersion:"907774", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 06:01:16.201039 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 06:01:16.201097 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"a05e4c33-2066-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"907774", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 06:01:16.208279 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 06:01:16.219381 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 06:01:16.219517 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"a05e4c33-2066-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"907774", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 06:02:18.482720 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 06:02:18.504567 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 06:02:18.504664 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"a05e4c33-2066-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"907774", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 06:02:18.516475 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 06:02:18.516556 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"a05e4c33-2066-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"907934", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 06:03:10.386193 1 reflector.go:428] 
github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 7 items received I0125 06:04:53.383088 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 3 items received I0125 06:05:04.379249 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received I0125 06:05:04.450061 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 6 items received I0125 06:05:12.380160 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 3 items received I0125 06:06:20.737881 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 3 items received I0125 06:07:26.376869 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 5 items received I0125 06:10:54.451619 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received I0125 06:10:56.395679 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received I0125 06:11:51.381933 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received I0125 06:11:53.392881 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received I0125 06:12:35.388371 1 reflector.go:428] 
github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received I0125 06:15:28.746866 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received I0125 06:15:56.386145 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received I0125 06:16:11.461189 1 reflector.go:428] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 0 items received I0125 06:17:26.394824 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 0 items received I0125 06:18:54.391599 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 0 items received I0125 06:20:55.397436 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.StatefulSet total 0 items received I0125 06:20:55.756612 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 0 items received I0125 06:21:12.395438 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 0 items received I0125 06:21:29.397316 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Job total 0 items received I0125 06:23:09.490450 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1 I0125 06:23:09.583878 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1 I0125 06:23:09.585278 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-config I0125 06:23:09.585349 1 
mpi_job_controller.go:736] ignoring orphaned object '/api/v1/namespaces/arena-system/configmaps/mpi-dist-mpijob-config' of mpi job 'mpi-dist-mpijob' I0125 06:23:09.585571 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 06:23:09.585647 1 mpi_job_controller.go:736] ignoring orphaned object '/apis/rbac.authorization.k8s.io/v1/namespaces/arena-system/roles/mpi-dist-mpijob-launcher' of mpi job 'mpi-dist-mpijob' I0125 06:23:09.591902 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 06:23:09.591956 1 mpi_job_controller.go:736] ignoring orphaned object '/apis/rbac.authorization.k8s.io/v1/namespaces/arena-system/rolebindings/mpi-dist-mpijob-launcher' of mpi job 'mpi-dist-mpijob' I0125 06:23:09.593396 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 06:23:09.593437 1 mpi_job_controller.go:736] ignoring orphaned object '/api/v1/namespaces/arena-system/serviceaccounts/mpi-dist-mpijob-launcher' of mpi job 'mpi-dist-mpijob' I0125 06:23:09.595269 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 06:23:09.595321 1 mpi_job_controller.go:736] ignoring orphaned object '/apis/apps/v1/namespaces/arena-system/statefulsets/mpi-dist-mpijob-worker' of mpi job 'mpi-dist-mpijob' I0125 06:23:41.403978 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.Role total 1 items received I0125 06:23:53.485698 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1 I0125 06:23:53.557122 1 mpi_job_controller.go:726] Processing object: mpi-dist.v1 I0125 06:23:53.568328 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-config I0125 06:23:53.575053 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 06:23:53.594692 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 06:23:53.594724 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher 
I0125 06:23:53.615210 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-launcher I0125 06:23:53.624980 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 06:23:53.631835 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 06:23:53.632069 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"c97809f3-2069-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"910120", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 06:23:53.638975 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 06:23:53.639187 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"c97809f3-2069-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"910141", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 06:23:53.646287 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 06:23:53.646359 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"c97809f3-2069-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"910141", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 06:23:53.654713 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 06:23:53.665721 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 06:23:53.665856 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"c97809f3-2069-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"910141", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 06:24:45.471173 1 reflector.go:428] 
github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Watch close - v1alpha1.MPIJob total 3 items received I0125 06:25:42.513340 1 mpi_job_controller.go:726] Processing object: mpi-dist-mpijob-worker I0125 06:25:42.534343 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 06:25:42.534451 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"c97809f3-2069-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"910141", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 06:25:42.546436 1 mpi_job_controller.go:367] Successfully synced 'arena-system/mpi-dist-mpijob' I0125 06:25:42.546647 1 event.go:218] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"arena-system", Name:"mpi-dist-mpijob", UID:"c97809f3-2069-11e9-88ed-ac1f6b84c124", APIVersion:"kubeflow.org", ResourceVersion:"910388", FieldPath:""}): type: 'Normal' reason: 'Synced' MPIJob synced successfully I0125 06:26:18.404809 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ServiceAccount total 3 items received I0125 06:26:58.758335 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.ConfigMap total 6 items received I0125 06:28:08.400954 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - v1.RoleBinding total 2 items received I0125 06:29:08.413920 1 reflector.go:428] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Watch close - *v1.Role total 1 items received

I searched the log for the keyword "error" and found two entries: "E0125 05:58:48.192252 1 mpi_job_controller.go:372] error syncing 'arena-system/mpi-dist-mpijob': Operation cannot be fulfilled on mpijobs.kubeflow.org "mpi-dist-mpijob": the object has been modified; please apply your changes to the latest version and try again" and "E0125 05:56:25.103378 1 mpi_job_controller.go:372] error syncing 'default/mpi-dist-mpijob': Operation cannot be fulfilled on mpijobs.kubeflow.org "mpi-dist-mpijob": the object has been modified; please apply your changes to the latest version and try again". What could cause these errors? Thank you.
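
Searching for the word "error" works, but klog-style logs like this one also mark severity with the first letter of each line (I for info, E for error), so filtering on that prefix is a little more precise. A minimal sketch, assuming the operator log was saved to /tmp/mpi-operator.log as suggested earlier in the thread; the function name klog_errors is made up for illustration:

```shell
#!/usr/bin/env bash
# Print only error-level lines from a klog/glog-formatted log file.
# Such lines start with a severity letter followed by MMDD, e.g. "E0125 05:58:48...".
klog_errors() {
  local logfile="$1"
  grep -E '^E[0-9]{4} ' "$logfile"
}

# Example (path assumed from the earlier "kubectl logs ... &> /tmp/mpi-operator.log" step):
# klog_errors /tmp/mpi-operator.log
```

In the log above, each of the two E lines is followed within milliseconds by a "Successfully synced" line for the same job, which suggests the controller simply retried after the update conflict.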

cheyang commented 5 years ago

I think the reason is that the init container (git) failed. For details, try kubectl logs -c init-code mpi-dist-mpijob-worker-0.
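
To check every worker rather than only worker 0, the same command can be looped. This is a rough sketch, not something arena provides: the function name check_worker_init is hypothetical, and it assumes workers are named <job>-mpijob-worker-<index> and that the git-sync init container is called init-code, as in this thread.

```shell
#!/usr/bin/env bash
# Show the init container state and logs for every worker pod of an MPIJob.
# Assumptions: workers are named <job>-mpijob-worker-<index> (as in this thread)
# and the git-sync init container is called "init-code".
check_worker_init() {
  local job="$1" namespace="$2" workers="$3"
  local i pod
  for ((i = 0; i < workers; i++)); do
    pod="${job}-mpijob-worker-${i}"
    echo "== ${pod} =="
    # State of each init container (waiting / running / terminated).
    kubectl get pod "$pod" -n "$namespace" \
      -o jsonpath='{.status.initContainerStatuses[*].state}'
    echo
    # Output of the git clone step itself.
    kubectl logs "$pod" -n "$namespace" -c init-code
  done
}

# Example: check_worker_init mpi-dist arena-system 2
```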

Eric-Zhang1990 commented 5 years ago

@cheyang Yes, it is a git problem: 'mpi-dist-mpijob-worker-0' (192.168.110.25, master) fails at the git step, but 'mpi-dist-mpijob-worker-1' (192.168.110.158, node) is OK. How can I solve this problem? Thanks.

cheyang commented 5 years ago

As you know, network access to github.com from China is not stable. That's why you hit this issue. I think you can build a Docker image that already contains the code, like what I did in https://github.com/cheyang/tensorflow-sample-code/tree/master/mpijob/docker . That way the job does not rely on internet access at start-up.
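
The idea can be sketched roughly as follows: clone the repository once on a machine that can reach github.com, then copy it into the image at build time. This is only an illustration and not the Dockerfile from the linked repository; the function name write_dockerfile, the target path /benchmarks, and the image tag are all placeholders.

```shell
#!/usr/bin/env bash
# Write a Dockerfile that copies an already-cloned repository into the image,
# so the job no longer needs a git init container at start-up.
# Assumptions: base image, source directory and target path are placeholders.
write_dockerfile() {
  local base_image="$1" src_dir="$2" dest_dir="$3" out_file="$4"
  cat > "$out_file" <<EOF
FROM ${base_image}
COPY ${src_dir} ${dest_dir}
EOF
}

# Example, run on a machine that can reach github.com:
# git clone https://github.com/tensorflow/benchmarks.git
# write_dockerfile horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 benchmarks /benchmarks Dockerfile
# docker build -t my-registry/horovod-benchmarks:latest .
```

With such an image, the job would be submitted with --image pointing at the new tag and without --syncMode and --syncSource, and the training command would use the path inside the image instead of code/benchmarks.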

Eric-Zhang1990 commented 5 years ago

@cheyang Thank you for your kind reply. The issue above is gone now, but mpi-dist-mpijob-launcher is stuck in Pending. Running 'kubectl describe po mpi-dist-mpijob-launcher-wl9nt --namespace arena-system' shows the following:

Events:
Type     Reason            Age                   From               Message
Warning  FailedScheduling  5m36s (x27 over 15m)  default-scheduler  0/2 nodes are available: 2 node(s) didn't match node selector.

I don't know why this happens. Do you have any hints? Thanks.

cheyang commented 5 years ago

It indicates that the labels on your nodes don't match the node selector of the pod mpi-dist-mpijob-launcher-wl9nt. You can check by comparing the output of kubectl get po mpi-dist-mpijob-launcher-wl9nt --namespace arena-system -o=yaml (look at spec.nodeSelector) with the output of kubectl get no -o=yaml (look at metadata.labels).
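
If the full YAML is too noisy, the two pieces of information can be pulled out directly. A small sketch; the function name show_selector_and_labels is made up for illustration, and the pod name is the one from this thread:

```shell
#!/usr/bin/env bash
# Print the node selector a pod asks for, then the labels each node carries.
show_selector_and_labels() {
  local pod="$1" namespace="$2"
  echo "nodeSelector of ${pod}:"
  kubectl get pod "$pod" -n "$namespace" -o jsonpath='{.spec.nodeSelector}'
  echo
  echo "node labels:"
  kubectl get nodes --show-labels
}

# Example: show_selector_and_labels mpi-dist-mpijob-launcher-wl9nt arena-system
```

Every key/value pair in the pod's nodeSelector has to appear among a node's labels for the scheduler to place the pod there.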

Eric-Zhang1990 commented 5 years ago

@cheyang Thank you. I added a label to each node, and the job can run now.
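
For anyone hitting the same scheduling error, labeling the nodes looks roughly like this. The label key and value and the node names below are placeholders, not the ones actually used here; use whatever the pod's nodeSelector asks for. The function name label_nodes is hypothetical.

```shell
#!/usr/bin/env bash
# Apply one label to several nodes so pods with a matching nodeSelector can schedule.
# The label key/value and node names are placeholders, not taken from this thread.
label_nodes() {
  local label="$1"
  shift
  local node
  for node in "$@"; do
    kubectl label nodes "$node" "$label"
  done
}

# Example: label_nodes example-key=example-value node-a node-b
```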