openshift / origin

Conformance test suite for OpenShift
http://www.openshift.org
Apache License 2.0

oc cluster up --version="v3.9" hangs on web console installation task #18913

Closed vikaschoudhary16 closed 6 years ago

vikaschoudhary16 commented 6 years ago
$ git branch
  master
* release-3.9
I0309 05:09:21.920761   64373 webconsole.go:87] instantiating web console template with parameters map[IMAGE:openshift/origin-web-console:v3.9 LOGLEVEL:0 NAMESPACE:openshift-web-console API_SERVER_CONFIG:apiVersion: webconsole.config.openshift.io/v1
clusterInfo:
  consolePublicURL: https://127.0.0.1:8443/console/
  loggingPublicURL: ""
  logoutPublicURL: ""
  masterPublicURL: https://127.0.0.1:8443
  metricsPublicURL: ""
extensions:
  properties: null
  scriptURLs: []
  stylesheetURLs: []
features:
  clusterResourceOverridesEnabled: false
  inactivityTimeoutMinutes: 0
kind: WebConsoleConfiguration
servingInfo:
  bindAddress: 0.0.0.0:8443
  bindNetwork: tcp4
  certFile: /var/serving-cert/tls.crt
  clientCA: ""
  keyFile: /var/serving-cert/tls.key
  maxRequestsInFlight: 0
  namedCertificates: null
  requestTimeoutSeconds: 0
]
I0309 05:09:22.950380   64373 webconsole.go:96] polling for web console server availability
I0309 05:09:23.950444   64373 webconsole.go:96] polling for web console server availability
I0309 05:09:24.950389   64373 webconsole.go:96] polling for web console server availability
I0309 05:09:25.950368   64373 webconsole.go:96] polling for web console server availability
I0309 05:19:21.952413   64373 webconsole.go:96] polling for web console server availability
...
...
FAIL
   Error: failed to start the web console server: timed out waiting for the condition

On the same machine, if I switch to the release-3.7 branch, it works.

Version

oc v3.9.0-alpha.4+9fd063a-531
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Steps To Reproduce
  1. git checkout release-3.9
  2. make
  3. export PATH=$PWD/_output/local/bin/linux/amd64:$PATH; oc cluster up --version="v3.9"
Current Result

oc cluster up fails after hanging for 10 minutes

Expected Result

No errors

Additional Information
[root@fed-node ~]# **docker exec -it origin bash**
[root@fed-node origin]# oc get namespaces
NAME                    STATUS    AGE
default                 Active    53s
kube-public             Active    53s
kube-system             Active    53s
openshift               Active    52s
openshift-infra         Active    52s
openshift-node          Active    47s
**openshift-web-console   Active    44s**
[root@fed-node origin]# oc get pods -n openshift-web-console
NAME                          READY     STATUS    RESTARTS   AGE
**webconsole-548fd9b7c4-svft2   0/1       Pending   0          56s**
[root@fed-node origin]# oc describe pod webconsole-548fd9b7c4-svft2 -n openshift-web-console
Name:           webconsole-548fd9b7c4-svft2
Namespace:      openshift-web-console
Node:           <none>
Labels:         app=openshift-web-console
                pod-template-hash=1049856370
                webconsole=true
Annotations:    openshift.io/scc=restricted
Status:         Pending
IP:             
Controlled By:  ReplicaSet/webconsole-548fd9b7c4
Containers:
  webconsole:
    Image:  openshift/origin-web-console:v3.9
    Port:   8443/TCP
    Command:
      /usr/bin/origin-web-console
      --audit-log-path=-
      -v=0
      --config=/var/webconsole-config/webconsole-config.yaml
    Requests:
      cpu:     100m
      memory:  100Mi
    Liveness:  exec [/bin/sh -i -c if [[ ! -f /tmp/webconsole-config.hash ]]; then \
  md5sum /var/webconsole-config/webconsole-config.yaml > /tmp/webconsole-config.hash; \
elif [[ $(md5sum /var/webconsole-config/webconsole-config.yaml) != $(cat /tmp/webconsole-config.hash) ]]; then \
  exit 1; \
fi && curl -k -f https://0.0.0.0:8443/console/] delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:    http-get https://:8443/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from webconsole-token-b9vfw (ro)
      /var/serving-cert from serving-cert (rw)
      /var/webconsole-config from webconsole-config (rw)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  serving-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  webconsole-serving-cert
    Optional:    false
  webconsole-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      webconsole-config
    Optional:  false
  webconsole-token-b9vfw:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  webconsole-token-b9vfw
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
Events:
  Type     Reason            Age               From               Message
  ----     ------            ----              ----               -------
  **Warning  FailedScheduling  13s (x8 over 1m)  default-scheduler  0/1 nodes are available: 1 NodeUnderDiskPressure.**
[root@fed-node origin]# oc get nodes
NAME        STATUS    ROLES     AGE       VERSION
localhost   Ready     <none>    1m        v1.9.1+a0ce1bc657
[root@fed-node origin]# df -h
Filesystem      Size  Used Avail Use% Mounted on
**/dev/dm-7        10G  1.4G  8.7G  14% /**
devtmpfs        9.6G     0  9.6G   0% /dev
shm              64M     0   64M   0% /dev/shm
/dev/dm-2        99G   86G  7.4G  93% /rootfs
tmpfs           9.6G  792M  8.8G   9% /rootfs/dev/shm
tmpfs           9.6G     0  9.6G   0% /sys/fs/cgroup
tmpfs           9.6G  1.7M  9.6G   1% /run
tmpfs           2.0G   52K  2.0G   1% /run/user/1000
tmpfs           9.6G  698M  8.9G   8% /rootfs/tmp
/dev/sda1       976M  180M  730M  20% /rootfs/boot
/dev/dm-5       356G   92G  246G  28% /rootfs/home
[root@fed-node origin]# oc get nodes
NAME        STATUS    ROLES     AGE       VERSION
localhost   Ready     <none>    2m        v1.9.1+a0ce1bc657
$ netstat -apn | grep 8443
tcp        0      0 0.0.0.0:8443            0.0.0.0:*               LISTEN      72685/openshift     
tcp        0      0 127.0.0.1:33870         127.0.0.1:8443          ESTABLISHED 71907/oc            
tcp        0      0 127.0.0.1:8443          127.0.0.1:33858         ESTABLISHED 72685/openshift     
tcp        0      0 127.0.0.1:8443          127.0.0.1:33888         ESTABLISHED 72685/openshift     
tcp        0      0 127.0.0.1:33832         127.0.0.1:8443          ESTABLISHED 72685/openshift     
tcp        0      0 127.0.0.1:8443          127.0.0.1:33916         ESTABLISHED 72685/openshift     
tcp        0      0 127.0.0.1:33916         127.0.0.1:8443          ESTABLISHED 72685/openshift     
tcp        0      0 127.0.0.1:33888         127.0.0.1:8443          ESTABLISHED 71907/oc            
tcp        0      0 127.0.0.1:8443          127.0.0.1:33870         ESTABLISHED 72685/openshift     
tcp        0      0 127.0.0.1:33858         127.0.0.1:8443          ESTABLISHED 72685/openshift     
tcp        0      0 127.0.0.1:8443          127.0.0.1:33622         ESTABLISHED 72685/openshift     
tcp        0      0 127.0.0.1:33586         127.0.0.1:8443          TIME_WAIT   -                   
tcp        0      0 127.0.0.1:33622         127.0.0.1:8443          ESTABLISHED 72685/openshift     
tcp        0      0 127.0.0.1:8443          127.0.0.1:33832         ESTABLISHED 72685/openshift   
$ docker info
Containers: 16
 Running: 1
 Paused: 0
 Stopped: 15
Images: 2
Server Version: 1.13.1
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: systemd
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
 Authorization: rhel-push-plugin
Swarm: inactive
Runtimes: docker-runc runc
Default Runtime: docker-runc
Init Binary: docker-init
containerd version:  (expected: aa8187dbd3b7ad67d8e5e3a15115d3eef43a7ed1)
runc version: N/A (expected: 9df8b306d01f59d3a8029be411de015b7304dd8f)
init version: N/A (expected: 949e6facb77383876aeff8a6944dde66b3089574)
Security Options:
 seccomp
  WARNING: You're not using the default seccomp profile
  Profile: /etc/docker/seccomp.json
 selinux
Kernel Version: 3.10.0-858.el7.x86_64
Operating System: 3scale
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 3
CPUs: 16
Total Memory: 125.7 GiB
Name: dell-r620-01.perf.lab.eng.rdu.redhat.com
ID: FPT3:4D74:OEIJ:MQFS:7DCR:34CQ:VECA:V2OL:DOHE:SYAP:63TF:ZXGN
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://registry.access.redhat.com/v1/
Experimental: false
Insecure Registries:
 172.30.0.0/16
 127.0.0.0/8
Live Restore Enabled: false
Registries: registry.access.redhat.com (secure), docker.io (secure)
$ uname -a
Linux dell-r620-01.perf.lab.eng.rdu.redhat.com 3.10.0-858.el7.x86_64 #1 SMP Tue Feb 27 08:59:23 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
$
$ cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.5 (Maipo)
vikaschoudhary16 commented 6 years ago

The node was indeed reporting disk pressure:

DiskPressure True Sat, 10 Mar 2018 05:33:46 +0000 Sat, 10 Mar 2018 02:39:48 +0000 KubeletHasDiskPressure kubelet has disk pressure

I0310 06:32:39.985159   55790 helpers.go:829] eviction manager: observations: signal=allocatableMemory.available, available: 131710276Ki, capacity: 131815376Ki
I0310 06:32:39.985171   55790 helpers.go:843] eviction manager: thresholds - ignoring grace period: threshold [signal=imagefs.available, quantity=8049131839] observed 7809724Ki
I0310 06:32:39.985179   55790 helpers.go:843] eviction manager: thresholds - reclaim not satisfied: threshold [signal=imagefs.available, quantity=8049131839] observed 7809724Ki
I0310 06:32:39.985193   55790 eviction_manager.go:284] eviction manager: node conditions - observed: [DiskPressure]
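
For context, a back-of-the-envelope conversion (mine, not from the logs): the eviction threshold is reported in bytes while the observation is in Ki, and converting to the same unit shows the node was only a few tens of MiB short of clearing it.

    # Threshold vs. observation, converted to the same unit:
    #   8049131839 bytes / 1024 ≈ 7860480 Ki    (imagefs.available eviction threshold)
    #   observed imagefs.available = 7809724 Ki (still below the threshold, so DiskPressure stays set)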

After freeing up some space on the root filesystem, `oc cluster up` completed without any problems.
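
A quick way to confirm the condition has cleared (a sketch of one possible check, not from the original report; run it from inside the `origin` container like the other `oc` commands above, with `localhost` being the node name from `oc get nodes`):

    # Print the DiskPressure condition of the single cluster-up node;
    # expect "False" once enough space has been freed.
    oc get node localhost \
      -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}'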

The answer to the question of why it was working with 3.7 but failing only with 3.9 is the following code block, which exists in 3.9 but not in 3.7 (in fact, all of webconsole.go is new in 3.9):

    err = wait.Poll(1*time.Second, 10*time.Minute, func() (bool, error) {
        glog.V(2).Infof("polling for web console server availability")
        ds, err := kubeClient.Extensions().Deployments(consoleNamespace).Get("webconsole", metav1.GetOptions{})
        if err != nil {
            return false, err
        }
        if ds.Status.ReadyReplicas > 0 {
            return true, nil
        }
        return false, nil
    })
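
So the 10-minute hang is just this poll waiting for the `webconsole` deployment to report a ready replica, and the scheduler never places the pod while the node has DiskPressure. One way to watch what the poll is waiting on (a sketch, not part of the original report; again run from inside the `origin` container):

    # The poll above only succeeds once readyReplicas > 0 on this deployment.
    oc get deployment webconsole -n openshift-web-console \
      -o jsonpath='{.status.readyReplicas}{"\n"}'
    # Empty or 0 output together with a Pending webconsole pod (see the
    # `oc describe pod` output above) means scheduling is the blocker.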

Closing the issue.