openshift / openshift-ansible

Install and configure an OpenShift 3.x cluster
https://try.openshift.com
Apache License 2.0

Stops at: Wait for the ServiceMonitor CRD to be created #10916

Closed: frippe75 closed this issue 5 years ago

frippe75 commented 5 years ago

Description

Deploying using the OpenStack playbooks. I worked through some DNS issues, but the install fails quite late in the process.

Version
Running on CentOS 7.5, OpenStack Rocky release, installed via Packstack.

Installed via the 3.11 release zip file:
openshift-ansible-openshift-ansible-3.11.59-1

$  ansible --version
ansible 2.7.4
  config file = /home/centos/openshift-on-openstack/ansible.cfg
  configured module search path = [u'/home/centos/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Jul 13 2018, 13:06:57) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]
Observed Results


fatal: [master-0.os.lab.net]: FAILED! => {
    "attempts": 30,
    "changed": true,
    "cmd": [
        "oc",
        "get",
        "crd",
        "servicemonitors.monitoring.coreos.com",
        "-n",
        "openshift-monitoring",
        "--config=/tmp/openshift-cluster-monitoring-ansible-HpkEk0/admin.kubeconfig"
    ],
    "delta": "0:00:01.398161",
    "end": "2018-12-20 15:12:04.452865",
    "invocation": {
        "module_args": {
            "_raw_params": "oc get crd servicemonitors.monitoring.coreos.com -n openshift-monitoring --config=/tmp/openshift-cluster-monitoring-ansible-HpkEk0/admin.kubeconfig",
            "_uses_shell": false,
            "argv": null,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "warn": true
        }
    },
    "msg": "non-zero return code",
    "rc": 1,
    "start": "2018-12-20 15:12:03.054704",
    "stderr": "No resources found.\nError from server (NotFound): customresourcedefinitions.apiextensions.k8s.io \"servicemonitors.monitoring.coreos.com\" not found",
    "stderr_lines": [
        "No resources found.",
        "Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io \"servicemonitors.monitoring.coreos.com\" not found"
    ],
    "stdout": "",
    "stdout_lines": []

Failure summary:
  1. Hosts:    master-0.os.lab.net
     Play:     Configure Cluster Monitoring Operator
     Task:     Wait for the ServiceMonitor CRD to be created
     Message:  non-zero return code

PLAY RECAP ***************************************************************************************************************************************
app-node-0.os.lab.net      : ok=198  changed=30   unreachable=0    failed=0
app-node-1.os.lab.net      : ok=198  changed=30   unreachable=0    failed=0
app-node-2.os.lab.net      : ok=222  changed=30   unreachable=0    failed=0
etcd-0.os.lab.net          : ok=105  changed=8    unreachable=0    failed=0
etcd-1.os.lab.net          : ok=105  changed=8    unreachable=0    failed=0
etcd-2.os.lab.net          : ok=122  changed=9    unreachable=0    failed=0
infra-node-0.os.lab.net    : ok=198  changed=30   unreachable=0    failed=0
localhost                  : ok=141  changed=18   unreachable=0    failed=0
master-0.os.lab.net        : ok=628  changed=167  unreachable=0    failed=1
master-1.os.lab.net        : ok=346  changed=53   unreachable=0    failed=0
master-2.os.lab.net        : ok=346  changed=53   unreachable=0    failed=0


frippe75 commented 5 years ago

I searched the masters and found:

$ sudo oc get pods --all-namespaces --config=/tmp/openshift-cluster-monitoring-ansible-HpkEk0/admin.kubeconfig
NAMESPACE              NAME                                           READY     STATUS             RESTARTS   AGE
openshift-monitoring   cluster-monitoring-operator-6465f8fbc7-ng4rv   0/1       ImagePullBackOff   0          5h

A few seconds later:

openshift-monitoring   cluster-monitoring-operator-6465f8fbc7-ng4rv   0/1       ErrImagePull   0          5h

Running oc describe pod on it shows:

Events:
  Type     Reason   Age                 From                              Message
  ----     ------   ----                ----                              -------
  Normal   BackOff  1h (x816 over 5h)   kubelet, infra-node-0.os.lab.net  Back-off pulling image "quay.io/coreos/cluster-monitoring-operator:v0.1.1"
  Normal   Pulling  56m (x46 over 5h)   kubelet, infra-node-0.os.lab.net  pulling image "quay.io/coreos/cluster-monitoring-operator:v0.1.1"
  Warning  Failed   6m (x1063 over 5h)  kubelet, infra-node-0.os.lab.net  Error: ImagePullBackOff
  Warning  Failed   22s (x54 over 5h)   kubelet, infra-node-0.os.lab.net  Failed to pull image "quay.io/coreos/cluster-monitoring-operator:v0.1.1": rpc error: code = Canceled desc = context canceled

How can I restart it, pull the image manually, or otherwise fix this?

frippe75 commented 5 years ago

I'm behind a fairly slow proxy. If this is related to timeouts or something similar, could I extend them?

A manual docker pull works fine, but it takes a while.
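
For reference, a minimal sketch of pointing the docker daemon at a proxy on CentOS 7, assuming the docker package whose systemd unit sources /etc/sysconfig/docker; the proxy URL is a placeholder:

$ sudo vi /etc/sysconfig/docker
# add, adjusting to your proxy (these variables are read from the unit's EnvironmentFile):
HTTP_PROXY=http://proxy.example.com:3128
HTTPS_PROXY=http://proxy.example.com:3128
NO_PROXY=127.0.0.1,localhost,.os.lab.net
$ sudo systemctl restart docker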

DizzyThermal commented 5 years ago

@frippe75 I'm experiencing the same issue here.

Wait for the ServiceMonitor CRD to be created:

Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" not found

I'm still poking around to see if it's my configuration or not.

If I add:

openshift_cluster_monitoring_operator_install=false

to my inventory, it skips this step, but that isn't ideal. Additionally, if I do skip this install, the Web Console will fail to install.
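
For context, a sketch of where that variable sits in the inventory; the group name is the standard OSEv3 one, and the other line is just an example value:

[OSEv3:vars]
openshift_deployment_type=origin
openshift_cluster_monitoring_operator_install=false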

DizzyThermal commented 5 years ago

Some additional information:

In my environment, running oc get pods --all-namespaces returns:

NAMESPACE              NAME                                           READY     STATUS             RESTARTS   AGE
default                docker-registry-1-deploy                       0/1       Pending            0          2m
default                registry-console-1-deploy                      0/1       Pending            0          2m
default                router-1-deploy                                0/1       Pending            0          3m
kube-system            master-api-master.example.com                  1/1       Running            0          7m
kube-system            master-controllers-master.example.com          1/1       Running            0          7m
kube-system            master-etcd-master.example.com                 1/1       Running            0          7m
openshift-monitoring   cluster-monitoring-operator-6465f8fbc7-hkxb9   0/1       Pending            0          2m
openshift-node         sync-s5xsp                                     1/1       Running            0          5m
openshift-node         sync-sws9w                                     1/1       Running            0          4m
openshift-node         sync-w5rbt                                     1/1       Running            0          4m
openshift-sdn          ovs-49pjp                                      1/1       Running            0          4m
openshift-sdn          ovs-7m2vk                                      1/1       Running            0          4m
openshift-sdn          ovs-t7zfc                                      1/1       Running            0          4m
openshift-sdn          sdn-rlbk8                                      0/1       CrashLoopBackOff   5          4m
openshift-sdn          sdn-wf7lf                                      0/1       CrashLoopBackOff   5          4m
openshift-sdn          sdn-z27xl                                      0/1       CrashLoopBackOff   7          4m

This may be a red herring, but my openshift-sdn pods are also in a CrashLoopBackOff state. I'm not sure if this is causing the ServiceMonitor CRD to fail, or if there's another root cause here.
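
A sketch of the usual next step for digging into those pods; the pod name is taken from the listing above:

$ oc -n openshift-sdn logs sdn-rlbk8 --previous
$ oc -n openshift-sdn describe pod sdn-rlbk8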

mmgil commented 5 years ago

Is this occurring only in version 3.11?

I will try an install with openshift_release="3.10".

vrutkovs commented 5 years ago

How can I restart it, pull the image manually, or otherwise fix this?

You should re-pull it manually and rerun the playbook.
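
A minimal sketch of that, assuming a stock 3.11 openshift-ansible checkout; the inventory path is a placeholder:

# on the node stuck in ImagePullBackOff (infra-node-0 in the report above):
$ sudo docker pull quay.io/coreos/cluster-monitoring-operator:v0.1.1
# then, from the deployment host:
$ ansible-playbook -i inventory/hosts playbooks/deploy_cluster.yml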

May be a red herring, but my openshift-sdn nodes are also in a CrashLoopBackOff state

That's a different problem; SDN is required for the cluster to function.

None of the problems are openshift-ansible issues, closing this.


Please avoid commenting with "I have the same problem too": in most cases these are different problems with different solutions. Open a new issue instead.

liuyatao commented 5 years ago

@DizzyThermal Is this problem resolved?

frippe75 commented 5 years ago

How can I restart it, pull the image manually, or otherwise fix this?

You should re-pull it manually and rerun the playbook. None of the problems are openshift-ansible issues, closing this.

I tried to re-deploy and hit the same issue at the same step; it happens 100 times out of 100. So is the wait period for this pull shorter than for other images, or is the failure simply due to the image size? I will try to find the answer myself, but a playbook that fails consistently, and for multiple users, is something that could be solved and should be considered an issue. I will see if I can address it, but simply closing it without discussing or reasoning about it seems forced. Sorry, I don't want this to come across as rude.
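
For anyone digging further: the "attempts": 30 in the failure output corresponds to the retries setting on the wait task, so one local workaround is to raise it in the checkout before rerunning. A sketch for locating the task; the exact role path is an assumption:

$ grep -rn "Wait for the ServiceMonitor CRD" roles/
# edit the matching task file and raise its retries/delay values, then rerun the playbook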

trivoallan commented 5 years ago

Hi,

I'm having the same issue upgrading from 3.10 to 3.11.

sdressler commented 5 years ago

Same here with 3.11.

gonzalo- commented 5 years ago

Same here with 3.11.