tomassedovic closed this issue 5 years ago
@smarterclayton @wking I think you guys were debugging this issue (or a similar one) a couple of days ago.
Similar behavior with AWS; reproducible with 0.9.1, 0.10, and master:
INFO Creating cluster...
INFO Waiting up to 30m0s for the Kubernetes API...
INFO API v1.11.0+c69f926354 up
INFO Waiting up to 30m0s for the bootstrap-complete event...
ERROR: logging before flag.Parse: E0117 08:32:08.405777 24389 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=3, ErrCode=NO_ERROR, debug=""
WARNING RetryWatcher - getting event failed! Re-creating the watcher. Last RV:
WARNING Failed to connect events watcher: Get https://sbatsche-api.devcluster.openshift.com:6443/api/v1/namespaces/kube-system/events?watch=true: dial tcp 3.87.84.74:6443: connect: connection refused
[..]
FATAL waiting for bootstrap-complete: timed out waiting for the condition
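A quick first check when the watcher keeps reporting connection refused: is anything listening on the API port at all? A minimal sketch (the `probe` helper is my own; the host and port come from the log above):

```shell
# probe HOST PORT: report whether a TCP connection can be opened.
# Uses bash's /dev/tcp pseudo-device; gives up after 3 seconds.
probe() {
  if timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    echo open
  else
    echo closed
  fi
}

# Host and port taken from the installer log above:
# probe sbatsche-api.devcluster.openshift.com 6443
```

If the port reports closed while the bootstrap node is supposed to be serving the API, the problem is earlier than the watcher itself.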
master-0 $ crictl ps -a
CONTAINER ID IMAGE CREATED STATE NAME ATTEMPT
8b9bc36c0c276 quay.io/coreos/etcd@sha256:cb9cee3d9d49050e7682fde0a9b26d6948a0117b1b4367b8170fcaa3960a57b8 35 minutes ago Running etcd-member 0
35833a944e129 quay.io/coreos/kube-client-agent@sha256:d68f85b5ca3adccdc2f4a4c5263f1792798ed44a9b1d63a96004b6e283dc338d 35 minutes ago Exited certs 0
cf8bdc1886764 registry.svc.ci.openshift.org/openshift/origin-v4.0@sha256:3d19dfefa543bb8194e18c48716c7132f9c9b982611a14a14eb5cceeaa60bcf4 35 minutes ago Exited discovery 0
etcd shows:
2019-01-17T13:10:11.150313341+00:00 stderr F 2019-01-17 13:10:11.150277 I | embed: rejected connection from "127.0.0.1:37686" (error "tls: failed to verify client's certificate: x509: certificate specifies an incompatible key usage", ServerName "")
Going to pull the etcd certs and take a closer look. etcd has stopped on all 3 nodes so that is a start.
So my server cert SAN looks like this:
DNS:localhost, DNS:etcd.kube-system.svc, DNS:etcd.kube-system.svc.cluster.local, DNS:sbatsche-etcd-0.devcluster.openshift.com, IP Address:10.0.15.187, IP Address:127.0.0.1
and the peer cert like so:
DNS:sbatsche-etcd-0.devcluster.openshift.com, DNS:sbatsche.devcluster.openshift.com, IP Address:10.0.15.187
But my peer cert does not include localhost or 127.0.0.1. If we are using the peer cert as a client cert, this would explain the error. Is this a regression? I will try to backtrack through this process.
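To compare the certs directly on a node, something like this works (the helper is my own sketch; the cert path shown is the one mounted into the etcd pod in the etcdctl invocation below):

```shell
# show_cert_ids CERT: print the SAN and key-usage extensions of a PEM
# certificate, to check whether localhost/127.0.0.1 are present and
# whether the cert allows client authentication.
show_cert_ids() {
  openssl x509 -in "$1" -noout -text \
    | grep -A1 -E "Subject Alternative Name|Key Usage"
}

# Example path from the etcd node:
# show_cert_ids "/etc/ssl/etcd/system:etcd-peer:sbatsche-etcd-0.devcluster.openshift.com.crt"
```

A cert used for client auth should list "TLS Web Client Authentication" under Extended Key Usage; its absence matches the "incompatible key usage" rejection in the etcd log.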
AWS, 0.9.1
INFO Waiting up to 30m0s for the bootstrap-complete event...
ERROR: logging before flag.Parse: E0116 17:16:21.986936 75976 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=3, ErrCode=NO_ERROR, debug=""
WARNING RetryWatcher - getting event failed! Re-creating the watcher. Last RV: 53
I was able to verify etcd is actually working correctly.
crictl exec 134f8b9a259c1 /bin/sh -c "ETCDCTL_API=3 etcdctl --endpoints="https://10.0.15.187:2379,https://10.0.30.46:2379,https://10.0.35.250:2379" --cert /etc/ssl/etcd/system:etcd-peer:sbatsche-etcd-0.devcluster.openshift.com.crt --key /etc/ssl/etcd/system:etcd-peer:sbatsche-etcd-0.devcluster.openshift.com.key --cacert /etc/ssl/etcd/ca.crt endpoint status -w table"
+--------------------------+------------------+---------+---------+-----------+-----------+------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+--------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://10.0.15.187:2379 | b261054f85985ee6 | 3.3.10 | 160 kB | false | 2 | 522 |
| https://10.0.30.46:2379 | 1e915deba4617a88 | 3.3.10 | 164 kB | true | 2 | 522 |
| https://10.0.35.250:2379 | 74e95148d793879b | 3.3.10 | 156 kB | false | 2 | 522 |
+--------------------------+------------------+---------+---------+-----------+-----------+------------+
Note the absence of machine-api-operator, clusterapi-manager-controllers and so on.
This looks like cluster version operator is failing to progress on installing operators.
oc get clusterversion version -oyaml
will give you insight into the status of the cluster version operator, and
oc logs deploy/cluster-version-operator -n openshift-cluster-version
will get you its logs.
The most recent set of changes to the CVO ensure that we can now do:
oc wait clusterversion/version --for=condition=available
during install to wait until the payload is delivered, which should mean all operators report they are happy and healthy. I'm going to be tweaking the startup loops in the CI env for that now.
Thanks for the CVO pointers. I'm seeing this:
- lastTransitionTime: 2019-01-18T10:56:21Z
  message: Could not update clusteroperator "kube-controller-manager" (config.openshift.io/v1, 59 of 218)
  reason: UpdatePayloadFailed
  status: "True"
  type: Failing
- lastTransitionTime: 2019-01-17T11:53:08Z
  message: 'Unable to apply 4.0.0-0.alpha-2019-01-17-070151: the update could not be applied'
  reason: UpdatePayloadFailed
  status: "True"
  type: Progressing
- lastTransitionTime: 2019-01-17T11:53:08Z
  message: 'Unable to retrieve available updates: Get http://localhost:8080/graph: dial tcp [::1]:8080: connect: connection refused'
  reason: RemoteFailed
  status: "False"
  type: RetrievedUpdates
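To pull out just the Failing condition without scanning the whole YAML, a filter like this works. It's illustrated on a small JSON fixture mimicking the conditions above; against a live cluster you would point it at the output of `oc get clusterversion version -o json` instead:

```shell
# Minimal fixture with the same shape as the clusterversion status above.
cat > /tmp/cv.json <<'EOF'
{"status": {"conditions": [
  {"type": "Failing", "status": "True", "reason": "UpdatePayloadFailed",
   "message": "Could not update clusteroperator \"kube-controller-manager\""},
  {"type": "Progressing", "status": "True", "reason": "UpdatePayloadFailed"}
]}}
EOF

# Print reason and message of the Failing condition (stdlib python3 only).
python3 -c '
import json
cv = json.load(open("/tmp/cv.json"))
c = next(c for c in cv["status"]["conditions"] if c["type"] == "Failing")
print(c["reason"] + ": " + c["message"])
'
```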
Checking the logs doesn't seem to work for me though. I'm getting this error:
$ oc logs deploy/cluster-version-operator -n openshift-cluster-version
Error from server: Get https://ostest-master-0:10250/containerLogs/openshift-cluster-version/cluster-version-operator-56dcfc9b89-4sw46/cluster-version-operator: remote error: tls: internal error
Okay, I dug a bit deeper. Here's the chain of events:
The installer doesn't receive the bootstrap-complete event because progress.service can't find the /opt/openshift/.openshift.done file (.bootkube.done exists). That file is supposed to be created by openshift.service, but it isn't, because of:
Jan 25 08:40:12 ostest-bootstrap openshift.sh[3333]: Creating object from file: ./99_openshift-cluster-api_cluster.yaml ...
Jan 25 08:40:12 ostest-bootstrap openshift.sh[3333]: Executing kubectl create --filename ./99_openshift-cluster-api_cluster.yaml
Jan 25 08:40:13 ostest-bootstrap openshift.sh[3333]: error: unable to recognize "./99_openshift-cluster-api_cluster.yaml": no matches for kind "Cluster" in version "cluster.k8s.io/v1alpha1"
Jan 25 08:40:13 ostest-bootstrap openshift.sh[3333]: kubectl create --filename ./99_openshift-cluster-api_cluster.yaml failed. Retrying in 5 seconds...
Jan 25 08:40:18 ostest-bootstrap openshift.sh[3333]: error: unable to recognize "./99_openshift-cluster-api_cluster.yaml": no matches for kind "Cluster" in version "cluster.k8s.io/v1alpha1"
Jan 25 08:40:18 ostest-bootstrap openshift.sh[3333]: kubectl create --filename ./99_openshift-cluster-api_cluster.yaml failed. Retrying in 5 seconds...
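The retry behavior in those log lines boils down to a loop like this (a simplified sketch of what openshift.sh appears to do, not its exact code):

```shell
# Keep retrying `kubectl create` until the API server accepts the manifest,
# e.g. once the CRD for its kind has been registered.
create_with_retry() {
  until kubectl create --filename "$1"; do
    echo "kubectl create --filename $1 failed. Retrying in 5 seconds..."
    sleep 5
  done
}
```

Note this never gives up on its own, which is why the failure only surfaces via the installer's 30-minute timeout.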
The 99_openshift-cluster-api_cluster.yaml file contains the following:
apiVersion: cluster.k8s.io/v1alpha1
kind: Cluster
metadata:
  creationTimestamp: null
  name: ostest
  namespace: openshift-cluster-api
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
      - 10.128.0.0/14
    serviceDomain: ""
    services:
      cidrBlocks:
      - 172.30.0.0/16
  providerSpec: {}
status: {}
That looks good to me, and using kind: Cluster should be perfectly fine. So I'm guessing something on the master control plane is not quite right.
@wking any tips on how to debug this further?
no matches for kind "Cluster" means the CVO likely hung before pushing the Cluster CRD (or whoever is in charge of registering Cluster failed to do it). Check ClusterVersion, ClusterOperators, and the CVO logs?
Okay, so the issue seems to be coming from here:
oc logs kube-controller-manager-operator-578cf7cc4b-p2djt -n openshift-kube-controller-manager-operator
I0125 10:43:22.077378 1 event.go:221] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-controller-manager-operator", Name:"kube-controller-manager-operator", UID:"b5eda8c5-207d-11e9-9991-fa163e3a36e9", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'ObserveCloudProvidersFailed' No recognized cloud provider platform found in cloud config: map[string]interface {}{"openstack":map[string]interface {}{"externalNetwork":"public", "region":"regionOne", "trunkSupport":"1", "cloud":"standalone", "computeFlavor":"m1.medium"}}
I0125 10:43:22.102562 1 status_controller.go:160] clusteroperator/kube-controller-manager diff {"metadata":{"name":"kube-controller-manager","selfLink":"/apis/config.openshift.io/v1/clusteroperators/kube-controller-manager","uid":"bc982578-207d-11e9-afa2-fa163ea976c4","resourceVersion":"42510","generation":1,"creationTimestamp":"2019-01-25T08:46:41Z"},"spec":{},"status":{"conditions":[{"type":"Failing","status":"True","lastTransitionTime":"2019-01-25T08:46:42Z","reason":"ConfigObservationFailing","message":"ConfigObservationFailing: configmap/cluster-config-v1.kube-system: no recognized cloud provider platform found: map[string]interface {}{\"openstack\":map[string]interface {}{\"c
A: omputeFlavor\":\"m1.medium\", \"externalNetwork\":\"public\", \"region\":\"regionOne\", \"trunkSupport\":\"1\", \"cloud\":\"standalone\"}}"},{"type":"Available","status":"True","lastTransitionTime":null,"message":"3 of 3 nodes are at revision 2"},{"type":"Progressing","status":"False","lastTransitionTime":null,"reason":"AllNodesAtLatestRevision","message":"3 of 3 nodes are at revision 2"}],"versions":null,"relatedObjects":[{"group":"kubecontrollermanager.operator.openshift.io","resource":"kubecontrollermanageroperatorconfigs","name":"cluster"},{"group":"","resource":"namespaces","name":"openshift-config"},{"group":"","resource":"namespaces","name":"openshift-config-managed"},{"group":"","resource":"namespaces","name":"openshift-kube-controller-manager"},{"group":"","resource":"namespaces","name":"openshift-kube-controller-manager-operator"}],"extension":null}}
B: loud\":\"standalone\", \"computeFlavor\":\"m1.medium\", \"externalNetwork\":\"public\", \"region\":\"regionOne\", \"trunkSupport\":\"1\"}}"},{"type":"Available","status":"True","lastTransitionTime":null,"message":"3 of 3 nodes are at revision 2"},{"type":"Progressing","status":"False","lastTransitionTime":null,"reason":"AllNodesAtLatestRevision","message":"3 of 3 nodes are at revision 2"}],"versions":null,"relatedObjects":[{"group":"kubecontrollermanager.operator.openshift.io","resource":"kubecontrollermanageroperatorconfigs","name":"cluster"},{"group":"","resource":"namespaces","name":"openshift-config"},{"group":"","resource":"namespaces","name":"openshift-config-managed"},{"group":"","resource":"namespaces","name":"openshift-kube-controller-manager"},{"group":"","resource":"namespaces","name":"openshift-kube-controller-manager-operator"}],"extension":null}}
I0125 10:43:22.115805 1 event.go:221] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-controller-manager-operator", Name:"kube-controller-manager-operator", UID:"b5eda8c5-207d-11e9-9991-fa163e3a36e9", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for operator kube-controller-manager changed: Failing message changed from "ConfigObservationFailing: configmap/cluster-config-v1.kube-system: no recognized cloud provider platform found: map[string]interface {}{\"openstack\":map[string]interface {}{\"computeFlavor\":\"m1.medium\", \"externalNetwork\":\"public\", \"region\":\"regionOne\", \"trunkSupport\":\"1\", \"cloud\":\"standalone\"}}" to "ConfigObservationFailing: configmap/cluster-config-v1.kube-system: no recognized cloud provider platform found: map[string]interface {}{\"openstack\":map[string]interface {}{\"cloud\":\"standalone\", \"computeFlavor\":\"m1.medium\", \"externalNetwork\":\"public\", \"region\":\"regionOne\", \"trunkSupport\":\"1\"}}"
Looks like we need to add openstack there:
Version
Platform (aws|libvirt|openstack):
openstack (but I've seen private reports on non-openstack platform as well, not sure whether libvirt or aws)
What happened?
The installer never receives the bootstrap-complete event, and there is an event-decoding error in the installer log:
What did you expect to happen?
The installer switches from the bootstrap control plane to the masters and deletes the bootstrap node. All the pods that are supposed to run on the masters will be there.
How to reproduce it (as minimally and precisely as possible)?
Create ostest/install-config.yaml with the following contents:
And then install the cluster:
Anything else we need to know?
I'm seeing the same error with a checkout that was working fine for me last week (commit e539a993e81edcf598696eefddbb19f5c7b9d23c). It's also present on the master branch.
So this is probably an issue with one of the images or operators rather than a regression in the installer itself.
That is where the error manifests though, so I'm logging it here in hopes of folks here pointing me in the right direction.
The master nodes seem to come up fine:
Bootkube doesn't seem to show any unexpected problems:
But it seems a bunch of the deployments that should be running are missing:
Note the absence of machine-api-operator, clusterapi-manager-controllers, and so on. Similarly, there are usually more containers running on the master nodes:
No apparent errors or high restart counts though.