cornelius-keller closed 3 years ago
Began testing the release defined in this PR. While reviewing components for the release, I noticed two components were updated today and should be included: kube-state-metrics 1.2.0 and node-exporter 1.4.1.
kube-state-metrics 1.2.0 and node-exporter 1.4.1 have been added to the release. They are internal changes to the app packaging only.
Taking into KVM.
There is an issue with kube-state-metrics that requires a manual delete of the deployment on upgrade. More context here.
Currently it's not possible to run cncf against our test installations because of the test clusters mentioned here: https://github.com/giantswarm/giantswarm/issues/14236
So I tried running cncf 3 times in prow and got 3 different failures:
1. A test failed and then sonobuoy couldn't report what was happening:
PLUGIN         STATUS     RESULT   COUNT
e2e            failed     failed   1
systemd-logs   complete   passed   4
[2020-11-18T15:43:23Z] Results summary
timeout waiting for results
[2020-11-18T15:43:25Z] Sonobuoy logs
time="2020-11-18T15:43:35Z" level=error msg="could not create sonobuoy client: couldn't get sonobuoy api helper: could not get api group resources: Get \"https://api.fgd7h.k8s.gorgoth.gridscale.kvm.gigantic.io/api?timeout=32s\": net/http: TLS handshake timeout"
2. CoreDNS didn't come up within the timeout
3. Preflight checks failed
[run-tests : run-tests] [2020-11-19T14:08:40Z] Running cncf
[run-tests : run-tests] time="2020-11-19T14:08:40Z" level=error msg="Preflight checks failed"
[run-tests : run-tests] time="2020-11-19T14:08:40Z" level=error msg="namespace already exists"
I tried running cncf against the old release and it failed because coredns took 1 hour to come up (failure 2). The only app in the TC namespace was the chart-operator; the chart was installed within the TC but the pod itself was missing.
I'm also assuming that failures 1 and 3 have something to do with rfjh2 being updated at the same time. I ran cncf 3 times manually and all runs were green (except that coredns takes a long time to come up).
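As a rough aid for triaging runs like the ones above, the plugin table can be parsed to list which plugins did not pass. This is a minimal sketch in Python, assuming the four-column layout shown in the log (PLUGIN, STATUS, RESULT, COUNT); the function names are hypothetical and not part of sonobuoy.

```python
def parse_sonobuoy_status(text):
    """Parse the plugin table from captured status output into a list of dicts.

    Assumes a header line starting with PLUGIN STATUS RESULT COUNT, as in the
    log above. Lines that do not look like table rows are ignored.
    """
    rows = []
    header_seen = False
    for line in text.splitlines():
        fields = line.split()
        if fields[:4] == ["PLUGIN", "STATUS", "RESULT", "COUNT"]:
            header_seen = True
            continue
        # Skip everything before the header and any non-row lines after it.
        if not header_seen or len(fields) < 4 or not fields[3].isdigit():
            continue
        rows.append({
            "plugin": fields[0],
            "status": fields[1],
            "result": fields[2],
            "count": int(fields[3]),
        })
    return rows


def failed_plugins(rows):
    """Return the names of plugins whose result is anything other than 'passed'."""
    return [row["plugin"] for row in rows if row["result"] != "passed"]
```

Feeding it the output from failure 1 would flag only the e2e plugin; the timestamped lines after the table are skipped because they do not end in a numeric count.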
I have a PR to fix the slow start of k8s-addons (which installs CoreDNS) here: https://github.com/giantswarm/k8scloudconfig/pull/832
I'm testing it today.
Just for trackability: the problem with coredns coming up is that chart-operator cannot get deployed because of:
Events:
  Type     Reason        Age                 From                   Message
  ----     ------        ----                ----                   -------
  Warning  FailedCreate  96s (x19 over 23m)  replicaset-controller  Error creating: pods "chart-operator-cd7957c78-" is forbidden: no PriorityClass with name giantswarm-critical was found
This eventually gets created after ~40 minutes and then everything is happy. It also happens for release v12.3.2, so it's not a regression of the new release.
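When scanning many events for this failure, the missing class name can be pulled out of the message with a regular expression. A small sketch, assuming the message wording stays as in the event above; `missing_priority_class` is a hypothetical helper, not part of any Kubernetes tooling.

```python
import re

# Pattern taken from the event message quoted above; the wording is an
# assumption and may differ between Kubernetes versions.
_MISSING_PRIORITY_CLASS = re.compile(r"no PriorityClass with name (\S+) was found")


def missing_priority_class(event_message):
    """Return the PriorityClass name an event reports as missing, or None."""
    match = _MISSING_PRIORITY_CLASS.search(event_message)
    return match.group(1) if match else None
```

Applied to the FailedCreate message above, this would return `giantswarm-critical`; for unrelated events it returns `None`.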
Interesting - do we know where the prio class is coming from? IIRC it was in k8scloudconfig
Yesterday I managed to follow it through k8scloudconfig -> CLOUD_CONFIG_PATH in k8s-kvm -> qemu_node_setup -> qemu. I didn't see where it actually gets created though; will spend some time on it later.
The priority class is created in k8s-addons which is fixed in my k8scloudconfig PR. I'll merge it and update kvm-operator today.
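To measure how long the priority class takes to appear on a fresh cluster (before and after the k8scloudconfig fix), one option is to poll for it. The sketch below assumes `kubectl` is on the PATH and configured for the tenant cluster; `priority_class_exists` and `wait_until` are hypothetical helper names, and the timeout and interval are arbitrary choices.

```python
import subprocess
import time


def priority_class_exists(name):
    """Return True if `kubectl get priorityclass <name>` exits successfully."""
    result = subprocess.run(
        ["kubectl", "get", "priorityclass", name],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_until(check, timeout_s, interval_s=10.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True or timeout_s has elapsed.

    Returns the elapsed seconds on success, or None on timeout. The clock and
    sleep functions are injectable so the loop can be tested without waiting.
    """
    start = clock()
    while True:
        if check():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)
```

A call such as `wait_until(lambda: priority_class_exists("giantswarm-critical"), timeout_s=3600)` would return roughly how many seconds passed before the class showed up, which makes the ~40 minute delay easy to compare between releases.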
Kubernetes v1.18.x
https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-XXX.md
Provide a new release with the kubernetes version v1.18.x
k8s-addons (e.g. https://github.com/giantswarm/k8scloudconfig/blob/4695b7d6eb35eae234d0ce0c0c09f4b412525a68/v_4_7_0/files/conf/k8s-addons#L5)
Check migration recommendations from kubernetes and decide what we need to document for the customer and what we should migrate automatically for them
Run e2e and conformance tests
Check Core Components
Test migration (both cluster functionality itself and workloads)
Write summary for the release and update docs