giantswarm / roadmap

Giant Swarm Product Roadmap
https://github.com/orgs/giantswarm/projects/273
Apache License 2.0

KVM Release v13.0.0 with Kubernetes v1.18 #54

Closed cornelius-keller closed 3 years ago

cornelius-keller commented 4 years ago

Kubernetes v1.18.x

https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-XXX.md

Provide a new release with Kubernetes v1.18.x.

Check migration recommendations

Check the migration recommendations from Kubernetes and decide what we need to document for customers and what we should migrate automatically for them.

Run e2e and conformance tests

Check Core Components

Test migration (both cluster functionality itself and workloads)

Write summary for the release and update docs

stone-z commented 4 years ago

Began testing the release defined in this PR. While reviewing components for the release, I found two that were updated today and should be included:

stone-z commented 4 years ago

kube-state-metrics 1.2.0 and node-exporter 1.4.1 have been added to the release. Both are internal changes to the app packaging only.

brinker211 commented 4 years ago

Taking into KVM.

yulianedyalkova commented 3 years ago

There is an issue with kube-state-metrics that requires a manual delete of the deployment on upgrade. More context here.

Currently it's not possible to run cncf against our test installations because of the test clusters mentioned here: https://github.com/giantswarm/giantswarm/issues/14236

yulianedyalkova commented 3 years ago

So I tried running cncf 3 times in prow and got 3 different failures:

  1. A test failed, and then sonobuoy couldn't report what was happening:

    PLUGIN         STATUS     RESULT   COUNT
    e2e            failed     failed   1
    systemd-logs   complete   passed   4

    [2020-11-18T15:43:23Z] Results summary
    timeout waiting for results

    [2020-11-18T15:43:25Z] Sonobuoy logs
    time="2020-11-18T15:43:35Z" level=error msg="could not create sonobuoy client: couldn't get sonobuoy api helper: could not get api group resources: Get \"https://api.fgd7h.k8s.gorgoth.gridscale.kvm.gigantic.io/api?timeout=32s\": net/http: TLS handshake timeout"
  2. CoreDNS didn't come up within the timeout

  3. Preflight checks failed

    [run-tests : run-tests] [2020-11-19T14:08:40Z] Running cncf
    [run-tests : run-tests] time="2020-11-19T14:08:40Z" level=error msg="Preflight checks failed"
    [run-tests : run-tests] time="2020-11-19T14:08:40Z" level=error msg="namespace already exists"

I tried running cncf against the old release, and it failed because CoreDNS took 1 hour to come up (failure 2 above). The only app in the TC namespace was chart-operator; the chart was installed within the TC, but the pod itself was missing.

I'm also assuming that failures 1 and 3 have something to do with rfjh2 being updated at the same time. I ran cncf 3 times manually and all runs were green (apart from CoreDNS taking a long time to come up).
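To triage repeated runs like the three above more quickly, the plugin status table can be checked mechanically. The following is only a minimal sketch: it assumes the four-column `PLUGIN STATUS RESULT COUNT` layout seen in the pasted output, and `failed_plugins` is a hypothetical helper, not part of sonobuoy.

```python
def failed_plugins(status_text):
    """Return the names of plugins whose RESULT column is not 'passed'.

    Assumes whitespace-separated rows with exactly four columns
    (PLUGIN, STATUS, RESULT, COUNT), as in the output pasted above.
    """
    failed = []
    for line in status_text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and anything that is not a table row.
        if len(fields) != 4 or fields[0] == "PLUGIN":
            continue
        plugin, _status, result, _count = fields
        if result != "passed":
            failed.append(plugin)
    return failed


# Sample copied from the first failed run.
SAMPLE = """\
PLUGIN         STATUS     RESULT   COUNT
e2e            failed     failed   1
systemd-logs   complete   passed   4
"""

print(failed_plugins(SAMPLE))  # → ['e2e']
```

A check like this only says which plugin failed; the cause still has to come from the sonobuoy logs, which in the first run were unavailable because of the TLS handshake timeout.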

tfussell commented 3 years ago

I have a PR to fix the slow start of k8s-addons (which installs CoreDNS), and I'm testing it today: https://github.com/giantswarm/k8scloudconfig/pull/832

yulianedyalkova commented 3 years ago

Just for trackability: the problem with CoreDNS coming up is that chart-operator cannot be deployed because of:

Events:
  Type     Reason        Age                 From                   Message
  ----     ------        ----                ----                   -------
  Warning  FailedCreate  96s (x19 over 23m)  replicaset-controller  Error creating: pods "chart-operator-cd7957c78-" is forbidden: no PriorityClass with name giantswarm-critical was found

This eventually gets created after ~40 minutes and then everything is happy. It also happens for release v12.3.2, so it's not a regression in the new release.
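For spotting this condition in collected event output, a small pattern match on the message text may be enough. This is a sketch under assumptions: the regular expression is derived only from the single `FailedCreate` message quoted above, and `missing_priority_class` is a hypothetical helper name.

```python
import re

# Pattern taken from the event message above; other Kubernetes versions
# may word this message differently.
MISSING_PRIORITY_CLASS = re.compile(r"no PriorityClass with name (\S+) was found")


def missing_priority_class(message):
    """Return the missing PriorityClass name from an event message, or None."""
    match = MISSING_PRIORITY_CLASS.search(message)
    return match.group(1) if match else None


EVENT = (
    'Error creating: pods "chart-operator-cd7957c78-" is forbidden: '
    "no PriorityClass with name giantswarm-critical was found"
)

print(missing_priority_class(EVENT))  # → giantswarm-critical
```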

MarcelMue commented 3 years ago

Interesting. Do we know where the priority class comes from? IIRC it was in k8scloudconfig.

yulianedyalkova commented 3 years ago

Yesterday I managed to follow it through k8scloudconfig -> CLOUD_CONFIG_PATH in k8s-kvm -> qemu_node_setup -> qemu. I didn't see where it actually gets created, though; I'll spend some time on it later.

tfussell commented 3 years ago

The priority class is created in k8s-addons, which is fixed in my k8scloudconfig PR. I'll merge it and update kvm-operator today.