rancher / dashboard

The Rancher UI
https://rancher.com
Apache License 2.0

Rancher 2.7.5 UI - Incorrect State Restricting Upgradability Of Downstream RKE2 + local K3S Versions #10032

Open nathanielnderson opened 1 year ago

nathanielnderson commented 1 year ago

Setup

Describe the bug Rancher UI is unable to display available RKE2 and K3S upgrade versions.

To Reproduce I have 2 separate Rancher clusters. Both are set up with the following configuration:

- SuSE 15.5 across all nodes
- K3S cluster v1.26.7 running Rancher v2.7.5
- Downstream RKE2 cluster v1.26.7

Both downstream RKE2 clusters were provisioned with the same API calls.

Open the cluster configuration page for either the K3S or RKE2 cluster in Rancher -> try to upgrade the minor version from 1.26.7 to 1.26.8.

Result The page does not show any newer Kubernetes version as available for either RKE2 or K3S; it only shows 1.26.7 as the current version.

The RKE2 cluster also does not display the Networking tab.

Expected Result v1.26.8 should be displayed as available for upgrade for both K3S and RKE2.

Screenshots Functional Cluster with the same build: image

Cluster with no upgrades available: image

Additional context I've already confirmed both Rancher images are the same. Non-functional cluster:

kubectl get pod --namespace=cattle-system rancher-7769775dfb-77f8z -o json | jq '.status.containerStatuses[] | { "image": .image, "imageID": .imageID }'
{
  "image": "docker.io/rancher/rancher:v2.7.5",
  "imageID": "docker.io/rancher/rancher@sha256:5ba20e4e51a484f107f3f270fa52c5e609cad0692dd00a26169cc3541b1f3788"
}

Functional Cluster:

kubectl get pod --namespace=cattle-system rancher-7769775dfb-f5m5s -o json | jq '.status.containerStatuses[] | { "image": .image, "imageID": .imageID }'
{
  "image": "docker.io/rancher/rancher:v2.7.5",
  "imageID": "docker.io/rancher/rancher@sha256:5ba20e4e51a484f107f3f270fa52c5e609cad0692dd00a26169cc3541b1f3788"
}
richard-cox commented 1 year ago

@nathanielnderson In your browser's dev tools network tab, can you confirm that the response to https://<rancher domain>/v1-k3s-release/release is the same for both Rancher instances?
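
If it's easier than digging through the network tab, something along these lines should return the same payload from the CLI (a rough sketch, assuming an admin API token; substitute your own domain and token):

# Fetch the release metadata the UI reads -- repeat against both Rancher instances
curl -sk -H "Authorization: Bearer <api-token>" \
  "https://<rancher domain>/v1-k3s-release/release" | jq .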

nathanielnderson commented 1 year ago

@richard-cox I opened up the network tab to take a look; I'm not sure exactly what I need to be looking for. I found the release response.

Working Cluster response:

HTTP/2 200 
cache-control: no-cache, no-store, must-revalidate
content-encoding: gzip
content-type: application/json
date: Mon, 13 Nov 2023 18:50:54 GMT
expires: Wed 24 Feb 1982 18:42:00 GMT
x-api-cattle-auth: true
x-api-schemas: https://rancher.{redacted}.s365.us/v1-k3s-release/schemas
x-content-type-options: nosniff
content-length: 1677
X-Firefox-Spdy: h2

Non-working cluster response:

HTTP/2 200 
cache-control: no-cache, no-store, must-revalidate
content-encoding: gzip
content-type: application/json
date: Mon, 13 Nov 2023 18:50:52 GMT
expires: Wed 24 Feb 1982 18:42:00 GMT
x-api-cattle-auth: true
x-api-schemas: https://rancher.{redacted}.s365.us/v1-k3s-release/schemas
x-content-type-options: nosniff
content-length: 1326
X-Firefox-Spdy: h2
richard-cox commented 1 year ago

It's the data that's returned that we need to compare. Looking again at your screenshots, I got the endpoint wrong; it'll be /v1-rke2-release/releases instead of /v1-k3s-release/releases.

For example, it'll return the following data. We need to confirm whether that data is the same between the two instances:

image
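
If you'd rather compare the lists directly than eyeball screenshots, something like this works (a sketch, assuming both responses have been saved locally and that the versions sit under .data[].version as in the screenshot above -- adjust the jq path to match the actual response):

# working.json / broken.json are hypothetical file names for the saved /v1-rke2-release/releases responses
diff <(jq -r '.data[].version' working.json | sort -V) <(jq -r '.data[].version' broken.json | sort -V)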

nathanielnderson commented 11 months ago

I upgraded my Rancher version and this resolved the issue I was talking about here.

However! I just noticed that the local Rancher cluster now has a similar issue. In both environments I am on the same K3S version for the Rancher cluster and the same downstream RKE2 cluster version.

On one cluster, however, I see that the K3S versions available to upgrade to do not match what the other cluster offers.

K3S Rancher Cluster on v1.26.7 - test environment: image

K3S Rancher Cluster on v1.26.7 - prod environment: image

Is there something here not communicating properly and hence not pulling down all available versions within the test environment?
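
For reference, I believe the version lists come from the KDM metadata that each Rancher syncs periodically; if it helps, this is roughly how I'd check where each instance pulls that metadata from and how often (a sketch, assuming the setting is still named rke-metadata-config on 2.7.x):

# Run against each Rancher's local cluster; the value is a small JSON blob with the KDM url and refresh interval
kubectl get settings.management.cattle.io rke-metadata-config -o jsonpath='{.value}{"\n"}'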

nathanielnderson commented 11 months ago

I was able to get this resolved, but I still have some remaining questions. The ultimate fix wound up being manually setting the proper resolvers in CoreDNS.

Strangely enough, the hosts had working resolvers, but CoreDNS refused to use /etc/resolv.conf on the upstream hosts for some reason. The nodes were all using the same resolvers across the cluster, and the files were not symlinked or anything of that nature, so CoreDNS really should have been able to pick them up. Even after removing /var/lib/rancher/k3s/agent/etc/resolv.conf and purging the CoreDNS pod/container, it would still get hung up trying to fall back to 8.8.8.8 or attempt to read from the resolv.conf under /var/lib.

CoreDNS was set to normal network mode with the DNS Policy set to Default, which, from what I read of the wiki, should look to the upstream hosts' /etc/resolv.conf. This works on the prod cluster with no problem, but for whatever reason this test environment refuses to recognize it.
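
For reference, this is roughly how I was checking the forwarding setup (a sketch, assuming the stock K3S CoreDNS deployment in kube-system):

# Show the Corefile; the "forward . /etc/resolv.conf" line is what should pick up the host resolvers when dnsPolicy is Default
kubectl -n kube-system get configmap coredns -o jsonpath='{.data.Corefile}'
# Confirm the DNS policy actually set on the CoreDNS pods
kubectl -n kube-system get pods -l k8s-app=kube-dns -o jsonpath='{.items[*].spec.dnsPolicy}{"\n"}'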

Once I got that sorted out by setting the nameservers manually and refreshing the git repos, the higher RKE2/K3S versions populated as they should, so that part is fixed. The question still remains as to why CoreDNS isn't functioning as intended.
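
For anyone hitting the same thing, the manual fix boiled down to replacing the resolv.conf fallback in the Corefile with explicit resolvers. A rough sketch of one way to do that (placeholder nameserver IPs; assumes the stock K3S CoreDNS ConfigMap):

# Change the forward line in the Corefile from "forward . /etc/resolv.conf" to explicit resolvers, e.g.:
#   forward . 10.0.0.53 10.0.0.54    # placeholder IPs -- use your own nameservers
kubectl -n kube-system edit configmap coredns
# Restart CoreDNS so the change takes effect
kubectl -n kube-system rollout restart deployment coredns

Worth noting that K3S manages its bundled CoreDNS manifest itself, so an edit like this may get reverted when the server restarts and might need to be made persistent.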