rancher / fleet

Deploy workloads from Git to large fleets of Kubernetes clusters
https://fleet.rancher.io/
Apache License 2.0
1.48k stars 217 forks source link

[SURE-7469] Imagescan breaks fleet controller #2096

Closed kkaempf closed 2 months ago

kkaempf commented 6 months ago

SURE-7469

Issue description:

Fleet controller in Rancher upstream clusters crashes if the imagen tag does not include the semver range expected when using ImageScan in the fleet.yaml file.

="2024-01-17T18:44:44Z" level=debug msg="DesiredSet - No change(2) gitjob.cattle.io/v1, Kind=GitJob fleet-default/project-test2 for gitjobs fleet-default/project-test2"
E0117 18:44:44.351563       1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 1491 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x22b04a0?, 0x3e240f0})
    /go/pkg/mod/k8s.io/apimachinery@v0.25.12/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xfffffffe?})
    /go/pkg/mod/k8s.io/apimachinery@v0.25.12/pkg/util/runtime/runtime.go:49 +0x75
panic({0x22b04a0, 0x3e240f0})
    /usr/lib64/go/1.20/src/runtime/panic.go:884 +0x213
github.com/Masterminds/semver/v3.(*Version).Original(...)
    /go/pkg/mod/github.com/!masterminds/semver/v3@v3.2.1/version.go:259
github.com/rancher/fleet/internal/cmd/controller/controllers/image.semverLatest({0x3d70a50?, 0xc0026ff260?}, {0xc0058ff200, 0x2f, 0x0?})
    /go/src/github.com/rancher/fleet/internal/cmd/controller/controllers/image/image.go:501 +0x105
github.com/rancher/fleet/internal/cmd/controller/controllers/image.latestTag({0xc0075afc90?, 0xc0075afca0?}, {0xc0058ff200?, 0xc009d1a01d?, 0x44?}) 

Business impact:

Fleet goes into a crash and the GitRepos are not updated. 

Troubleshooting steps:

From the customer:  In artifact registry our latest image was built 20 hours ago and has the tag 0.0.0-52. The one deployed with the initial fleet.yaml file is 0.0.0-44. This fails to update.

Actual behavior:

Fleet controller in Rancher upstream clusters crashes if the imagen tag does not include the semver range expected

Expected behavior:

Fleet controller should not be crashed.

Files, logs, traces:

-GitRepo project-test2: project-test.yaml

-Gitjob log

-Fleet controller log

Additional notes:

As this comment remarks https://github.com/rancher/fleet/pull/413#issuecomment-880228630,  it might lack understanding how to set up the parameter semver ranger in the fleet.yaml manifest.

weyfonk commented 6 months ago

Possible duplicate of #852, as per logs shared there.

weyfonk commented 6 months ago

Crash reproduced with Fleet 0.8.1, where a small patch could fix it:

--- a/internal/cmd/controller/controllers/image/image.go
+++ b/internal/cmd/controller/controllers/image/image.go
@@ -498,5 +498,10 @@ func semverLatest(r string, versions []string) (string, error) {
                        }
                }
        }
+
+       if latestVersion == nil {
+               return "", fmt.Errorf("no available version matching %s", r)
+       }
+
        return latestVersion.Original(), nil
 }

This also needs:

kkaempf commented 6 months ago

Trivial fix (as anticipated 😉 ), thanks @weyfonk ! Added to v2.8.3 milestone

0xavi0 commented 5 months ago

I can confirm it also panics in master and v0.9.0, due to the same reason (accesses a nil latestVersion)

Fix proposed works fine. (lastestVersion is already "" before the call to the function that panics, so returning "" seems correct)

This is happening when using:

semver:
      range: "*"

in conjunction with pre-release versions.

As stated in semver docs:

When constraint checking is used for checks or validation it will follow a different set of rules that are common for ranges with tools like npm/js and Rust/Cargo. This includes considering prereleases to be invalid if the ranges does not include one. If you want to have it include pre-releases a simple solution is to include -0 in your range.

If we change

semver:
      range: "*"

to

semver:
      range: ">= 0.0.0-40"

it works fine. Reason why is because as stated in the docs above, constraints in semver consider pre-releases as invalid unless we specify a pre-release explicitly in the constraint (semver.range in fleet.yaml).

I've also tried to write end to end tests using the zot registry that we already have deployed in conjunction with uploading images with skopeo but it didn't work because of the self-signed CA... Still investigating possible fixes.

0xavi0 commented 4 months ago

QA Template

Solution

Fixes panic issue when working with pre-release image tags and the "*" range.

Testing

Create a imagescan configuration using the "*" target and images with pre-release tags. (0.0.1-10, 0.0.2-20, etc) It should not crash the controller and, it should not update the image although we push a new one.

Additionally, you should see the message: no available version matching * in the logs

Additional info

mmartin24 commented 3 months ago

QA report

Testing environment

Rancher v2.9-c9be13b09329bbee60a5f6419d500198f83c44d1-head Fleet: 104.0.0+up0.10.0-rc.11 K3s: v1.27.10+k3s1


Tested Fleet controller does not crash when image tag does not include semserver range and check no panic logs are found** (Similar test as backported version here)

Defaulted container "fleet-controller" out of: fleet-controller, fleet-cleanup, fleet-agentmanagement {"level":"error","ts":"2024-04-19T07:11:41Z","logger":"imagescan-tag-scanner","msg":"Failed get the digest","name":"imagescan-imagescan-29-imagescans-0","namespace":"fleet-local","gitrepo":"imagescan-29","latestImage":"","error":"no available version matching ","errorCauses":[{"error":"no available version matching "}],"stacktrace":"github.com/rancher/fleet/internal/cmd/controller/imagescan.(TagScanJob).updateImageTags\n\t/home/runner/work/fleet/fleet/internal/cmd/controller/imagescan/tagscan_job.go:147\ngithub.com/rancher/fleet/internal/cmd/controller/imagescan.(TagScanJob).Execute\n\t/home/runner/work/fleet/fleet/internal/cmd/controller/imagescan/tagscan_job.go:75\ngithub.com/reugn/go-quartz/quartz.executeWithRetries\n\t/home/runner/go/pkg/mod/github.com/reugn/go-quartz@v0.11.2/quartz/scheduler.go:489\ngithub.com/reugn/go-quartz/quartz.(*StdScheduler).executeAndReschedule.func1\n\t/home/runner/go/pkg/mod/github.com/reugn/go-quartz@v0.11.2/quartz/scheduler.go:482"}


So this part is ok and confirms the fix on the backend on this part.

---
### Outstanding issues

#### 1 . **Displayed UI message on 2.9 is different from 2.8**

While in `Rancher v2.8-head` the message displayed in UI is `ImageScan test-scan is not ready: no available version matching *` (expected after [backport fix](https://github.com/rancher/fleet/issues/2181#issuecomment-1971351033)), in `v2.9-c9be13b09329bbee60a5f6419d500198f83c44d1-head` the message that seems to select is the first one found: `Requires git secret for write access`
https://github.com/rancher/fleet/issues/2181#issuecomment-1999271787

This is so because there is a secret error message that occurs before the expected `ImageScan test-scan is not ready: no available version matching *` which is this one:

k logs -n cattle-fleet-system fleet-controller-69975f66b7-gs2sp | grep "requires git secret for write access" | head -1 Defaulted container "fleet-controller" out of: fleet-controller, fleet-cleanup, fleet-agentmanagement {"level":"error","ts":"2024-04-19T07:11:40Z","logger":"imagescan-clone","msg":"Cannot create temp dir to clone repo","gitrepo":"imagescan-29","namespace":"fleet-local","error":"requires git secret for write access","errorCauses":[{"error":"requires git secret for write access"}],"stacktrace":"github.com/rancher/fleet/internal/cmd/controller/imagescan.(GitCommitJob).cloneAndReplace\n\t/home/runner/work/fleet/fleet/internal/cmd/controller/imagescan/gitcommit_job.go:165\ngithub.com/rancher/fleet/internal/cmd/controller/imagescan.(GitCommitJob).Execute\n\t/home/runner/work/fleet/fleet/internal/cmd/controller/imagescan/gitcommit_job.go:88\ngithub.com/reugn/go-quartz/quartz.executeWithRetries\n\t/home/runner/go/pkg/mod/github.com/reugn/go-quartz@v0.11.2/quartz/scheduler.go:489\ngithub.com/reugn/go-quartz/quartz.(*StdScheduler).executeAndReschedule.func1\n\t/home/runner/go/pkg/mod/github.com/reugn/go-quartz@v0.11.2/quartz/scheduler.go:482"}



After the initial conversation offline, we thought it may be good to display all errors on UI but after thinking about it, not sure if we want to do so because if many happen it may clutter the UI. @kkaempf, please let me know what is best so I can open Rancher issue if needed.:
   - Show all error messages
   - Show last one
   - Leave it as it is

#### 2 . **Too much time to show the error log on UI**
Here we can see almost 30 second difference in displaying error log between 2.8 and 2.9:

https://github.com/rancher/fleet/assets/37271841/9541c74d-5ff2-4068-bfdf-a6a9b381223c

While this is not a functional issue, it is a bad UX.
While checking this, it was observed that probably this example should not always require git secret for write access, which slows things down a bit, so we open this ticket separately: https://github.com/rancher/fleet/issues/2341. This may help here, but not sure if something else wants to be noticed/addressed in this regard.
kkaempf commented 3 months ago

That is more of a usability question (and I'm not a usability expert 😉)

@kwwii might have the right advise for a better UI. 🤞🏻

kwwii commented 3 months ago
mmartin24 commented 3 months ago

Hi @kwwii, thanks for the input.

What I am not sure is what error to be displayed in the UI is the one shown in 2.8 version after the backport (ImageScan test-scan is not ready: no available version matching *) or the one in 2.9 (Requires git secret for write access). As commented in the description this later error is present in both versions but in 2.8 it switches to the second added after the fix, whereas in 2.9 it just picks the first one. Which one should be present in UI?

This is not critical for closing this issue as the main functionality has been checked; if unsure or not critical it can be dismissed, but just wanted to mention it to try clarify what to expect.

mmartin24 commented 2 months ago

I opened issue about the error displayed here: https://github.com/rancher/rancher/issues/45215 so this ticket can be closed as main validation is done.