Closed kkaempf closed 2 months ago
Possible duplicate of #852, as per logs shared there.
Crash reproduced with Fleet 0.8.1
, where a small patch could fix it:
--- a/internal/cmd/controller/controllers/image/image.go
+++ b/internal/cmd/controller/controllers/image/image.go
@@ -498,5 +498,10 @@ func semverLatest(r string, versions []string) (string, error) {
}
}
}
+
+ if latestVersion == nil {
+ return "", fmt.Errorf("no available version matching %s", r)
+ }
+
return latestVersion.Original(), nil
}
This also needs:
master
.Trivial fix (as anticipated 😉 ), thanks @weyfonk ! Added to v2.8.3 milestone
I can confirm it also panics in master
and v0.9.0
, due to the same reason (accesses a nil
latestVersion)
Fix proposed works fine. (lastestVersion is already "" before the call to the function that panics, so returning "" seems correct)
This is happening when using:
semver:
range: "*"
in conjunction with pre-release versions.
As stated in semver docs:
When constraint checking is used for checks or validation it will follow a different set of rules that are common for ranges with tools like npm/js and Rust/Cargo. This includes considering prereleases to be invalid if the ranges does not include one. If you want to have it include pre-releases a simple solution is to include -0 in your range.
If we change
semver:
range: "*"
to
semver:
range: ">= 0.0.0-40"
it works fine.
Reason why is because as stated in the docs above, constraints in semver
consider pre-releases as invalid unless we specify a pre-release explicitly in the constraint (semver.range in fleet.yaml
).
I've also tried to write end to end tests using the zot
registry that we already have deployed in conjunction with uploading images with skopeo
but it didn't work because of the self-signed CA... Still investigating possible fixes.
Fixes panic issue when working with pre-release image tags and the "*" range.
Create a imagescan configuration using the "*" target and images with pre-release tags. (0.0.1-10, 0.0.2-20, etc) It should not crash the controller and, it should not update the image although we push a new one.
Additionally, you should see the message: no available version matching *
in the logs
Rancher v2.9-c9be13b09329bbee60a5f6419d500198f83c44d1-head
Fleet: 104.0.0+up0.10.0-rc.11
K3s: v1.27.10+k3s1
Tested Fleet controller does not crash when image tag does not include semserver range and check no panic logs are found** (Similar test as backported version here)
Repo: https://github.com/0xavi0/fleet-examples/
Branch: test-imagescans
Path: imagescans
➜ ~ k get pod -n cattle-fleet-system fleet-controller-7db5cc987b-55d5j
NAME READY STATUS RESTARTS AGE
fleet-controller-7db5cc987b-55d5j 1/1 Running 0 23m
→ k logs -n cattle-fleet-system fleet-controller-69975f66b7-gs2sp | grep "panic"
Defaulted container "fleet-controller" out of: fleet-controller, fleet-cleanup, fleet-agentmanagement
→ k logs -n cattle-fleet-system fleet-controller-69975f66b7-gs2sp | grep "no available version matching *" | head -1
Defaulted container "fleet-controller" out of: fleet-controller, fleet-cleanup, fleet-agentmanagement {"level":"error","ts":"2024-04-19T07:11:41Z","logger":"imagescan-tag-scanner","msg":"Failed get the digest","name":"imagescan-imagescan-29-imagescans-0","namespace":"fleet-local","gitrepo":"imagescan-29","latestImage":"","error":"no available version matching ","errorCauses":[{"error":"no available version matching "}],"stacktrace":"github.com/rancher/fleet/internal/cmd/controller/imagescan.(TagScanJob).updateImageTags\n\t/home/runner/work/fleet/fleet/internal/cmd/controller/imagescan/tagscan_job.go:147\ngithub.com/rancher/fleet/internal/cmd/controller/imagescan.(TagScanJob).Execute\n\t/home/runner/work/fleet/fleet/internal/cmd/controller/imagescan/tagscan_job.go:75\ngithub.com/reugn/go-quartz/quartz.executeWithRetries\n\t/home/runner/go/pkg/mod/github.com/reugn/go-quartz@v0.11.2/quartz/scheduler.go:489\ngithub.com/reugn/go-quartz/quartz.(*StdScheduler).executeAndReschedule.func1\n\t/home/runner/go/pkg/mod/github.com/reugn/go-quartz@v0.11.2/quartz/scheduler.go:482"}
So this part is ok and confirms the fix on the backend on this part.
---
### Outstanding issues
#### 1 . **Displayed UI message on 2.9 is different from 2.8**
While in `Rancher v2.8-head` the message displayed in UI is `ImageScan test-scan is not ready: no available version matching *` (expected after [backport fix](https://github.com/rancher/fleet/issues/2181#issuecomment-1971351033)), in `v2.9-c9be13b09329bbee60a5f6419d500198f83c44d1-head` the message that seems to select is the first one found: `Requires git secret for write access`
https://github.com/rancher/fleet/issues/2181#issuecomment-1999271787
This is so because there is a secret error message that occurs before the expected `ImageScan test-scan is not ready: no available version matching *` which is this one:
k logs -n cattle-fleet-system fleet-controller-69975f66b7-gs2sp | grep "requires git secret for write access" | head -1 Defaulted container "fleet-controller" out of: fleet-controller, fleet-cleanup, fleet-agentmanagement {"level":"error","ts":"2024-04-19T07:11:40Z","logger":"imagescan-clone","msg":"Cannot create temp dir to clone repo","gitrepo":"imagescan-29","namespace":"fleet-local","error":"requires git secret for write access","errorCauses":[{"error":"requires git secret for write access"}],"stacktrace":"github.com/rancher/fleet/internal/cmd/controller/imagescan.(GitCommitJob).cloneAndReplace\n\t/home/runner/work/fleet/fleet/internal/cmd/controller/imagescan/gitcommit_job.go:165\ngithub.com/rancher/fleet/internal/cmd/controller/imagescan.(GitCommitJob).Execute\n\t/home/runner/work/fleet/fleet/internal/cmd/controller/imagescan/gitcommit_job.go:88\ngithub.com/reugn/go-quartz/quartz.executeWithRetries\n\t/home/runner/go/pkg/mod/github.com/reugn/go-quartz@v0.11.2/quartz/scheduler.go:489\ngithub.com/reugn/go-quartz/quartz.(*StdScheduler).executeAndReschedule.func1\n\t/home/runner/go/pkg/mod/github.com/reugn/go-quartz@v0.11.2/quartz/scheduler.go:482"}
After the initial conversation offline, we thought it may be good to display all errors on UI but after thinking about it, not sure if we want to do so because if many happen it may clutter the UI. @kkaempf, please let me know what is best so I can open Rancher issue if needed.:
- Show all error messages
- Show last one
- Leave it as it is
#### 2 . **Too much time to show the error log on UI**
Here we can see almost 30 second difference in displaying error log between 2.8 and 2.9:
https://github.com/rancher/fleet/assets/37271841/9541c74d-5ff2-4068-bfdf-a6a9b381223c
While this is not a functional issue, it is a bad UX.
While checking this, it was observed that probably this example should not always require git secret for write access, which slows things down a bit, so we open this ticket separately: https://github.com/rancher/fleet/issues/2341. This may help here, but not sure if something else wants to be noticed/addressed in this regard.
That is more of a usability question (and I'm not a usability expert 😉)
@kwwii might have the right advise for a better UI. 🤞🏻
Hi @kwwii, thanks for the input.
What I am not sure is what error to be displayed in the UI is the one shown in 2.8 version after the backport (ImageScan test-scan is not ready: no available version matching *
) or the one in 2.9 (Requires git secret for write access
). As commented in the description this later error is present in both versions but in 2.8 it switches to the second added after the fix, whereas in 2.9 it just picks the first one. Which one should be present in UI?
This is not critical for closing this issue as the main functionality has been checked; if unsure or not critical it can be dismissed, but just wanted to mention it to try clarify what to expect.
I opened issue about the error displayed here: https://github.com/rancher/rancher/issues/45215 so this ticket can be closed as main validation is done.
SURE-7469
Issue description:
Fleet controller in Rancher upstream clusters crashes if the imagen tag does not include the semver range expected when using ImageScan in the fleet.yaml file.
Business impact:
Fleet goes into a crash and the GitRepos are not updated.
Troubleshooting steps:
From the customer: In artifact registry our latest image was built 20 hours ago and has the tag 0.0.0-52. The one deployed with the initial fleet.yaml file is 0.0.0-44. This fails to update.
Actual behavior:
Fleet controller in Rancher upstream clusters crashes if the imagen tag does not include the semver range expected
Expected behavior:
Fleet controller should not be crashed.
Files, logs, traces:
-GitRepo project-test2: project-test.yaml
-Gitjob log
-Fleet controller log
Additional notes:
As this comment remarks https://github.com/rancher/fleet/pull/413#issuecomment-880228630, it might lack understanding how to set up the parameter semver ranger in the fleet.yaml manifest.