Closed: jenshu closed this issue 1 month ago
@jenshu Any update on this issue? Without a reproducer this is not actionable. Did "the customer" ever come back to you? If not, I'm gonna close this ticket (and can create a new one, or re-open if this issue surfaces again).
@DuncanDoyle - responded offline with context
Closing, as there is no reproducer. Can re-open if this surfaces again.
Steps to reproduce:

1. Configure UDS watch labels in the Settings:

```yaml
spec:
  ...
  discovery:
    ...
    udsOptions:
      watchLabels:
        expose_api: "true"
```

2. Delete the discovered Upstreams:

```
kubectl -n gloo-system delete upstream -l discovered_by=kubernetesplugin
```

3. Create a Service labeled `expose_api: "true"` - Gloo will/should create an Upstream in the gloo-system namespace for it.
4. Remove the `watchLabels` setting above - the errors should cease.

See https://github.com/solo-io/gloo/issues/9968#issuecomment-2329006317 for inspiration, as the underlying problems are the same (different code paths however).
This doesn't seem related to `persistProxySpec`. It's happening with upstreams, and with `persistProxySpec=false`.
Here are some steps to reproduce from a clean cluster:
Create a cluster:

```
kind create cluster
```
Install GG with UDS watch labels:

```
helm install -n gloo-system gloo gloo/gloo --create-namespace --version 1.18.0-beta21 -f - <<EOF
discovery:
  udsOptions:
    watchLabels:
      aaa: bbb
EOF
```
Create a service with the UDS watch labels:

```
kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: example-svc
  namespace: default
  labels:
    aaa: bbb
spec:
  ports:
  - name: http
    port: 8000
    targetPort: 8080
  selector:
    ccc: ddd
EOF
```
Verify that the Upstream was created, with the expected spec:

```
kubectl get us -n gloo-system default-example-svc-8000 -oyaml
```
Check the discovery logs:

```
kubectl logs -f -n gloo-system deploy/discovery
```
At this point the Upstream got created correctly, but it seems like something then tried to update it, resulting in a bunch of these errors (note: these errors may not happen every time on create, but they do consistently happen later when we try to update the service, below):
{"level":"error","ts":"2024-09-12T18:56:44.056Z","logger":"uds.v1.event_loop.uds","caller":"discovery/discovery.go:150","msg":"failed reconciling upstreams","version":"1.18.0-beta21","discovered_by":"kubernetesplugin","upstreams":1,"error":"reconciling resource default-example-svc-8000: updating kube resource default-example-svc-8000: (want 792): upstreams.gloo.solo.io \"default-example-svc-8000\" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update","errorVerbose":"upstreams.gloo.solo.io \"default-example-svc-8000\" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update\nupdating kube resource default-example-svc-8000: (want 792)\ngithub.com/solo-io/solo-kit/pkg/errors.Wrapf\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/errors/errors.go:15\ngithub.com/solo-io/solo-kit/pkg/api/v1/clients/kube.(*ResourceClient).Write\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/api/v1/clients/kube/resource_client.go:232\ngithub.com/solo-io/solo-kit/pkg/api/v1/reconcile.writeDesiredResource\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/api/v1/reconcile/reconciler.go:105\ngithub.com/solo-io/solo-kit/pkg/api/v1/reconcile.attemptSyncResource\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/api/v1/reconcile/reconciler.go:72\ngithub.com/solo-io/solo-kit/pkg/api/v1/reconcile.(*reconciler).syncResource.func1\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/api/v1/reconcile/reconciler.go:66\nk8s.io/client-go/util/retry.OnError.func1\n\t/go/pkg/mod/k8s.io/client-go@v0.29.2/util/retry/util.go:51\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection\n\t/go/pkg/mod/k8s.io/apimachinery@v0.29.2/pkg/util/wait/wait.go:145\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\t/go/pkg/mod/k8s.io/apimachinery@v0.29.2/pkg/util/wait/backoff.go:461\nk8s.io/client-go/util/retry.OnError\n\t/go/pkg/mod/k8s.io/client-go@v0.29.2/util/retry/util.go:50\ngithub.com/solo-io/solo-kit/pkg/er
rors.RetryOnConflict\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/errors/errors.go:117\ngithub.com/solo-io/solo-kit/pkg/api/v1/reconcile.(*reconciler).syncResource\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/api/v1/reconcile/reconciler.go:64\ngithub.com/solo-io/solo-kit/pkg/api/v1/reconcile.(*reconciler).Reconcile\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/api/v1/reconcile/reconciler.go:44\ngithub.com/solo-io/gloo/projects/gloo/pkg/api/v1.(*upstreamReconciler).Reconcile\n\t/workspace/gloo/projects/gloo/pkg/api/v1/upstream_reconciler.sk.go:46\ngithub.com/solo-io/gloo/projects/gloo/pkg/discovery.(*UpstreamDiscovery).Resync\n\t/workspace/gloo/projects/gloo/pkg/discovery/discovery.go:146\ngithub.com/solo-io/gloo/projects/gloo/pkg/discovery.(*UpstreamDiscovery).StartUds.func1\n\t/workspace/gloo/projects/gloo/pkg/discovery/discovery.go:117\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695\nreconciling resource default-example-svc-8000\ngithub.com/solo-io/solo-kit/pkg/errors.Wrapf\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/errors/errors.go:15\ngithub.com/solo-io/solo-kit/pkg/api/v1/reconcile.(*reconciler).Reconcile\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/api/v1/reconcile/reconciler.go:45\ngithub.com/solo-io/gloo/projects/gloo/pkg/api/v1.(*upstreamReconciler).Reconcile\n\t/workspace/gloo/projects/gloo/pkg/api/v1/upstream_reconciler.sk.go:46\ngithub.com/solo-io/gloo/projects/gloo/pkg/discovery.(*UpstreamDiscovery).Resync\n\t/workspace/gloo/projects/gloo/pkg/discovery/discovery.go:146\ngithub.com/solo-io/gloo/projects/gloo/pkg/discovery.(*UpstreamDiscovery).StartUds.func1\n\t/workspace/gloo/projects/gloo/pkg/discovery/discovery.go:117\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695","stacktrace":"github.com/solo-io/gloo/projects/gloo/pkg/discovery.(*UpstreamDiscovery).Resync\n\t/workspace/gloo/projects/gloo/pkg/discovery/discovery.go:150\ngithub.com/solo-io/gloo/projects/gloo/pkg/discovery.(
*UpstreamDiscovery).StartUds.func1\n\t/workspace/gloo/projects/gloo/pkg/discovery/discovery.go:117"}
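The failure mode in that log can be sketched in a few lines of Go (a minimal illustration, not Gloo's actual code): the reconciler never saw the original Upstream, so the "desired" object it writes carries no `resourceVersion`, and the API server rejects any Update whose `resourceVersion` is empty.

```go
package main

import "fmt"

// resource is a toy stand-in for a Kubernetes custom resource.
type resource struct {
	name            string
	resourceVersion string
}

// update mimics the apiserver's validation of Update calls: an empty
// resourceVersion is rejected with the same error seen in the logs.
func update(store map[string]resource, desired resource) error {
	if desired.resourceVersion == "" {
		return fmt.Errorf("%q is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update", desired.name)
	}
	store[desired.name] = desired
	return nil
}

func main() {
	store := map[string]resource{
		"default-example-svc-8000": {name: "default-example-svc-8000", resourceVersion: "792"},
	}
	// The reconciler's List() filtered out the original, so the desired
	// object has no resourceVersion to send back with the update.
	desired := resource{name: "default-example-svc-8000"}
	if err := update(store, desired); err != nil {
		fmt.Println("error:", err)
	}
}
```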
Make a change to the service (e.g. add more labels):
```
kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: example-svc
  namespace: default
  labels:
    aaa: bbb
    xxx: yyy
spec:
  ports:
  - name: http
    port: 8000
    targetPort: 8080
  selector:
    ccc: ddd
EOF
```
More resourceVersion errors are seen in the logs, and the upstream does not pick up the changes (it still shows the old labels in `discoveryMetadata`). This is a problem because further updates to the service are not reflected in the upstream.
Note: making any change to the Service `spec.selector` does not cause the Upstream to get updated, regardless of whether UDS watch labels are being used (and no errors appear in the logs). It looks like there is an existing issue for that already: https://github.com/solo-io/gloo/issues/8227
@DuncanDoyle which versions does it make sense to target this for?
I believe I've gotten to the root cause of this.

In a nutshell:
- Labels from the Service end up in the Upstream's `DiscoveryMetadata`, not in the Upstream's own `metadata.labels`.
- When `watchLabels` are set, Discovery includes those labels in the selector passed to `Reconcile()`.
- When `Reconcile()` attempts to get the original resource(s) via a `List()` call here, none are returned, because the generated Upstream(s) do not have the labels specified in the selector.

The key question from here is: should Upstreams created by discovery contain the `watchLabels`? Or perhaps all labels from the Service?
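To make the selector mismatch concrete, here is a small sketch of equality-based label-selector matching (assumed semantics, not Gloo's code): a `List()` filtered by the `watchLabels` misses the generated Upstreams entirely, while selecting on the `discovered_by` label finds them.

```go
package main

import "fmt"

// matches reports whether every key/value in selector is present on objLabels,
// mirroring Kubernetes equality-based label selectors.
func matches(objLabels, selector map[string]string) bool {
	for k, v := range selector {
		if objLabels[k] != v {
			return false
		}
	}
	return true
}

// list simulates a filtered List() call over a set of label maps.
func list(all []map[string]string, selector map[string]string) []map[string]string {
	var out []map[string]string
	for _, l := range all {
		if matches(l, selector) {
			out = append(out, l)
		}
	}
	return out
}

func main() {
	// A generated Upstream carries only the discovered_by label,
	// not the watchLabels from the Service.
	upstreams := []map[string]string{
		{"discovered_by": "kubernetesplugin"},
	}
	// Selecting on the watchLabels (aaa: bbb) returns nothing...
	fmt.Println(len(list(upstreams, map[string]string{"aaa": "bbb"}))) // 0
	// ...while selecting on discovered_by finds the Upstream.
	fmt.Println(len(list(upstreams, map[string]string{"discovered_by": "kubernetesplugin"}))) // 1
}
```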
Having toyed with both options, both seem viable in the sense that they allow resyncs to occur successfully.

Removing the labels from the Upstream selector seems to be the better option for a couple of reasons:
- This behavior has been present since this repo was first split out from solo-projects, so there is little context around why this decision was made, and I'm not clear whether the selector was actually ever populated in the `opts` passed to `StartUDS()` at that point.
- Created Upstreams do get the `discovered_by: kubernetesplugin` label, which is selected on and should limit any performance impact incurred by expanding the scope of the selector.
- It is possible, though unlikely, that Upstreams not created by Discovery carry that label; even in this case there would be no additional calls to etcd, just a larger return from the `List()` call and additional elements to iterate over (which happens once when converting the list to a map), so any performance impact should be negligible.
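A tiny sketch of that cost argument (illustrative only): extra items in the `List()` result just enlarge a one-time linear pass that builds the name-to-resource map, with no additional API calls involved.

```go
package main

import "fmt"

// toMap builds a name-keyed lookup from a listed result set in a single
// linear pass, which is the only place the extra elements are touched.
func toMap(names []string) map[string]bool {
	m := make(map[string]bool, len(names))
	for _, n := range names {
		m[n] = true
	}
	return m
}

func main() {
	// Suppose a stray, manually-created Upstream also carries the label:
	// the broader selector returns one extra element and nothing else changes.
	listed := []string{"default-example-svc-8000", "manually-created-us"}
	fmt.Println(len(toMap(listed))) // 2
}
```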
Therefore I'm comfortable treating this behavior as a bug and "fixing" it accordingly
A fix for this has merged to OSS main and will be backported to all supported LTS branches (i.e. back to 1.14).
@DuncanDoyle, please advise if there is any particular version that will require an OSS and/or EE release ASAP, or if it's sufficient to have this bugfix roll out whenever the next release(s) otherwise occur.
A fix for this has now been released in all LTS versions for OSS and EE, as well as the OSS beta:
- OSS: 1.18.0-beta24, 1.17.11, 1.16.21, 1.15.32, 1.14.32
- EE: 1.17.3, 1.16.15, 1.15.22, 1.14.23

Additionally, this fix will be in EE 1.18.0-beta2.
https://solo-io.zendesk.com/agent/tickets/4679 seems to be similar.
Gloo Edge Product
Enterprise
Gloo Edge Version
v1.14.x
Kubernetes Version
1.27
Describe the bug
A customer reported that gloo discovery is not creating all the upstreams that they expect. Deleting and recreating the services doesn't seem to help.
When checking the `discovery` pod logs, they saw errors like:

```
{"level":"error","ts":"2023-08-30T12:43:51.281Z","logger":"uds.v1.event_loop.uds","caller":"syncer/setup_syncer.go:117","msg":"error: event_loop.uds: error in uds plugin : reconciling resource default-xxx-80: updating kube resource default-xxx-80: (want 2398796740): upstreams.gloo.solo.io \"default-xxx-80\" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update","version":"undefined","stacktrace":"github.com/solo-io/gloo/projects/discovery/pkg/uds/syncer.RunUDS.func2\n\t/go/pkg/mod/github.com/solo-io/gloo@v1.14.12/projects/discovery/pkg/uds/syncer/setup_syncer.go:117"}
```
Deleting the erroring upstreams didn't seem to resolve the issue (upstreams would get recreated again, with the same error in the logs).
Note: I have also seen a similar error (`metadata.resourceVersion: Invalid value: 0x0: must be specified for an update`) on a clean install, but in my case it didn't seem to affect upstream creation.
Expected Behavior
The resourceVersion error should not occur (and/or should resolve itself with retries), and the expected upstreams should be created.
Steps to reproduce the bug
See above
Additional Environment Detail
No response
Additional Context
No response