solo-io / gloo

The Cloud-Native API Gateway and AI Gateway
https://docs.solo.io/
Apache License 2.0

Invalid resourceVersion error in upstream discovery #8635

Closed: jenshu closed this issue 1 month ago

jenshu commented 1 year ago

Gloo Edge Product

Enterprise

Gloo Edge Version

v1.14.x

Kubernetes Version

1.27

Describe the bug

A customer reported that gloo discovery is not creating all the upstreams that they expect. Deleting and recreating the services doesn't seem to help.

When checking the discovery pod logs, they saw errors like:

    {"level":"error","ts":"2023-08-30T12:43:51.281Z","logger":"uds.v1.event_loop.uds","caller":"syncer/setup_syncer.go:117","msg":"error: event_loop.uds: error in uds plugin : reconciling resource default-xxx-80: updating kube resource default-xxx-80: (want 2398796740): upstreams.gloo.solo.io \"default-xxx-80\" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update","version":"undefined","stacktrace":"github.com/solo-io/gloo/projects/discovery/pkg/uds/syncer.RunUDS.func2\n\t/go/pkg/mod/github.com/solo-io/gloo@v1.14.12/projects/discovery/pkg/uds/syncer/setup_syncer.go:117"}

Deleting the erroring upstreams didn't seem to resolve the issue (the upstreams would get recreated, with the same error in the logs).

Note: I have also seen a similar error (metadata.resourceVersion: Invalid value: 0x0: must be specified for an update) on a clean install, but in my case it didn't seem to affect upstream creation.
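
For reference, one way to confirm which resourceVersion the live Upstream actually has (a sketch; default-xxx-80 is the redacted name from the log above, so substitute the real discovered Upstream name):

    kubectl get upstream -n gloo-system default-xxx-80 -o jsonpath='{.metadata.resourceVersion}{"\n"}'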


Expected Behavior

The resourceVersion error should not occur (or should resolve itself with retries), and the expected upstreams should be created.

Steps to reproduce the bug

See above

Additional Environment Detail

No response

Additional Context

No response


DuncanDoyle commented 9 months ago

@jenshu Any update on this issue? Without a reproducer this is not actionable. Did "the customer" ever come back to you? If not, I'm gonna close this ticket (and can create a new one, or re-open if this issue surfaces again).

jenshu commented 9 months ago

@DuncanDoyle - responded offline with context

DuncanDoyle commented 9 months ago

Closing, as there is no reproducer. Can re-open if this surfaces again.

ShaunPTK commented 2 months ago

Steps to reproduce:

  1. Check the discovery pod logs to make sure there are no errors being streamed.
  2. Configure the Gloo settings to only auto-discover services that carry this label (a fuller sketch of the same Settings change follows this list):
    spec:
      ...
      discovery:
        ...
        udsOptions:
          watchLabels:
            expose_api: "true"
  3. Remove any existing auto-discovered Upstreams: kubectl -n gloo-system delete upstream -l discovered_by=kubernetesplugin
  4. Configure a service with the label expose_api: "true" - Gloo should create an Upstream in the gloo-system namespace for it.
  5. Check the discovery pod logs - you should see errors relating to this service.
  6. Remove the watchLabels setting above - the errors should cease.
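
A fuller sketch of the Settings change from step 2, assuming the default Settings resource created by the Gloo install is named "default" in the gloo-system namespace (adjust the name and namespace for your setup):

    kubectl -n gloo-system patch settings default --type merge -p '{"spec":{"discovery":{"udsOptions":{"watchLabels":{"expose_api":"true"}}}}}'
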
sam-heilbron commented 1 month ago

See https://github.com/solo-io/gloo/issues/9968#issuecomment-2329006317 for inspiration, as the underlying problems are the same (though the code paths differ).

jenshu commented 1 month ago

This doesn't seem related to persistProxySpec. It's happening with upstreams, and with persistProxySpec=false.

Here are some steps to reproduce from a clean cluster:

  1. Create a cluster

    kind create cluster
  2. Install GG with uds watch labels:

    helm install -n gloo-system gloo gloo/gloo --create-namespace --version 1.18.0-beta21 -f - <<EOF
    discovery:
      udsOptions:
        watchLabels:
          aaa: bbb
    EOF
  3. Create a service with the uds watch labels:

    kubectl apply -f - <<EOF
    apiVersion: v1
    kind: Service
    metadata:
      name: example-svc
      namespace: default
      labels:
        aaa: bbb
    spec:
      ports:
      - name: http
        port: 8000
        targetPort: 8080
      selector:
        ccc: ddd
    EOF
  4. Verify that the Upstream was created, with the expected spec:

    kubectl get us -n gloo-system default-example-svc-8000 -oyaml
  5. Check the discovery logs.

    kubectl logs -f -n gloo-system deploy/discovery

    At this point, the upstream got created correctly, but it seems like something then tried to update the upstream, resulting in a bunch of these errors (note: these errors may not happen every time on create, but they do consistently happen later when we try to update the service, below):

    {"level":"error","ts":"2024-09-12T18:56:44.056Z","logger":"uds.v1.event_loop.uds","caller":"discovery/discovery.go:150","msg":"failed reconciling upstreams","version":"1.18.0-beta21","discovered_by":"kubernetesplugin","upstreams":1,"error":"reconciling resource default-example-svc-8000: updating kube resource default-example-svc-8000: (want 792): upstreams.gloo.solo.io \"default-example-svc-8000\" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update","errorVerbose":"upstreams.gloo.solo.io \"default-example-svc-8000\" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update\nupdating kube resource default-example-svc-8000: (want 792)\ngithub.com/solo-io/solo-kit/pkg/errors.Wrapf\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/errors/errors.go:15\ngithub.com/solo-io/solo-kit/pkg/api/v1/clients/kube.(*ResourceClient).Write\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/api/v1/clients/kube/resource_client.go:232\ngithub.com/solo-io/solo-kit/pkg/api/v1/reconcile.writeDesiredResource\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/api/v1/reconcile/reconciler.go:105\ngithub.com/solo-io/solo-kit/pkg/api/v1/reconcile.attemptSyncResource\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/api/v1/reconcile/reconciler.go:72\ngithub.com/solo-io/solo-kit/pkg/api/v1/reconcile.(*reconciler).syncResource.func1\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/api/v1/reconcile/reconciler.go:66\nk8s.io/client-go/util/retry.OnError.func1\n\t/go/pkg/mod/k8s.io/client-go@v0.29.2/util/retry/util.go:51\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection\n\t/go/pkg/mod/k8s.io/apimachinery@v0.29.2/pkg/util/wait/wait.go:145\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\t/go/pkg/mod/k8s.io/apimachinery@v0.29.2/pkg/util/wait/backoff.go:461\nk8s.io/client-go/util/retry.OnError\n\t/go/pkg/mod/k8s.io/client-go@v0.29.2/util/retry/util.go:50\ngithub.com/solo-io/solo-kit/pkg/errors.RetryOnConflict\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/errors/errors.go:117\ngithub.com/solo-io/solo-kit/pkg/api/v1/reconcile.(*reconciler).syncResource\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/api/v1/reconcile/reconciler.go:64\ngithub.com/solo-io/solo-kit/pkg/api/v1/reconcile.(*reconciler).Reconcile\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/api/v1/reconcile/reconciler.go:44\ngithub.com/solo-io/gloo/projects/gloo/pkg/api/v1.(*upstreamReconciler).Reconcile\n\t/workspace/gloo/projects/gloo/pkg/api/v1/upstream_reconciler.sk.go:46\ngithub.com/solo-io/gloo/projects/gloo/pkg/discovery.(*UpstreamDiscovery).Resync\n\t/workspace/gloo/projects/gloo/pkg/discovery/discovery.go:146\ngithub.com/solo-io/gloo/projects/gloo/pkg/discovery.(*UpstreamDiscovery).StartUds.func1\n\t/workspace/gloo/projects/gloo/pkg/discovery/discovery.go:117\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695\nreconciling resource default-example-svc-8000\ngithub.com/solo-io/solo-kit/pkg/errors.Wrapf\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/errors/errors.go:15\ngithub.com/solo-io/solo-kit/pkg/api/v1/reconcile.(*reconciler).Reconcile\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/api/v1/reconcile/reconciler.go:45\ngithub.com/solo-io/gloo/projects/gloo/pkg/api/v1.(*upstreamReconciler).Reconcile\n\t/workspace/gloo/projects/gloo/pkg/api/v1/upstream_reconciler.sk.go:46\ngithub.com/solo-io/gloo/projects/gloo/pkg/discovery.(*UpstreamDiscovery).Resync\n\t/workspace/gloo/projects/gloo/pkg/discovery/discovery.go:146\ngithub.com/solo-io/gloo/projects/gloo/pkg/discovery.(*UpstreamDiscovery).StartUds.func1\n\t/workspace/gloo/projects/gloo/pkg/discovery/discovery.go:117\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695","stacktrace":"github.com/solo-io/gloo/projects/gloo/pkg/discovery.(*UpstreamDiscovery).Resync\n\t/workspace/gloo/projects/gloo/pkg/discovery/discovery.go:150\ngithub.com/solo-io/gloo/projects/gloo/pkg/discovery.(*UpstreamDiscovery).StartUds.func1\n\t/workspace/gloo/projects/gloo/pkg/discovery/discovery.go:117"}

  6. Make a change to the service (e.g. add more labels):

    kubectl apply -f - <<EOF
    apiVersion: v1
    kind: Service
    metadata:
      name: example-svc
      namespace: default
      labels:
        aaa: bbb
        xxx: yyy
    spec:
      ports:
      - name: http
        port: 8000
        targetPort: 8080
      selector:
        ccc: ddd
    EOF

More resourceVersion errors are seen in the logs, and the upstream does not pick up the changes (it still shows the old labels in discoveryMetadata). This is a problem because further updates to the service are not reflected in the upstream.
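
A quick way to see the stale labels on the discovered Upstream (a sketch; the exact field path can vary by version, but the discovered labels live under the Upstream's discoveryMetadata):

    kubectl get upstream -n gloo-system default-example-svc-8000 -o jsonpath='{.spec.discoveryMetadata.labels}{"\n"}'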

Note: making any change to the Service spec.selector does not cause the Upstream to get updated, regardless of whether uds watch labels are being used (and there are no errors in the logs). It looks like there is already an issue for that: https://github.com/solo-io/gloo/issues/8227

nfuden commented 1 month ago

@DuncanDoyle what versions does it make sense to target this for?

bewebi commented 1 month ago

I believe I've gotten to the root cause of this

In a nutshell:

The key question from here is: should Upstreams created by discovery contain the watchLabels, or perhaps all labels from the Service?
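
To make the two options concrete, here is a rough, purely illustrative sketch (reusing the example names from earlier in this thread) of how the labels on a discovered Upstream would differ:

    # today: only the discovery marker label is set
    metadata:
      name: default-example-svc-8000
      namespace: gloo-system
      labels:
        discovered_by: kubernetesplugin

    # if the watchLabels (or all Service labels) were also copied over
    metadata:
      name: default-example-svc-8000
      namespace: gloo-system
      labels:
        discovered_by: kubernetesplugin
        aaa: bbb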

bewebi commented 1 month ago

Having toyed with both options, I find they both seem viable, in the sense that either one allows resyncs to occur successfully.

Removing labels from the UpstreamSelector seems to be the better option, for a couple of reasons:

  1. This behavior has been present since this repo was first split out from solo-projects, so there is little context around why this decision was made, and I'm not clear whether the selector was actually ever populated in the opts passed to StartUDS() at that point.

  2. Created upstreams do get the discovered_by: kubernetesplugin label, which is selected on and should limit any performance impact incurred by expanding the scope of the selector (see the quick check below). It is possible, though unlikely, that Upstreams that are not created by Discovery would have that label, but even in this case there would be no additional calls to etcd, just a larger return from a List() call and additional elements to iterate over, which happens once when converting the list to a map, so any performance impact should be negligible.
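
As a quick check of that selector behavior, the same label selector already used in the repro steps lists only the discovered Upstreams:

    kubectl get upstreams -n gloo-system -l discovered_by=kubernetesplugin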

Therefore I'm comfortable treating this behavior as a bug and "fixing" it accordingly

bewebi commented 1 month ago

A fix for this has been merged to OSS main and will be backported to all supported LTS branches (i.e., back to 1.14).

@DuncanDoyle, please advise if there is any particular version that will require an OSS and/or EE release ASAP, or if it's sufficient to have this bugfix roll out whenever the next release(s) otherwise occur.

bewebi commented 1 month ago

A fix for this has now been released in all LTS versions for OSS and EE, as well as the OSS beta:

OSS: 1.18.0-beta24, 1.17.11, 1.16.21, 1.15.32, 1.14.32
EE: 1.17.3, 1.16.15, 1.15.22, 1.14.23

Additionally, this fix will be in EE 1.18.0-beta2.

nfuden commented 3 weeks ago

https://solo-io.zendesk.com/agent/tickets/4679 seems to be similar.