solo-io / gloo

The Feature-rich, Kubernetes-native, Next-Generation API Gateway Built on Envoy
https://docs.solo.io/
Apache License 2.0

"translating proxies error" after upgrading from 1.16.9 to 1.17.1 when "gloo.gateway.persistProxySpec=true" #9968

Open htech7x opened 2 weeks ago

htech7x commented 2 weeks ago

Gloo Edge Product

Enterprise

Gloo Edge Version

1.17.1

Kubernetes Version

1.28.5

Describe the bug

After upgrading Gloo EE from 1.16.9 to 1.17.1 with "gloo.gateway.persistProxySpec=true", newly created "virtual services" no longer receive a "status", and the "Proxy" object stops updating its configuration.

Expected Behavior

"Proxy" updates its config when a new VS is created

Steps to reproduce the bug

  1. Install Gloo EE 1.16.9 with the option "gloo.gateway.persistProxySpec=true"
    helm install gloo glooe/gloo-ee --version $GLOO_EE_VERSION --namespace gloo-system --create-namespace --set-string license_key=$GLOO_LICENSE_KEY --set gloo.gateway.persistProxySpec=true --set gloo-fed.enabled=false
  2. Create a VS, for example by following https://docs.solo.io/gloo-edge/latest/guides/traffic_management/hello_world/
  3. Upgrade Gloo EE to 1.17.1
    export NEW_VERSION=1.17.1
    helm pull glooe/gloo-ee --version $NEW_VERSION --untar
    kubectl apply -f gloo-ee/charts/gloo/crds
    helm get values gloo -n gloo-system > values.yaml
    helm upgrade -n gloo-system gloo glooe/gloo-ee \
    -f values.yaml \
    --version=$NEW_VERSION \
    --set license_key=$LICENSE_KEY
  4. Check the "gloo" logs:
    kubectl logs deploy/gloo -n gloo-system
    ...
    {"level":"error","ts":"2024-08-28T18:41:09.630Z","logger":"gloo-ee.v1.event_loop.setup","caller":"setup/setup_syncer.go:1066","msg":"gloo main event loop","version":"1.17.1","error":"event_loop.gloo: 1 error occurred:\n\t* translating proxies: 1 error occurred:\n\t* reconciling resource gateway-proxy: updating kube resource gateway-proxy: (want 43162817): proxies.gloo.solo.io \"gateway-proxy\" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update\n\n\n\n","errorVerbose":"1 error occurred:\n\t* translating proxies: 1 error occurred:\n\t* reconciling resource gateway-proxy: updating kube resource gateway-proxy: (want 43162817): proxies.gloo.solo.io \"gateway-proxy\" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update\n\n\n\n\nevent_loop.gloo\ngithub.com/solo-io/go-utils/errutils.AggregateErrs\n\t/go/pkg/mod/github.com/solo-io/go-utils@v0.25.1/errutils/aggregate_errs.go:19\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695","stacktrace":"github.com/solo-io/gloo/projects/gloo/pkg/syncer/setup.RunGlooWithExtensions.func11\n\t/go/pkg/mod/github.com/solo-io/gloo@v1.17.4/projects/gloo/pkg/syncer/setup/setup_syncer.go:1066"}
  5. Create new pod/svc and VS:
    # web.yaml
    apiVersion: gateway.solo.io/v1
    kind: VirtualService
    metadata:
      name: web
      namespace: gloo-system
    spec:
      virtualHost:
        domains:
        - "web.com"
        routes:
        - matchers:
          - prefix: /
          routeAction:
            single:
              upstream:
                name: default-web-80
                namespace: gloo-system

    kubectl run web --image nginx --expose --port 80
    kubectl apply -f web.yaml
  6. Check the status of the VS (the newly created VS has no status):

    glooctl get vs
    +-----------------+--------------+---------+------+----------+-----------------+-----------------------------------+
    | VIRTUAL SERVICE | DISPLAY NAME | DOMAINS | SSL  |  STATUS  | LISTENERPLUGINS |              ROUTES               |
    +-----------------+--------------+---------+------+----------+-----------------+-----------------------------------+
    | pet             |              | pet.com | none | Accepted |                 | / ->                              |
    |                 |              |         |      |          |                 | gloo-system.default-petstore-8080 |
    |                 |              |         |      |          |                 | (upstream)                        |
    | web             |              | web.com | none |          |                 | / ->                              |
    |                 |              |         |      |          |                 | gloo-system.default-web-80        |
    |                 |              |         |      |          |                 | (upstream)                        |
    +-----------------+--------------+---------+------+----------+-----------------+-----------------------------------+
  7. Check the "proxy" config (it contains nothing about the recently created VS "web"):

    
    kubectl get proxies gateway-proxy -n gloo-system -o yaml
    apiVersion: gloo.solo.io/v1
    kind: Proxy
    metadata:
      creationTimestamp: "2024-08-28T18:38:17Z"
      generation: 4
      labels:
        created_by: gloo-gateway-translator
      name: gateway-proxy
      namespace: gloo-system
      resourceVersion: "43162817"
      uid: e98940cf-e833-4a48-a424-a1d476601fab
    spec:
      listeners:
      - bindAddress: '::'
        bindPort: 8080
        httpListener:
          virtualHosts:
          - domains:
            - pet.com
            metadataStatic:
              sources:
              - observedGeneration: "3"
                resourceKind: '*v1.VirtualService'
                resourceRef:
                  name: pet
                  namespace: gloo-system
            name: gloo-system.pet
            routes:
            - matchers:
              - prefix: /
              metadataStatic:
                sources:
                - observedGeneration: "3"
                  resourceKind: '*v1.VirtualService'
                  resourceRef:
                    name: pet
                    namespace: gloo-system
              options:
                prefixRewrite: /api/pets
              routeAction:
                single:
                  upstream:
                    name: default-petstore-8080
                    namespace: gloo-system
        metadataStatic:
          sources:
          - observedGeneration: "3"
            resourceKind: '*v1.Gateway'
            resourceRef:
              name: gateway-proxy
              namespace: gloo-system
        name: listener-::-8080
        useProxyProto: false
      - bindAddress: '::'
        bindPort: 8443
        httpListener: {}
        metadataStatic:
          sources:
          - observedGeneration: "3"
            resourceKind: '*v1.Gateway'
            resourceRef:
              name: gateway-proxy-ssl
              namespace: gloo-system
        name: listener-::-8443
        useProxyProto: false
    status:
      statuses:
        gloo-system:
          reportedBy: gloo
          state: 1
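
The failure signature from step 4 can also be checked for mechanically. The following is a minimal sketch, not part of the original report: the function name `proxy_update_blocked` and the `KUBECTL` override variable are hypothetical, and the only thing it relies on is the `kubectl logs deploy/gloo -n gloo-system` command and the error text shown above.

```shell
# Sketch: detect the "resourceVersion ... must be specified for an update"
# error from step 4 in the gloo logs.
# KUBECTL is a hypothetical override hook (not a Gloo or kubectl feature) so
# the function can be exercised without a cluster; it defaults to kubectl.
KUBECTL="${KUBECTL:-kubectl}"

proxy_update_blocked() {
  # Exit status 0 if the logs contain the error text, non-zero otherwise.
  "$KUBECTL" logs deploy/gloo -n gloo-system 2>/dev/null \
    | grep -F -q 'metadata.resourceVersion: Invalid value'
}

# Usage:
#   if proxy_update_blocked; then echo "gloo cannot update the persisted Proxy"; fi
```

The check only looks for the error string, so it says nothing about whether the Proxy has since recovered; older log lines will still match.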


Additional Environment Detail

No response

Additional Context

No response

sam-heilbron commented 2 weeks ago

Internal slack thread: https://solo-io-corp.slack.com/archives/CEDCS8TAP/p1724785852085279

Potentially relevant PR: https://github.com/solo-io/gloo/pull/9310#discussion_r1562623909

soloio-bot commented 2 weeks ago

Zendesk ticket #4392 has been linked to this issue.

sam-heilbron commented 1 week ago

This is another instance of https://github.com/solo-io/gloo/issues/6406. I think we got bitten by this in part because we do not recommend using persistProxySpec=true and have migrated all of our tests to the recommended setting. We need a single test that verifies that, when proxies are persisted, you can upgrade without error (and Gloo continues to process resources).
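
The "continues to process resources" part of such a test could be approximated by polling a freshly created VirtualService until it reports a status. This is only a rough sketch, not the project's actual test: `vs_has_status`, `wait_for_vs_status`, and the `KUBECTL` override are hypothetical names, and it assumes an unprocessed VirtualService has an empty `.status.statuses` field (the same shape the Proxy status shows in the report).

```shell
# Sketch: wait until a VirtualService has been given a status by Gloo.
# KUBECTL is a hypothetical override hook for testing; defaults to kubectl.
KUBECTL="${KUBECTL:-kubectl}"

vs_has_status() {
  # $1 = VirtualService name, $2 = namespace.
  # Assumption: a VS that Gloo has not processed has no .status.statuses.
  status="$("$KUBECTL" get virtualservices.gateway.solo.io "$1" -n "$2" \
    -o jsonpath='{.status.statuses}')"
  [ -n "$status" ]
}

wait_for_vs_status() {
  # $1 = name, $2 = namespace, $3 = attempts (default 30),
  # $4 = seconds to sleep between attempts (default 2).
  attempts="${3:-30}"
  delay="${4:-2}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if vs_has_status "$1" "$2"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage, after the helm upgrade and `kubectl apply -f web.yaml`:
#   wait_for_vs_status web gloo-system || echo "VS never received a status"
```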

soloio-bot commented 3 days ago

Zendesk ticket #4499 has been linked to this issue.

sam-heilbron commented 2 days ago

I documented reproduction steps here as well: https://github.com/solo-io/gloo-gateway-shared-resources/tree/main/issues/gloo/9968

nfuden commented 2 days ago

The workaround is to delete the Proxy resource; Gloo will then self-heal.
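
For the setup in this report, that workaround might look like the sketch below. The Proxy name `gateway-proxy` and namespace `gloo-system` are taken from the reproduction steps; the `delete_stale_proxy` wrapper and the `KUBECTL` override are hypothetical conveniences, not part of Gloo.

```shell
# Sketch: delete the persisted Proxy so Gloo can re-create it.
# KUBECTL is a hypothetical override hook for testing; defaults to kubectl.
KUBECTL="${KUBECTL:-kubectl}"

delete_stale_proxy() {
  # $1 = Proxy name (default: gateway-proxy)
  # $2 = namespace  (default: gloo-system)
  "$KUBECTL" delete proxies.gloo.solo.io "${1:-gateway-proxy}" -n "${2:-gloo-system}"
}

# Usage:
#   delete_stale_proxy
#   kubectl get proxies gateway-proxy -n gloo-system -o yaml   # inspect the new Proxy
```

Other installations may have additional or differently named Proxy objects, so check `kubectl get proxies -A` first rather than relying on the defaults.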