solo-io / gloo

The Feature-rich, Kubernetes-native, Next-Generation API Gateway Built on Envoy
https://docs.solo.io/
Apache License 2.0

gloo gateway stuck on error state when removing the virtual service secret #5079

Open DoroNahari opened 3 years ago

DoroNahari commented 3 years ago

Describe the bug The Gloo gateway stays stuck in an error state after the secret referenced by a virtual service is removed.

To Reproduce Steps to reproduce the behavior:

  1. glooctl create secret tls --name upstream-tls --certchain tls.crt --privatekey tls.key
  2. Apply the following virtual service with kubectl apply -f (taken from the Gloo docs):
    apiVersion: gateway.solo.io/v1
    kind: VirtualService
    metadata:
      name: animal
      namespace: gloo-system
    spec:
      displayName: animal
      sslConfig:
        secretRef:
          name: upstream-tls
          namespace: gloo-system
        sniDomains:
        - animalstore.example.com
      virtualHost:
        domains:
        - animalstore.example.com
        routes:
        - matchers:
          - exact: /animals
          options:
            prefixRewrite: /api/pets
          routeAction:
            single:
              upstream:
                name: default-petstore-8080
                namespace: gloo-system
  3. Remove the secret with kubectl delete secrets -n gloo-system upstream-tls
  4. Run kubectl get gateways -n gloo-system gateway-proxy-ssl -o yaml
  5. See the error (as expected): Error: SSLConfigError. Reason: SSL secret not found: list did not find secret gloo-system.upstream-tls1\n\t*
  6. Remove the virtual service: kubectl delete virtualservice -n gloo-system animal
  7. Repeat steps 4-5 and see that the error persists.
  8. Apply a new virtual service, repeat step 4, and see that the error has disappeared.

Expected behavior After the virtual service whose secret is missing has been removed (after step 6), the error on the gateway-proxy-ssl gateway should be cleared.


kdorosh commented 3 years ago

Was the virtual service removed in step 6 the only virtual service selected by the ssl gateway?

I ask because if we have a gateway that doesn't select any virtual services, then we optimize and do not run the translation loop or update any statuses (in a sense, the ssl gateway is "orphaned" as a parent). That said, this may be poor UX that we could still improve.
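To make the "orphaned" case concrete, here is a minimal sketch in Go of how skipping translation can leave a stale status behind. It is a toy model, not Gloo's actual code: the `VirtualService` and `Gateway` types, the `reconcile` and `statusAfterRepro` functions, and the status strings are all hypothetical names chosen for illustration, and the early return is only an assumption about how the optimization behaves.

```go
package main

import "fmt"

// VirtualService and Gateway are hypothetical, simplified stand-ins;
// they are NOT Gloo's real resource types.
type VirtualService struct {
	Name      string
	SecretRef string // name of the TLS secret this virtual service references
}

type Gateway struct {
	Name   string
	Status string // last reported status; only rewritten when translation runs
}

// reconcile models one pass of the control loop for a single gateway.
func reconcile(gw *Gateway, selected []VirtualService, secrets map[string]bool) {
	if len(selected) == 0 {
		// Assumed optimization: nothing to translate, so the loop is skipped
		// and the previously written status is left untouched.
		return
	}
	for _, vs := range selected {
		if !secrets[vs.SecretRef] {
			gw.Status = "Rejected: SSL secret not found: " + vs.SecretRef
			return
		}
	}
	gw.Status = "Accepted"
}

// statusAfterRepro replays the reproduction steps and returns the final
// gateway status. others are additional valid virtual services selected
// by the same gateway.
func statusAfterRepro(others []VirtualService) string {
	gw := &Gateway{Name: "gateway-proxy-ssl"}
	animal := VirtualService{Name: "animal", SecretRef: "upstream-tls"}

	secrets := map[string]bool{"upstream-tls": true}
	for _, vs := range others {
		secrets[vs.SecretRef] = true
	}

	selected := append([]VirtualService{animal}, others...)
	reconcile(gw, selected, secrets) // steps 1-2: everything is valid

	delete(secrets, "upstream-tls")
	reconcile(gw, selected, secrets) // steps 3-5: the error is reported

	reconcile(gw, others, secrets) // step 6: "animal" has been deleted
	return gw.Status
}

func main() {
	fmt.Println("animal was the only virtual service:", statusAfterRepro(nil))
	fmt.Println("another valid virtual service remains:",
		statusAfterRepro([]VirtualService{{Name: "animal-2", SecretRef: "upstream-tls-2"}}))
}
```

In this model the error only lingers when the deleted virtual service was the last one the gateway selected. If other valid virtual services remain, the next pass rewrites the status.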

If you had several virtual services selected by this ssl gateway, then this sounds more like a bug.

Thanks!

DoroNahari commented 2 years ago

Was the virtual service removed in step 6 the only virtual service selected by the ssl gateway?

No, there were others

sam-heilbron commented 2 years ago

I was able to confirm that following the steps provided, when the invalid VirtualService is deleted, the ssl gateway is NOT updated to reflect the valid config. However, just following the steps provided meant we were running into the case where the ssl gateway is only selecting a single virtual service, so when we delete it, we do not run the translation loop, and thus do not update the status.

I repeated the testing steps, but this time applied a new tls secret upstream-tls-2 and a new virtual service animal-2, each just a replica of the provided config. This ensured the ssl gateway always had at least 1 valid virtual service. This time, when I deleted the secret, the error was propagated to the gateway (as expected), and when I deleted the virtual service, the error was removed. This seems to indicate the behavior mentioned here: https://github.com/solo-io/gloo/issues/5079#issuecomment-921951005

Can you send along the ssl Gateway config, and the config for another Virtual Service selected by the ssl Gateway?

kevin-shelaga commented 2 years ago

This just happened with a customer https://solo-io-corp.slack.com/archives/C028P08TEAJ/p1652733167329269

github-actions[bot] commented 3 months ago

This issue has been marked as stale because of no activity in the last 180 days. It will be closed in the next 180 days unless it is tagged "no stalebot" or other activity occurs.