solo-io / gloo

The Feature-rich, Kubernetes-native, Next-Generation API Gateway Built on Envoy
https://docs.solo.io/
Apache License 2.0

Metric validation_gateway_solo_io_valid_config seems to get stuck sometimes. #8427

Open Stefan-Balta opened 1 year ago

Stefan-Balta commented 1 year ago

Gloo Edge Version

1.13.x

Kubernetes Version

1.23.x

Describe the bug

When creating or deleting VirtualServices, Upstreams, UpstreamGroups, and Services in batch, the logs of the gloo pod contain warnings like the following:


{"level":"warn","ts":"2023-06-29T08:39:28.234Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.envoyTranslatorSyncer","caller":"syncer/envoy_translator_syncer.go:146","msg":"Proxy had invalid config after xds sanitization","version":"1.13.20","proxy":"name:\"gateway-proxy\"  namespace:\"gloo-system\"","error":"3 errors occurred:  
        * invalid resource gloo-system.gateway-proxy    
        * upstream group not found, (Name: demo12, Namespace: pes)  
        * WARN:    [Route Warning: InvalidDestinationWarning. Reason: *v1.UpstreamGroup { pes.demo12 } not found Route Warning: InvalidDestinationWarning. Reason: *v1.UpstreamGroup { pes.demo12 } not found]  "}

{"level":"warn","ts":"2023-06-29T08:43:42.181Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.envoyTranslatorSyncer","caller":"syncer/envoy_translator_syncer.go:146","msg":"Proxy had invalid config after xds sanitization","version":"1.13.20","proxy":"name:\"private-gateway-proxy\"  namespace:\"gloo-system\"","error":"2 errors occurred:
    * invalid resource pes.demo12
    * destination # 1: upstream not found: list did not find upstream pes.demo12-9898
"}

I believe this happens because of the order of creation/deletion: an Upstream may be deleted before the VirtualService that references it, or a VirtualService may be created before its Upstream exists. The problem is that the metric validation_gateway_solo_io_valid_config may then get stuck at value 0, even once all the referenced resources are in place again.

This happens in v1.13.20 and v1.14.9, but not in v1.12.33 or v1.12.56.

glooctl check doesn't report any errors.
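
For reference, the raw metric can be read straight from the gloo pod's metrics endpoint (a sketch; this assumes a default install where the gloo deployment serves Prometheus metrics on port 9091):

    kubectl -n gloo-system port-forward deploy/gloo 9091 &
    curl -s http://localhost:9091/metrics | grep validation_gateway_solo_io_valid_config

When the bug triggers, this keeps reporting 0 even after the batch of resources has been fully applied or fully deleted.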

Steps to reproduce the bug

Create a Kubernetes manifest with the following resources, in the following order:

  1. Service
  2. Deployment
  3. Upstream
  4. UpstreamGroup
  5. VirtualService
  6. Additional VirtualService

Apply the manifest and delete it. The metric may or may not get stuck at value 0.
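
For illustration, a minimal manifest along these lines might look like the sketch below. The names, namespace, and port are taken from the log output above; the container image is a placeholder, and the second VirtualService from step 6 is omitted for brevity:

    apiVersion: v1
    kind: Service
    metadata:
      name: demo12
      namespace: pes
    spec:
      selector:
        app: demo12
      ports:
        - port: 9898
          targetPort: 9898
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: demo12
      namespace: pes
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: demo12
      template:
        metadata:
          labels:
            app: demo12
        spec:
          containers:
            - name: demo12
              image: nginx  # placeholder workload
              ports:
                - containerPort: 9898
    ---
    apiVersion: gloo.solo.io/v1
    kind: Upstream
    metadata:
      name: demo12-9898
      namespace: pes
    spec:
      kube:
        serviceName: demo12
        serviceNamespace: pes
        servicePort: 9898
    ---
    apiVersion: gloo.solo.io/v1
    kind: UpstreamGroup
    metadata:
      name: demo12
      namespace: pes
    spec:
      destinations:
        - weight: 1
          destination:
            upstream:
              name: demo12-9898
              namespace: pes
    ---
    apiVersion: gateway.solo.io/v1
    kind: VirtualService
    metadata:
      name: demo12
      namespace: pes
    spec:
      virtualHost:
        domains:
          - demo12.example.com
        routes:
          - matchers:
              - prefix: /
            routeAction:
              upstreamGroup:
                name: demo12
                namespace: pes

Running kubectl apply -f and then kubectl delete -f against a file like this matches the ordering race described above: kubectl processes the documents top to bottom on both apply and delete, so during deletion there is a window where the VirtualService still references an already-deleted UpstreamGroup, which lines up with the warnings in the logs.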

Expected Behavior

I expect the metric to return to, and stay at, value 1 once the applied configuration is consistent.

Additional Context

No response

DraganDjuricOB commented 8 months ago

This issue is still present in both v1.13.27 and v1.14.21.

DanijelaPet commented 8 months ago

The issue remains in v1.15.14.

github-actions[bot] commented 2 weeks ago

This issue has been marked as stale because of no activity in the last 180 days. It will be closed in the next 180 days unless it is tagged "no stalebot" or other activity occurs.