solo-io / gloo

The Feature-rich, Kubernetes-native, Next-Generation API Gateway Built on Envoy
https://docs.solo.io/
Apache License 2.0

Use WaitForCacheSync in Gloo Fed #7057

Open kcbabo opened 2 years ago

kcbabo commented 2 years ago

Gloo Edge Version

1.11.x

Kubernetes Version

No response

Describe the bug

Gloo Fed tries to federate resources before its cache is ready, which leaves those resources in a failed state. Retrying from Gloo Fed fixes the issue because the cache eventually syncs, but Gloo Fed should simply wait for the cache to sync so that these spurious errors never occur. This can possibly be done by setting the skv2 reconciler options to have WaitForCacheSync=true.

{"level":"error","ts":1661842646.8193612,"logger":"gloo-fed","msg":"Failed to list upstreams","version":"1.11.37","error":"the cache is not started, can not read objects","stacktrace":"github.com/solo-io/solo-projects/projects/gloo-fed/pkg/api/fed.gloo.solo.io/v1/federation.(*federatedUpstreamReconciler).ReconcileFederatedUpstream\n\t/workspace/solo-projects/projects/gloo-fed/pkg/api/fed.gloo.solo.io/v1/federation/federation_reconcilers.go:115\ngithub.com/solo-io/solo-projects/projects/gloo-fed/pkg/api/fed.gloo.solo.io/v1/controller.genericFederatedUpstreamReconciler.Reconcile\n\t/workspace/solo-projects/projects/gloo-fed/pkg/api/fed.gloo.solo.io/v1/controller/reconcilers.go:109\ngithub.com/solo-io/skv2/pkg/reconcile.(*runnerReconciler).Reconcile\n\t/go/pkg/mod/github.com/solo-io/skv2@v0.21.7/pkg/reconcile/runner.go:204\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.7/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.7/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.7/pkg/internal/controller/controller.go:214"}

Steps to reproduce the bug

-

Expected Behavior

Gloo Fed should wait for the cache to sync before it starts reconciling.

Additional Context

No response

Ido-Itz commented 1 year ago

We just hit this issue in gloo-fed 1.12.42 (on k8s 1.23.13). The gloo-fed container keeps restarting (100+ times in a day). Strangely enough, it still seems to federate resources to the gloo-ee instances in between restarts.

The logs we're seeing:

{"level":"error","ts":1672662499.5425115,"logger":"gloo-fed.controller.federatedUpstream","msg":"Reconciler error","version":"1.12.42","name":"cic-di-domaintest-api-ui1","namespace":"cic-ui-poc","error":"handler error. retrying: 2 errors occurred:\n\t* the cache is not started, can not read objects\n\t* the cache is not started, can not read objects\n\n","errorVerbose":"handler error. retrying\n\tcontroller.(*Controller).processNextWorkItem:/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.7/pkg/internal/controller/controller.go:253\n\tcontroller.(*Controller).reconcileHandler:/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.7/pkg/internal/controller/controller.go:298\n\treconcile.(*runnerReconciler).Reconcile:/go/pkg/mod/github.com/solo-io/skv2@v0.26.0/pkg/reconcile/runner.go:208\n2 errors occurred:\n\t* the cache is not started, can not read objects\n\t* the cache is not started, can not read objects\n\n","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.7/pkg/internal/controller/controller.go:214"}

This happens not just for FederatedUpstream, but also for FederatedVirtualService, FederatedMatchableHttpGateway, and all other resources.

@jenshu @kcbabo - any ideas?

github-actions[bot] commented 3 months ago

This issue has been marked as stale because of no activity in the last 180 days. It will be closed in the next 180 days unless it is tagged "no stalebot" or other activity occurs.