djannot opened this issue 3 years ago
Here are some additional details provided by the customer:
We observed CDS updates every 6-8 minutes. A couple of them went through just fine, but a specific one, applied to 5 of the 6 pods at the same time, caused the upstream to go unhealthy (all requests resulted in 503s after the update). 1 pod did not see this update and continued to happily serve 200s.
All pods received 13 CDS updates.
5 pods received an update at 10:04:12; all of these pods started returning 503s after that update. 2 pods eventually recovered and started serving 200s again, with no apparent relation to any CDS update.
1 pod did not receive the update at 10:04:12; it did receive an update at 10:03:41, which did not affect the upstream health check, and that pod always returned 200s.
Only the CDS update at 10:04:12 affected the upstream; subsequent updates did not change the state (pods continued serving 503s or 200s).
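One way to correlate observations like these with upstream health is to scrape Envoy's `/clusters` admin endpoint on each gateway-proxy pod and look at the `health_flags` reported per host. The sketch below assumes the default Envoy admin port (19000) for fetching; the parser itself is an illustration that works on any captured `/clusters` dump:

```python
# Sketch: parse the text output of Envoy's /clusters admin endpoint and
# report hosts whose health flags are not "healthy". The admin port and
# the fetch command are assumptions about the environment; fetch with e.g.
#   kubectl exec <gateway-proxy-pod> -- curl -s localhost:19000/clusters

def unhealthy_hosts(clusters_text: str) -> list:
    """Return (cluster, host) pairs whose health_flags are not 'healthy'."""
    bad = []
    for line in clusters_text.splitlines():
        # Lines look like: <cluster>::<host:port>::health_flags::<value>
        parts = line.strip().split("::")
        if len(parts) == 4 and parts[2] == "health_flags" and parts[3] != "healthy":
            bad.append((parts[0], parts[1]))
    return bad

if __name__ == "__main__":
    sample = (
        "my-upstream::10.0.1.5:8080::health_flags::healthy\n"
        "my-upstream::10.0.1.6:8080::health_flags::/failed_active_hc\n"
    )
    print(unhealthy_hosts(sample))  # → [('my-upstream', '10.0.1.6:8080')]
```

Running this against each of the 6 pods around a CDS update would show whether the 503-serving pods actually marked the upstream hosts as failing active health checks.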
looks similar to https://github.com/solo-io/gloo/issues/5442
Describe the bug: A rolling update was performed to change the AMI image of the nodes, and there were no issues during the update itself. The Kubernetes nodes where the gateway-proxy pods run were upgraded first. After the end of the operation everything was working well, but 503 errors (with the UH response flag) started to happen several minutes later. It looks like the 503 errors started after a specific CDS update that was applied to all the nodes but one.
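To pin the start of the 503s to the 10:04:12 CDS update, it helps to tally access-log response flags per minute. The sketch below assumes the default Envoy access-log format, where the response flags (UH = "no healthy upstream") follow the status code; the timestamp and path values are made up for illustration:

```python
# Sketch: tally Envoy access-log response flags per minute, to line up
# 503/UH bursts with CDS update times. Assumes the default access-log
# format: [START_TIME] "METHOD PATH PROTO" STATUS FLAGS ...
import re
from collections import Counter

LOG_RE = re.compile(r'^\[(\S+)T(\d\d:\d\d):\d\S*\] "[^"]*" (\d{3}) (\S+)')

def flags_per_minute(lines):
    """Count (HH:MM, status, flags) occurrences across access-log lines."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m:
            _date, minute, status, flags = m.groups()
            counts[(minute, status, flags)] += 1
    return counts

if __name__ == "__main__":
    sample = [
        '[2021-06-01T10:04:13.123Z] "GET /api HTTP/1.1" 503 UH ...',
        '[2021-06-01T10:04:14.456Z] "GET /api HTTP/1.1" 503 UH ...',
        '[2021-06-01T10:03:59.789Z] "GET /api HTTP/1.1" 200 - ...',
    ]
    print(flags_per_minute(sample))
```

A sharp jump in (10:04, 503, UH) counts on exactly the 5 affected pods, and its absence on the sixth, would match the behavior described above.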
There's only one main Upstream with the following configuration:
After restarting all the Gloo Edge Pods again, the issue disappeared.
To Reproduce: The customer hasn't been able to reproduce the issue.