solo-io / gloo

The Feature-rich, Kubernetes-native, Next-Generation API Gateway Built on Envoy
https://docs.solo.io/
Apache License 2.0

EC2 Upstream not querying on tag (label) updates to added EC2 instances #7996

Closed georgefridrich closed 1 year ago

georgefridrich commented 1 year ago

Gloo Edge Version

1.12.x

Kubernetes Version

None

Describe the bug

EC2 upstreams determine which instances to add based on defined filters made up of key/value pairs. This works well on initial discovery of the instance; however, once an instance has been added to the pool, if the key or value is modified and no longer matches the filter defined in the EC2 upstream, the instance is not removed from the pool and continues to accept requests from GE.

Steps to reproduce the bug

1. Create an EC2 instance with any app, e.g. WordPress, and a key/value tag.
2. Create an EC2 upstream with a filter matching the tag from the first step.
3. Add a matcher in a VS that defines the path and upstream for the EC2 app.
4. Confirm the app can be accessed via the domain in the VS.
5. Remove the tag from the EC2 instance, or change its value.

The instance is not removed and continues to field requests.
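For steps 1–2, a Gloo Edge EC2 upstream with a tag filter looks roughly like the sketch below. The name, region, secret reference, and tag key/value are illustrative, not from the report; field spellings follow the Gloo Edge Upstream API and are worth verifying against your version's docs:

```yaml
apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  name: wordpress-ec2        # hypothetical name
  namespace: gloo-system
spec:
  awsEc2:
    region: us-east-1        # example region
    publicIp: true
    secretRef:               # AWS credentials secret (example)
      name: aws-creds
      namespace: gloo-system
    filters:
    - keyValuePair:
        key: app             # example tag from step 1
        value: wordpress
```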

Expected Behavior

The EC2 instance with the updated tag will be detected and removed from the pool.

Additional Context

Coupa.

ianmacclancy commented 1 year ago

Have we tried removing the filtered tags from the EC2 upstream config? While we don't subscribe to EC2 events, I believe that might unblock them for now. If they need specific instances removed, they just have to make sure the tags removed from the EC2 upstream correspond to those instances.

georgefridrich commented 1 year ago

The filtering needs to stay as-is in the upstream; the change is when the customer updates the value of the key/value pair in the EC2 tagging. We only seem to pick this up on entry: once an EC2 instance is added to the GE upstream pool, we never remove it based on any tagging changes. We poll every 30 seconds for updates to add new instances, but we don't seem to have logic that removes instances that no longer match the filtering criteria. I can recreate this in my local AWS environment; reach out to me if that helps: george.fridrich@solo.io. Thanks!
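The behavior described above (a suspected add-only merge on each 30-second poll) can be sketched in Go. This is illustrative only, not the actual Gloo code; `mergeAddOnly` and the instance IDs are hypothetical:

```go
package main

import "fmt"

// mergeAddOnly models the suspected behavior: each poll only adds newly
// matching instances to the pool and never removes instances whose tags
// have stopped matching the upstream's filter.
func mergeAddOnly(pool map[string]bool, matching []string) {
	for _, id := range matching {
		pool[id] = true // new matches join the pool
	}
	// note: nothing here deletes pool entries absent from `matching`
}

func main() {
	// i-old's tag was changed and no longer matches the filter.
	pool := map[string]bool{"i-old": true}
	mergeAddOnly(pool, []string{"i-new"})
	fmt.Println(len(pool)) // i-old still fields requests
}
```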

nfuden commented 1 year ago

In this issue we:

- MUST implement a simple discovery mechanic for EC2 to update registered EC2 instances for created upstreams.
- MAY implement discovery via a configurable polling mechanism.
- MAY instrument an e2e test with real AWS changes.
- MUST instrument a test that shows us changing the EC2 instances targeted by an EC2 upstream if their tags no longer match after a polling event takes place.

See https://github.com/solo-io/gloo/blob/v1.13.x/projects/gloo/pkg/plugins/aws/ec2/credential_batched_instances.go#L103 https://github.com/solo-io/gloo/blob/v1.13.x/projects/gloo/pkg/plugins/aws/ec2/uds_stub.go#L13

elcasteel commented 1 year ago

We already have polling for EC2 endpoints; the problem is that we only update endpoints that match our filters (not endpoints that have stopped matching our filters). We build the list of matching endpoints here: https://github.com/solo-io/gloo/blob/master/projects/gloo/pkg/plugins/aws/ec2/eds.go#L101 and it makes its way back to the EndpointReconciler.

I think the best solution is to first get the list of existing endpoints that match the watched upstreams, then compare that to the new list and remove from the existing endpoint list any endpoints that are not in the updated list.
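That comparison amounts to a set difference. A minimal Go sketch of the idea; `Endpoint` and `staleEndpoints` are hypothetical stand-ins for Gloo's real types, not the actual reconciler code:

```go
package main

import "fmt"

// Endpoint is a simplified stand-in for Gloo's endpoint type: an EC2
// instance associated with an upstream it currently matches.
type Endpoint struct {
	Name     string // instance ID (illustrative)
	Upstream string // upstream name (illustrative)
}

// staleEndpoints returns endpoints present in the existing snapshot but
// absent from the freshly filtered list, i.e. the ones to remove.
func staleEndpoints(existing, updated []Endpoint) []Endpoint {
	fresh := make(map[Endpoint]bool, len(updated))
	for _, ep := range updated {
		fresh[ep] = true
	}
	var stale []Endpoint
	for _, ep := range existing {
		if !fresh[ep] {
			stale = append(stale, ep)
		}
	}
	return stale
}

func main() {
	// An instance matched two upstreams; after a tag change it matches one.
	existing := []Endpoint{
		{Name: "i-abc", Upstream: "us-1"},
		{Name: "i-abc", Upstream: "us-2"},
	}
	updated := []Endpoint{{Name: "i-abc", Upstream: "us-1"}}
	fmt.Println(staleEndpoints(existing, updated)) // [{i-abc us-2}]
}
```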

Testing:

Manual verification: I can verify my hypothesis about what's going wrong by creating an endpoint that matches two upstreams and then updating the tags so that it only matches one. If I'm correct, then in this case the matches will be updated correctly.

Automated: I think this can be covered by unit tests, because the behavior change is in how we handle the results from AWS rather than in how (or how often) we talk to AWS.

Schedule:

- Setup/manual verification: 1 day (might skip)
- Update EDS snapshot to include endpoints: 1 day
- Implement check: 1/2 day
- Test fix/add unit tests: 1 day

elcasteel commented 1 year ago

I tried to reproduce in a real environment and found that removing endpoints when tags are changed works for me locally. I wrote a unit test to show that the bug I thought I identified on Monday was not actually causing the problem. George is going to try to reproduce and then we'll ask the user for more details.

ianmacclancy commented 1 year ago

Steps to find the current EC2 endpoints:

```shell
kubectl port-forward -n gloo-system deployments/gateway-proxy 19000:19000
```

Then open http://localhost:19000/config_dump?include_eds and search for the EC2 instances' IPv4 addresses in the config.

yuval-k commented 1 year ago

Yes @elcasteel, we always return the full list of endpoints, and the reconcile should remove stale ones (ones that existed before and do not exist now). I also tested this in my local setup and it seems to work.

yuval-k commented 1 year ago

One way I can imagine this not working: we hit some AWS rate limits and stop getting new data altogether?

nfuden commented 1 year ago

For context: we are still committed to working on the doc updates for this.

The issue as described was due to having active health checks on the upstreams combined with continual load. By default, Envoy will not drop a host that is still passing active health checks when it is removed from service discovery, so the instance kept fielding requests.

The solution was to set ignore_health_on_host_removal on the upstream. The docs we are planning to update are those concerning active health checks.
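For reference, the setting lives on the Upstream spec. A minimal sketch, with the field spelling taken from the Gloo Edge Upstream API (worth verifying against your version's docs):

```yaml
spec:
  # remove hosts from the cluster on discovery updates even if
  # active health checks still report them healthy
  ignoreHealthOnHostRemoval: true
```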