kubernetes-sigs / external-dns

Configure external DNS servers (AWS Route53, Google CloudDNS and others) for Kubernetes Ingresses and Services
Apache License 2.0
7.68k stars 2.56k forks

Why does external-dns poll? Polling causes too many API requests #484

Closed: azuretek closed this issue 1 year ago

azuretek commented 6 years ago

Is there a reason external-dns is polling? Why not watch the event stream and trigger updates that way? There's no reason to poll on an interval if you can just watch for changes. It would drastically reduce the number of API requests and also reflect changes much more quickly as services and ingresses are deployed.

ideahitme commented 6 years ago

At some stage it might make sense to integrate "watch" capabilities, but polling is probably required anyway: if External-DNS is not running for a while, the services and ingresses created during that period still need to be handled. I am also not entirely sure how well Kubernetes handles watching; when I looked a year or so ago I found the API to be buggy.

The problem with "watching" is that we cannot simply make an API call to the DNS provider on every single event, because those calls usually cost money and are normally rate limited. So with "watching" we would have to do some aggregation and batching.

We could make the polling interval configurable to reduce the number of API calls; however, I don't believe "watching" is a better solution to the "problem", especially in big clusters with lots of ingresses and services.

azuretek commented 6 years ago

I'm not seeing in the code where the polling is necessary: you can watch the event stream, append changes as they come in, and call submitChanges on the interval that's specified. You're already "batching" in the way you described; it's just happening on a set interval.

The main improvement is that you eliminate API calls altogether until a change actually needs to be made.

If you're concerned about a fresh pod not being aware of changes that happened since starting, you can do one initial poll to get the current state and then update as necessary.

I can contribute the code changes necessary to make this happen if that's a concern.
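To make the idea concrete, here's a rough sketch of the watch-then-batch approach, assuming client-go's shared informers; names like dirty and submitChanges are illustrative, not external-dns's actual code:

```go
package main

import (
	"context"
	"sync/atomic"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Any Service event just flips a flag; no provider API call happens here.
	var dirty atomic.Bool
	factory := informers.NewSharedInformerFactory(client, 0)
	factory.Core().V1().Services().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { dirty.Store(true) },
		UpdateFunc: func(oldObj, newObj interface{}) { dirty.Store(true) },
		DeleteFunc: func(obj interface{}) { dirty.Store(true) },
	})

	ctx := context.Background()
	factory.Start(ctx.Done())
	factory.WaitForCacheSync(ctx.Done()) // the "one initial poll": listers now hold current state

	// Flush on the existing interval, but only when something actually changed.
	for range time.Tick(time.Minute) {
		if dirty.Swap(false) {
			submitChanges() // the expensive, rate-limited provider call happens only now
		}
	}
}

// submitChanges would diff the desired records (built from the informer
// listers) against provider state and apply only the difference.
func submitChanges() {}
```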

Just to clarify my issue and why I think this is a major problem: in our environment we use AWS, and we have several clusters where external-dns is configured. We have lots of domains, so every time external-dns polls we make at least zones*clusters queries to the AWS API (5 clusters with 20 zones = 100 API calls every minute), even when nothing has changed. This is causing us to hit AWS API limits, and the only resolution is either to reduce the number of domains managed by external-dns (requiring an external service to create CNAMEs for us) or to lengthen the polling interval, which directly impacts how quickly we can deploy.

ideahitme commented 6 years ago

I don't believe it is as simple as you described. With the concepts of ownership and multi-target records, you have to maintain information like who owns the record, whether it may be modified, etc., either in memory (a cache) or via a get call to the DNS provider. You want to avoid the latter, but with in-memory storage you might as well diff against the previous state to see if an update is required. I would make this optional and not recommended for use anyway. However, I would first love to see a proposal on how to use "watch", with a proper description of how external-dns would operate and preserve all the features it currently has.

hjacobs commented 6 years ago

Are we even talking about the same thing: polling the Kubernetes API or the AWS API? @azuretek mentions hitting the rate limits of AWS. Maybe we should identify the actual problem before discussing potential solutions or improvements. Is the problem "External DNS hits AWS API rate limits"?

ideahitme commented 6 years ago

@hjacobs I think he means to use Kubernetes API events to watch for changes and only then make the AWS API call, staying idle otherwise.

Currently the problem is that we fetch the list of records from AWS even if no changes are required, and this is the API call we want to prevent. External DNS is already smart enough not to "post" changes to the AWS API if no changes were detected.

External DNS hitting AWS API rate limits is a problem, but I think it should be addressed in other ways, e.g. by caching results: https://github.com/kubernetes-incubator/external-dns/issues/178

prydie commented 6 years ago

How about having the controller trigger off informers watching Service/Ingress, with the informer resync period set to --interval? Then couple that with fronting the registry with a TTL cache (#178), so that fetching records from the provider still occurs once per --interval as it does currently.

The resync period/TTL cache would ensure we maintain the current functionality (i.e. state is always reconciled between the provider and the cluster at least once per --interval) but would greatly improve the latency with which in-cluster changes are reflected in the provider.

API rate limits could be handled by exposing a --cache-ttl flag or similar.
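A rough sketch of that TTL cache, assuming a Registry-like interface with a Records() method; the types here are illustrative stand-ins, not external-dns's actual ones:

```go
package registry

import (
	"sync"
	"time"
)

// Endpoint is a stand-in for a DNS record.
type Endpoint struct {
	DNSName string
	Target  string
}

type Registry interface {
	Records() ([]*Endpoint, error)
}

// cachedRegistry fronts a real registry so the provider's rate-limited
// list call happens at most once per TTL (the proposed --cache-ttl).
type cachedRegistry struct {
	Registry
	ttl time.Duration

	mu      sync.Mutex
	records []*Endpoint
	fetched time.Time
}

func (c *cachedRegistry) Records() ([]*Endpoint, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.records != nil && time.Since(c.fetched) < c.ttl {
		return c.records, nil // fresh enough: no provider call
	}
	records, err := c.Registry.Records()
	if err != nil {
		return nil, err // never cache a failed fetch
	}
	c.records, c.fetched = records, time.Now()
	return records, nil
}
```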

Related: #14

jhohertz commented 6 years ago

I've run into this when running in an AWS account with a large number of Route53 zones. For whatever reason, it polls zones even if no ingress/service/etc. manifests reference them. Is there any way (besides filtering on the domain name parameter) to optimise things so that it doesn't look at zones that aren't relevant to anything configured inside Kubernetes?

(In my case the account had 250+ zones. With no filter, even though the cluster came up with maybe a half-dozen records in just a single zone, all 249 other zones were getting scanned, confirmed by looking at CloudTrail logs, and the resulting API throttling was so bad it sometimes took 10-20 minutes before external-dns could get any records provisioned.)

For the moment I've worked around it by specifying a whitelist of domains that external-dns may manage, to keep its scanning to a minimum.
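(Concretely, that whitelist is the repeatable --domain-filter flag, e.g. --domain-filter=example.org --domain-filter=internal.example.org with placeholder domains; zones that don't match the filter are then not scanned for records.)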

2rs2ts commented 6 years ago

Some things to add to this thread:

Watching k8s events and batching seems fine, but those aren't your only events, yeah? What happens if a record gets modified outside of external-dns' scope? A regular poll as @prydie suggests would still be wise.

@jhohertz to your point, I thought that was unintuitive too, but external-dns has to delete records as well. That said, whitelisting domains is the way to go, and that's what we do: we include all our public domains, plus, in each VPC, only the private domains for the VPC we're running external-dns in.

Just ranting here, but honestly the problem is with Amazon's APIs, which I understand we can't easily change... ideally they would give you the ability to post to an SNS topic or something like that when Route53 calls are made, so we could watch AWS events the same way we watch K8s events.

Evesy commented 5 years ago

We're seeing similar things with the Cloudflare provider.

Our account has approximately 10,000 zones, which means (with the maximum page size allowed) 200 API calls just to return the zones. --domain-filter dictates that we're only actually interested in two of those zones, and in those zones there are only about 75-100 pages of records.

Cloudflare allows 1,200 requests per 5 minutes, which with external-dns' default interval of 1m gives room for about 240 requests a minute; based on the above, that means we're hitting the limit. (The issue is exacerbated if you reuse client credentials on more than one cluster running external-dns.) Increasing the interval is certainly a workaround, but of course it means provisioning of services is slower.

Could this be restructured so that --domain-filter is used at the time zones/records are queried in the provider, so only the relevant zones are looked at, rather than being used to filter records after they have been retrieved from the provider? Or are there other considerations?

Raffo commented 5 years ago

Can you confirm that this is happening with the latest released version (v0.5.11)?

Evesy commented 5 years ago

Correct, 0.5.11

jlamillan commented 5 years ago

@Evesy it won't solve your problem completely, but we've been using the new --events flag introduced in this pull request to significantly reduce the number of regular poll calls to our provider, while actually improving our provisioning time, by combining --events with a long --interval. In our scenario out-of-band DNS changes are unlikely, so it has been working well for us.
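Concretely, the combination is something like --events --interval=1h (the 1h is just an example value): a Service/Ingress change triggers a prompt sync, and the long interval only serves as a safety-net full reconcile.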

rtkgjacobs commented 5 years ago

From looking over things and testing on larger AWS deployments, there are several key problems:

fraenkel commented 5 years ago

In our environment, we too are hitting rate limits on AWS. I have already increased our AWS retries to 10, although now I am considering 13 with a much longer interval. We have added --events support to combat the longer interval, but that too can be rate limited, which puts us back in the same situation. There are three different features I am thinking about:

  1. a separate retry interval for incomplete loops. With a larger interval we cannot wait hours for a retry; there should be a separate back-off for this type of situation (see the sketch after this list).

  2. caching through the plan/apply process, which would reduce the total call count by 2 (best case 3). This is where I see a quick win for something simple to implement. I would have liked to use the existing cache support, but that has issues in the face of failures, so I am going to avoid it. I realize this creates two "caching" solutions, but I view one as safer than the other.

  3. handling multiple k8s clusters. This would also help greatly, but it is the most amount of change and even I don't want to go down this path yet.
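For (1), one illustrative shape, with made-up durations rather than anything external-dns exposes today, is a loop that retries a failed pass on a short exponential back-off instead of waiting out the full interval:

```go
package main

import "time"

// reconcile stands in for one full sync pass against the DNS provider;
// it returns an error when the loop was incomplete (e.g. throttled).
func reconcile() error { return nil }

func main() {
	const (
		interval   = 6 * time.Hour    // a deliberately long regular --interval
		minBackoff = 30 * time.Second // made-up retry floor
		maxBackoff = 10 * time.Minute // made-up retry ceiling
	)
	backoff := minBackoff
	for {
		if err := reconcile(); err != nil {
			// Incomplete loop: retry soon, backing off to stay polite.
			time.Sleep(backoff)
			if backoff *= 2; backoff > maxBackoff {
				backoff = maxBackoff
			}
			continue
		}
		backoff = minBackoff // success: reset and wait out the full interval
		time.Sleep(interval)
	}
}
```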

tsuna commented 5 years ago

In our case we settled on one AWS account per cluster. Putting even just two k8s clusters in the same AWS account easily triggers the default rate limit. Thankfully we don't have that many, so it's manageable this way. It also provides us with greater isolation and accounting across clusters, so it's not like we did this solely for external-dns, but just saying...

fejta-bot commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

tbarrella commented 5 years ago

/remove-lifecycle stale

fejta-bot commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

george-angel commented 5 years ago

/remove-lifecycle stale

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

george-angel commented 4 years ago

/remove-lifecycle stale

jlamillan commented 4 years ago

FYI, support for the --events flag has been merged to master; it triggers a sync loop when an Ingress/Service is added, updated, or deleted.

stevefan1999-personal commented 4 years ago

@jlamillan will this make it into 0.5.19?

ghostsquad commented 4 years ago

Maybe this can be closed? The current release is v0.7.1.

Though I'd like to request that flags like these be surfaced somewhere; they don't appear to be documented anywhere.

jlamillan commented 4 years ago

I think so. The --events flag is available starting in v0.6.0. The setting is also available (as triggerLoopOnEvent) in version 2.18.0+ of the Helm chart for external-dns.

sheerun commented 4 years ago

Hey everyone. 0.7.2 should fix this issue, as the polling interval is preserved even if --events is used (i.e. synchronization happens as soon as an event occurs, but no more than once per interval). Could you confirm it's fixed?
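For the curious, the described behavior amounts to a debounce along these lines; this is an illustrative sketch, not the actual implementation:

```go
package main

import "time"

// runSync stands in for one full reconcile pass.
func runSync() {}

func main() {
	const interval = time.Minute
	events := make(chan struct{}, 1) // informer handlers would do a non-blocking send here

	lastSync := time.Now().Add(-interval) // permit an immediate first sync
	for {
		select {
		case <-events: // a Service/Ingress changed
		case <-time.After(interval): // or the regular interval elapsed
		}
		// An event may arrive too soon after the last sync: hold it back
		// so syncs never happen more than once per interval.
		if wait := interval - time.Since(lastSync); wait > 0 {
			time.Sleep(wait)
		}
		runSync()
		lastSync = time.Now()
	}
}
```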

ipochi commented 4 years ago

I changed --interval to 3m and set the --events flag, yet I am still getting

time="2020-06-25T11:14:21Z" level=error msg="Throttling: Rate exceeded\n\tstatus code: 400,

What's the recommended --interval value to mitigate this?

sheerun commented 4 years ago

@ipochi What version of external-dns are you using?

ipochi commented 4 years ago

@sheerun 0.7.2-debian-10-r20

sheerun commented 4 years ago

Could you write up the steps to reproduce this with the Docker image?

seanmalloy commented 4 years ago

/kind feature

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

george-angel commented 3 years ago

/remove-lifecycle stale

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

george-angel commented 3 years ago

/remove-lifecycle stale

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

george-angel commented 3 years ago

/remove-lifecycle stale

k8s-triage-robot commented 3 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

george-angel commented 3 years ago

/remove-lifecycle stale

fouadsemaan commented 3 years ago

Regarding the --events flag: could another solution be to run one update at startup when --events is turned on? That way, if a service was added or removed while external-dns was down, the change would be picked up when external-dns comes back online. With --events on, do we even need more than one scheduled update?

darkpixel commented 3 years ago

I'm running into this with DigitalOcean. Their API docs say:

5,000 requests per hour
250 requests per minute (5% of the hourly total)

I have around 240 domains pointing into a cluster. Regardless of the interval setting, every time it runs it does one large query to get the domain list, then a query for the records of each of those domains, followed by additional queries to update IPs.

I'd love to see an option to introduce a delay between API requests so the entirety of the run can be spaced out a bit. At the moment, after it finishes the ~240 record queries for all the domains, it figures out what needs to be updated and then starts hammering out API requests to update all the domains, which causes me to hit the limit.
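A client-side token bucket would do it; here is a sketch using golang.org/x/time/rate, where listDomains and fetchRecords are stand-ins for the provider calls, not real external-dns functions:

```go
package main

import (
	"context"

	"golang.org/x/time/rate"
)

// Stand-ins for the DigitalOcean API calls external-dns would make.
func listDomains() []string      { return nil }
func fetchRecords(domain string) {}

func main() {
	// DO allows 250 requests/minute; ~3/second stays comfortably under it.
	limiter := rate.NewLimiter(rate.Limit(3), 1)
	ctx := context.Background()

	for _, domain := range listDomains() {
		if err := limiter.Wait(ctx); err != nil { // blocks to space requests out
			return
		}
		fetchRecords(domain)
	}
}
```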

DO has a "teams" option where you can basically create multiple accounts (each having their own API token), but then you'd have to have multiple clusters and couldn't take advantage of a single managed database instance, etc...

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

george-angel commented 2 years ago

/remove-lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

george-angel commented 2 years ago

/remove-lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

ghostsquad commented 2 years ago

/remove-lifecycle stale

darkpixel commented 2 years ago

Probably not the best solution for everyone, but I ended up working around this by spinning up two $5/mo VPS instances at DigitalOcean in two different regions.

I installed PowerDNS with a sqlite3 backend, enabled the webserver, set an API key, and reconfigured external-dns.

It synced around 350 domains in ~2 seconds. Goodbye, provider rate limits.
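(For anyone wanting to replicate this: it's external-dns's PowerDNS provider, configured along the lines of --provider=pdns --pdns-server=http://your-pdns-host:8081 --pdns-api-key=YOUR_KEY, where the host and key are placeholders; check the PowerDNS tutorial in the docs for your version.)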

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

george-angel commented 1 year ago

/remove-lifecycle stale