Closed: briantward closed this issue 9 months ago
https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/DNSLimitations.html#limits-api-requests
Looks like batching is recommended to avoid the 5-requests-per-second API rate limit/throttling. But since batches are capped by AWS at 1000 records (or 500 upserts, or perhaps even fewer if the value strings are typically longer than about 62 characters), it seems very odd that the external-dns docs suggest trying 4000.
@briantward are you saying that the number of records per batch can be greater than the specified batch size? (I would have expected only 3 records per DNS address, e.g. 3x37 in your example, although presumably for NodePorts they would all get updated every time any of your 107 nodes goes offline or gets replaced. But if you had 107 IP addresses in your value strings and had to upsert all of the records at once, you'd surely be close to the character-length cap; I wonder if that produces the same error message. I'm curious now: in the Route53 console, how long are the value strings on your TXT and other [probably A] records?)
It appeared that each IP was its own record set. I'll try to get a screenshot of some AWS logs showing that in our nonprod, but my test env doesn't quite have this scale, so I'm hesitant to reproduce it at the moment. This means that since external-dns was measuring batch size by something other than record count, we were way over that 1000-record limit. Choosing a batch size of 1000, for example, does not directly correlate to 1000 records in that batch. So maybe it's a bug? There were some other issues submitted by others who saw the same behavior. Only by drastically reducing the batch size to something like 10 were they able to get record counts under 1000.
I think you're right: each annotated NodePort Service should imply a separate A record for every node (i.e. with the same hostname but a different IP), although they are grouped together into a single record in the AWS console, or a resource record set in the Route53 API. But the resource limit refers to 1000 ResourceRecord elements (not Sets), with UPSERTs counting twice.
I can confirm that in v0.13.1 a NodePort Service (tested with about a dozen nodes) is referred to in its log output as one single record (or rather 3 records, including both versions of the TXT provenance label) even when the entire service is being created or deleted. Every time a node is added or removed from the cluster, external-dns logs another UPSERT of those 3 records. (So with ~100 nodes these "3 records" probably constitute 20% of the AWS request size limit.)
@briantward just curious but why do you want external names for multiple NodePort services on the same cluster? Isn't each different NodePort hostname going to be equivalent (routed identically) anyway? Is there a reason you're not using ClusterIP DNS internally and some kind of gateway (such as a load balancer) externally, to avoid needing clients to know unique port numbers for each service?
I don't think it is possible to separate the individual DNS records into different batches.
Let's say we want to use the Route53 API to mimic assigning a hostname to a NodePort, that is, to manage multiple A entries with the same host domain name.
$ function route ()
{
aws route53 change-resource-record-sets --hosted-zone-id ZZZ --change-batch "{\"Changes\":[{\"Action\":\"$1\",\"ResourceRecordSet\":{\"Name\":\"test.example\",\"Type\":\"A\",\"ResourceRecords\":$2,\"TTL\":300}}]}"
}
$ route CREATE '[{"Value":"4.4.4.4"}]'
$ route CREATE '[{"Value":"1.2.3.4"},{"Value":"3.3.3.3"}]'
InvalidChangeBatch: Tried to create resource record set .. but it already exists
$ route UPSERT '[{"Value":"1.2.3.4"},{"Value":"3.3.3.3"}]'
$ route DELETE '[{"Value":"3.3.3.3"}]'
InvalidChangeBatch: Tried to delete resource record set .. but the values provided do not match the current values
$ route DELETE '[{"Value":"3.3.3.3"},{"Value":"1.2.3.4"}]'
In other words, it seems impossible to update any of the records (within the record set corresponding to a single host name) without re-specifying all of the desired records in a single API request. If you can't incrementally update the set, then you can't partition the records between batches; you can only decide how many record sets (how many services) are batched together.
@benjimin I'm still working on permission to post scrubbed logs. But here's essentially what I see:
An UPSERT is made of 2 type=TXT ResourceRecordSets and 1 type=A ResourceRecordSet. The TXT ResourceRecordSets each contain a single ResourceRecord, while the A ResourceRecordSet contains N ResourceRecords, where N is the number of nodes in the cluster. The latter is what is causing our problem, because we have 107 nodes and 36 NodePorts (36x2 + 36x107 = 3924 ResourceRecords when everything lands in one batch, which is what happens with any batch size much larger than 10).
"changeBatch": {
"changes": [
{
"action": "UPSERT",
"resourceRecordSet": {
"name": "a-myservice.mycluster.com".
"type": "TXT",
"tTL": 300,
"resourceRecords": [
{
"value": "\"heritage=external-dns......"
}
]
}
},
{
"action": "UPSERT",
"resourceRecordSet": {
"name": "myservice.mycluster.com".
"type": "TXT",
"tTL": 300,
"resourceRecords": [
{
"value": "\"heritage=external-dns......"
}
]
}
},
{
"action": "UPSERT",
"resourceRecordSet": {
"name": "myservice.mycluster.com".
"type": "A",
"tTL": 300,
"resourceRecords": [
{
"value": "192.168.1.2"
},
{
"value": "192.168.1.3"
},
{
"value": "192.168.1.4"
},
]
}
},
...
So this seems to be an issue with the AWS API: batch size is expressed in ResourceRecordSets, but there is a separate limit on the actual number of ResourceRecords. There would definitely be a problem, for example, if a single ResourceRecordSet by itself exceeded the 1000-ResourceRecord limit. Thankfully that's not the problem here.
I could see external-dns working around this awkwardness in the AWS API by looping through the current set of changes and sizing each request to however many ResourceRecordSets fit before going over the 1000-ResourceRecord limit. But that raises two thoughts: 1) this is an AWS API problem, so such a change is only a mitigation, and 2) is building that workaround worth it? Does it cost more compute to optimize each batch than to just set the batch size very low? One could argue that a dynamic calculation at least means no one ever has to think about the batch size again. I may take a stab at it if I have time and you think it's worth considering.
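For the sake of discussion, here is a rough sketch of what that record-count-aware packing could look like. The Change struct, the weight function, and packBatches below are simplified stand-ins invented for illustration, not external-dns's actual plan types or logic; the only facts taken from this thread are that record sets cannot be split across requests and that UPSERTs count each record twice against the 1000-ResourceRecord limit. The numbers in main mirror the scenario above (36 services, 107 nodes, two TXT ownership records per service).

package main

import "fmt"

// Change is a simplified stand-in for one Route53 change: a single
// action applied to a whole resource record set.
type Change struct {
    Action  string   // "CREATE", "UPSERT", or "DELETE"
    Name    string   // record set name
    Records []string // one entry per ResourceRecord value in the set
}

// weight is how many ResourceRecord elements a change costs against the
// per-request limit; UPSERTs count each record twice.
func weight(c Change) int {
    n := len(c.Records)
    if c.Action == "UPSERT" {
        n *= 2
    }
    return n
}

// packBatches greedily groups whole record sets into batches so that no
// batch exceeds maxRecords ResourceRecord elements. Record sets are never
// split, so a single oversized set still becomes its own batch rather
// than being dropped silently.
func packBatches(changes []Change, maxRecords int) [][]Change {
    var batches [][]Change
    var current []Change
    used := 0
    for _, c := range changes {
        w := weight(c)
        if len(current) > 0 && used+w > maxRecords {
            batches = append(batches, current)
            current, used = nil, 0
        }
        current = append(current, c)
        used += w
    }
    if len(current) > 0 {
        batches = append(batches, current)
    }
    return batches
}

func main() {
    // Roughly the shape described above: per service, one A record set with
    // one value per node plus two single-value TXT ownership records.
    var changes []Change
    for i := 0; i < 36; i++ {
        name := fmt.Sprintf("svc%d.mycluster.com", i)
        changes = append(changes,
            Change{Action: "UPSERT", Name: name, Records: make([]string, 107)},
            Change{Action: "UPSERT", Name: name, Records: make([]string, 1)},
            Change{Action: "UPSERT", Name: "a-" + name, Records: make([]string, 1)},
        )
    }
    fmt.Println("batches needed:", len(packBatches(changes, 1000)))
}

With that shape of input the ~108 record sets get spread across roughly nine requests instead of one oversized one; whether that extra loop is worth the complexity compared to just documenting a low --aws-batch-change-size is exactly the trade-off described above.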
Finally, yes, I completely agree: a single "gateway" hostname/ELB could suffice for all these services, and that has been suggested. I was told the LoadBalancer-per-Service option caused issues with hitting limits on their AWS SecurityGroup rules. I have also suggested, and confirmed, that using external-dns with externalTrafficPolicy=Local would reduce the number of A records to the nodes where pods are actually hosted, which may also be a reasonable option if we must continue to create custom external DNS hostnames for each application (while accepting the other implications of a Local externalTrafficPolicy).
Thanks!
So I think we would like external-dns to cap each change batch by the number of ResourceRecords it contains (counting UPSERTs twice), rather than only by the number of record sets.
I'm less confident about the behaviour we want if a single record set individually has too many records. I think it should still be attempted, so that users can continue to specify --aws-batch-change-size=1 as a work-around for other faults that cause some batches to fail. Also, I wondered whether related record sets (such as the corresponding TXT records) should be combined in the same change set for the sake of transactional/atomic behaviour, but I think such complication is unnecessary.
@briantward I still think your use-case is a bit unusual, as NodePorts are often discouraged; a common pattern is to use ingresses to control a single gateway (such as an ALB or even just nginx) that proxies traffic to multiple services with different external hostnames. Whereas if you have services A, B, etc. and you want your clients to access them like hostA.mydomain:30001, hostB.mydomain:30002, etc., how do your clients know which port number to use for each service, and are you worried about your DNS record sets being too large for UDP queries?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed
You can:
- /remove-lifecycle stale
- /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed
You can:
- /remove-lifecycle stale
- /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed
You can:
- /remove-lifecycle rotten
- /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed
You can:
- /reopen
- /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
What would you like to be added:
In the Throttling section of the AWS tutorial (https://github.com/kubernetes-sigs/external-dns/blob/master/docs/tutorials/aws.md#throttling), external-dns suggests adjusting --aws-batch-change-size to help manage API rate limiting. Specifically, it suggests increasing the value to a higher number.
We found this terminology somewhat misleading. Perhaps a small code change or a documentation/terminology change would make this better; I'm not sure yet, and I'm open to suggestions. We had to reduce aws-batch-change-size to 10 in order not to hit the AWS error InvalidChangeBatch: [Number of records limit of 1000 exceeded.]. This is because each batch change that external-dns submitted contained thousands of record changes. Speaking with AWS support, we found that 1000 is a hard limit on record changes per batch in Route53.
For background, we had a cluster of 107 nodes and 34-37 NodePorts annotated for external-dns. In the external-dns pod logs we saw TXT and A record CREATEs and UPSERTs. We also saw from the AWS console that the change record set count was high, as it appeared there was a record set for each IP address of each node in the cluster. If my math is correct, that would be 34 x 2 x 107 = 7276 records (if all records were created new, though of course some were probably just updates/reconciles). It appeared to me that it was all being submitted in one batch to AWS, but there is probably more to it than that. We originally tried increasing the batch size to 4000 as in the doc, which continued to fail with the same error. Then we tried 250 and it continued to fail. When we got down to 10, things started working again. If we know the hard limit on records in each AWS batch, why would we not automatically calculate the appropriate batch size based on the record limit? Also, what happens on even busier clusters, where a batch size of 1 might contain more than 1000 record requests? I suspect it would fail.
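As a rough back-of-the-envelope check of those numbers (using the record shape described in the comments above: per service, one A record set with one value per node plus two single-value TXT ownership records, with UPSERTed records counting twice against the 1000-record request limit), the following is illustrative only, not exact external-dns behaviour:

package main

import "fmt"

func main() {
    const (
        nodes          = 107  // one A value per node
        txtPerService  = 2    // two single-value TXT ownership records per service
        services       = 36   // NodePort services managed by external-dns
        awsRecordLimit = 1000 // ResourceRecord elements allowed per change request
        upsertWeight   = 2    // an UPSERTed record counts twice against that limit
    )

    perService := upsertWeight * (nodes + txtPerService) // 218 records per service
    fmt.Println("records per service:", perService)
    fmt.Println("records if every service lands in one batch:", services*perService) // 7848, far over 1000
    fmt.Println("services that fit in one request:", awsRecordLimit/perService)      // 4, i.e. about 12 record sets
}

That is roughly why batch sizes of 4000 or 250 record sets still overflow the limit (everything ends up in a single request), while a batch size around 10 record sets, i.e. three or so services at a time, stayed under it in this case.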
Why is this needed:
To improve the quality of life. :-)