kubernetes-sigs / external-dns

Configure external DNS servers (AWS Route53, Google CloudDNS and others) for Kubernetes Ingresses and Services
Apache License 2.0

External DNS not managing multiple zones #4549

Open ophintor opened 1 month ago

ophintor commented 1 month ago

What happened: We're using external-dns v0.14.1 on EKS. We have an environment with 5 different AWS Route 53 zones configured in it. We recently noticed that some records were not being updated.

What you expected to happen: I expected all records in all zones to be updated.

How to reproduce it (as minimally and precisely as possible):

  1. Manually remove all entries from AWS zone Z0005 (apart from SOA and NS)
  2. Deploy helm chart with the following custom values:
    
    domainFilters:
    - ourdomain.internal
    txtOwnerId: Z0001
    txtPrefix: external-dns
    extraArgs:

Workaround:

  1. Comment out/remove all zone-id-filters except Z0005
  2. Redeploy
  3. All new records appear now in Z0005

Environment:

IanMoroney commented 1 month ago

You could try adding the following extraArgs in case it's a timeout issue, or a rate limit issue:

--interval=3m
--request-timeout=60s

ophintor commented 1 month ago

Thanks for the suggestion. Unfortunately neither of those seems to work. I have also tried one or a mix of the following:

aws-batch-change-size
aws-batch-change-size-values
aws-batch-change-interval

Each zone will have around 250 records, so in total it should be able to add about 1250 in the 5 zones. I have tried with 10s intervals and 200 records at a time but no luck so far...
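As a rough check of the numbers above, the following Go sketch works out how many batches a full repopulation would need. It assumes batches are counted per hosted zone, since a Route 53 change batch targets a single zone; the helper name `batchesNeeded` is made up for illustration and is not part of external-dns.

```go
package main

import "fmt"

// batchesNeeded returns how many batches are required to submit `records`
// changes when at most `batchSize` changes fit in one batch (ceiling division).
// Hypothetical helper, not taken from external-dns.
func batchesNeeded(records, batchSize int) int {
	if records <= 0 || batchSize <= 0 {
		return 0
	}
	return (records + batchSize - 1) / batchSize
}

func main() {
	const (
		zones          = 5   // hosted zones in the environment
		recordsPerZone = 250 // approximate number of records per zone
		batchSize      = 200 // batch size that was tried
	)
	perZone := batchesNeeded(recordsPerZone, batchSize)
	fmt.Printf("records in total: %d\n", zones*recordsPerZone)
	fmt.Printf("batches per zone: %d, batches overall: %d\n", perZone, zones*perZone)
}
```

Under these assumptions the whole job is about 10 batches, which suggests that batch sizing alone is unlikely to explain why nothing gets written.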

ophintor commented 1 month ago

I've been looking at the code (keep in mind I don't speak Go...) and I can see a few things that I'm not sure I understand:

aws.go, line 602

// submitChanges takes a zone and a collection of Changes and sends them as a single transaction.
func (p *AWSProvider) submitChanges(ctx context.Context, changes Route53Changes, zones map[string]*profiledZone) error {
    // return early if there is nothing to change
    if len(changes) == 0 {
        log.Info("All records are already up to date")
        return nil
    }

After removing all the records in Z0005 and re-installing the chart, I can see from the logs in the pod that len(changes) == 0, which is not right because there are plenty of changes to be applied to that zone. I can see the 'All records are already up to date' message in the logs.

When I look back at line 585 (func (p *AWSProvider) ApplyChanges(ctx context.Context, changes *plan.Changes) error {), I can see that the list of combined changes is created and passed to the submitChanges function above. However, the list of changes appears to be empty, which it shouldn't be.

At this point, I'm not sure where this function is called from (maybe from aws_sd.go?) or what's in the context. I suspect that the code doesn't cope well with multiple zone IDs that all share the same name, but I can't tell whether that's the case by looking at the code.
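To make that suspicion concrete, here is a toy Go model of what could go wrong when several zones share one name. It is not external-dns code, and nothing in this thread confirms that external-dns indexes zones this way; the `zone` type, both helper functions, and the IDs Z0002 to Z0004 are placeholders for illustration.

```go
package main

import "fmt"

// zone is a hypothetical, simplified stand-in for a hosted zone.
type zone struct {
	ID   string
	Name string
}

// indexByName keys zones by DNS name. When names collide, a later zone
// overwrites an earlier one, so same-named zones collapse into one entry.
func indexByName(zones []zone) map[string]zone {
	out := make(map[string]zone, len(zones))
	for _, z := range zones {
		out[z.Name] = z
	}
	return out
}

// indexByID keys zones by their unique ID, so same-named zones stay distinct.
func indexByID(zones []zone) map[string]zone {
	out := make(map[string]zone, len(zones))
	for _, z := range zones {
		out[z.ID] = z
	}
	return out
}

func main() {
	// Five zones with the same name, as in this environment.
	zones := []zone{
		{ID: "Z0001", Name: "ourdomain.internal"},
		{ID: "Z0002", Name: "ourdomain.internal"},
		{ID: "Z0003", Name: "ourdomain.internal"},
		{ID: "Z0004", Name: "ourdomain.internal"},
		{ID: "Z0005", Name: "ourdomain.internal"},
	}
	fmt.Println("zones visible when keyed by name:", len(indexByName(zones)))
	fmt.Println("zones visible when keyed by ID:  ", len(indexByID(zones)))
}
```

If any step between planning and submitting changes behaved like `indexByName`, only one of the five zones would ever receive changes. Whether that actually happens in external-dns would need to be checked against the real code.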

I have tried all possible combinations of values in the chart, but the only way I can get this to work is by doing the zones one by one manually, which is far from ideal. I'm looking at maybe separating the deployments so as to have one per zone (pretty sure that would solve my issue), but I don't think I can do that without modifying the chart myself.

Any help would be appreciated.

leonardocaylent commented 3 weeks ago

@ophintor Can you test whether this issue also happens with external-dns version 0.13.6?

ophintor commented 3 weeks ago

We actually moved from 0.13.6 to 0.14.1, and then to 0.14.2 because of the issue, in the hope that a newer version would fix it.

For now I have found a workaround that involves creating one deployment per zone, and that works for us, but with the chart as it is I cannot make it work.
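For anyone wanting to try the same workaround, the sketch below shows roughly what the arguments for each per-zone deployment might look like. It is only an illustration: `argsForZone` is a hypothetical helper, the zone IDs Z0002 to Z0004 are placeholders, and reusing the owner ID Z0001 for every deployment is an assumption rather than something confirmed in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// argsForZone builds the external-dns arguments for a deployment that is
// restricted to a single hosted zone. Hypothetical helper for illustration.
func argsForZone(zoneID, domain, ownerID string) []string {
	return []string{
		"--provider=aws",
		"--domain-filter=" + domain,
		"--zone-id-filter=" + zoneID, // limit this deployment to one zone
		"--txt-owner-id=" + ownerID,
		"--txt-prefix=external-dns",
	}
}

func main() {
	// Placeholder zone IDs; only Z0001 and Z0005 are named in this thread.
	zoneIDs := []string{"Z0001", "Z0002", "Z0003", "Z0004", "Z0005"}
	for _, id := range zoneIDs {
		args := argsForZone(id, "ourdomain.internal", "Z0001")
		fmt.Printf("deployment for %s: %s\n", id, strings.Join(args, " "))
	}
}
```

Each deployment sees exactly one zone ID, which sidesteps any ambiguity between zones that share a name.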

Thanks!

leonardocaylent commented 2 weeks ago

@ophintor Thank you for sharing that information. I'll be working on something related to that function for a different, unrelated issue, and I'll check whether I can take a look at this one. Can you post an obfuscated example of the names of the 5 different hosted zones? Something like:

HZ1: mydomain.com
HZ2: internal.mydomain.com
HZ3: us-east-1.internal.mydomain.com
HZ4: external.mydomain.com
HZ5: api.mydomain.com

This would help us to understand and reproduce the issue better.

ophintor commented 2 weeks ago

Hello, the name of the zone is the same for all 5, so it would be something like:

HZ1: thisdomain.local
HZ2: thisdomain.local
HZ3: thisdomain.local
HZ4: thisdomain.local
HZ5: thisdomain.local

Many thanks.