Some new TXT records are not being cleaned up, causing an "InvalidChangeBatch" error

born4new commented 1 year ago

What happened:

After deleting some ingress resources, it seems that the new TXT record is not being cleaned up, but the other two DNS entries (the A record and the legacy TXT record) are being cleaned up. When searching for DNS records in AWS53, this is what we see:

Searching for `<our-dns-name>.`

[]

Searching for `a-<our-dns-name>.`

    {
        "Name": "a-<our-dns-name>.",
        "Type": "TXT",
        "TTL": 300,
        "ResourceRecords": [
            {
                "Value": "\"heritage=external-dns,external-dns/owner=<our-owner-string>,external-dns/resource=ingress/<our-ingress>\""
            }
        ]
    }

This later on causes an issue when we redeploy the application, as external-dns tries to create those three DNS entries (A record, legacy TXT and new TXT):

time="2022-11-23T14:06:14Z" level=info msg="Desired change: CREATE a-<our-dns-name> TXT [Id: /hostedzone/<redacted>]"
time="2022-11-23T14:06:14Z" level=info msg="Desired change: CREATE <our-dns-name> A [Id: /hostedzone/<redacted>]"
time="2022-11-23T14:06:14Z" level=info msg="Desired change: CREATE <our-dns-name> TXT [Id: /hostedzone/<redacted>]"
time="2022-11-23T14:06:14Z" level=error msg="Failure in zone <our-dns-zone>. [Id: /hostedzone/<redacted>]"
time="2022-11-23T14:06:14Z" level=error msg="InvalidChangeBatch: [Tried to create resource record set [name='a-<our-dns-name>.', type='TXT'] but it already exists]\n\tstatus code: 400, request id: d65bc8e2-4055-4d9f-8412-4653debd76ff"

What you expected to happen:

The new TXT record should be cleaned up in the first place, or maybe we could also replace the TXT record if it already exists, or have an option to do so.

How to reproduce it (as minimally and precisely as possible):

I do not know how to reproduce this issue easily, but I'm more than happy to provide as much debugging info as needed.

Anything else we need to know?:

N/A

Environment:

External-DNS version (use external-dns --version): 0.13.1
DNS provider: AWS
Others:

rymai commented 1 year ago

This definitely looks similar to https://github.com/kubernetes-sigs/external-dns/issues/3007, https://github.com/kubernetes-sigs/external-dns/issues/2421, and https://github.com/kubernetes-sigs/external-dns/issues/2793.

benjimin commented 1 year ago

@born4new does setting --aws-batch-change-size=1 resolve your problem? (i.e., is it purely the batching that is broken?)

born4new commented 1 year ago

does setting --aws-batch-change-size=1 resolve your problem?

We haven't specifically tried a size of 1, but we have tried a few values (e.g. 20, 200, 1000), none of them helped.

The fix for us was to go back to an external-dns version below 0.12.0, so that external-dns wouldn't be aware of the newly introduced TXT record. This seems to indicate a problem in the way the new TXT records are cleaned up...

JonathanLachapelle commented 1 year ago

We are facing the exact same issue.

JonathanLachapelle commented 1 year ago

What happened:

After deleting some ingress resources, it seems that the new TXT record is not being cleaned up, but the other two DNS entries (the A record and the legacy TXT record) are being cleaned up. When searching for DNS records in AWS53, this is what we see:

Searching for <our-dns-name>.
[]
Searching for a-<our-dns-name>.
    {
        "Name": "a-<our-dns-name>.",
        "Type": "TXT",
        "TTL": 300,
        "ResourceRecords": [
            {
                "Value": "\"heritage=external-dns,external-dns/owner=<our-owner-string>,external-dns/resource=ingress/<our-ingress>\""
            }
        ]
    }
This later on causes an issue when we redeploy the application, as external-dns tries to create those three DNS entries (A record, legacy TXT and new TXT):
time="2022-11-23T14:06:14Z" level=info msg="Desired change: CREATE a-<our-dns-name> TXT [Id: /hostedzone/<redacted>]"
time="2022-11-23T14:06:14Z" level=info msg="Desired change: CREATE <our-dns-name> A [Id: /hostedzone/<redacted>]"
time="2022-11-23T14:06:14Z" level=info msg="Desired change: CREATE <our-dns-name> TXT [Id: /hostedzone/<redacted>]"
time="2022-11-23T14:06:14Z" level=error msg="Failure in zone <our-dns-zone>. [Id: /hostedzone/<redacted>]"
time="2022-11-23T14:06:14Z" level=error msg="InvalidChangeBatch: [Tried to create resource record set [name='a-<our-dns-name>.', type='TXT'] but it already exists]\n\tstatus code: 400, request id: d65bc8e2-4055-4d9f-8412-4653debd76ff"
What you expected to happen:

The new TXT record should be cleaned up in the first place, or maybe we could also replace the TXT record if it already exists, or have an option to do so.

How to reproduce it (as minimally and precisely as possible):

I do not know how to reproduce this issue easily, but I'm more than happy to provide as much debugging info as needed.

Anything else we need to know?:

N/A

Environment:

External-DNS version (use external-dns --version): 0.13.1

DNS provider: AWS

Others:

Does it happen on all record or just sometime?

born4new commented 1 year ago

Does it happen on all record or just sometime?

@JonathanLachapelle It was happening on some records only.

xavidop commented 1 year ago

we faced the same issue today: We are using AWS Route53 and our External DNS version is 0.12.2

{"level":"error","msg":"InvalidChangeBatch: [Tried to create resource record set [name='cname-runtime-api-dev-amy.development.voiceflow.com.', type='TXT'] but it already exists]\n\tstatus code: 400, request id: 4db33c47-f34f-4a36-8d60-b2cb0750578d","time":"2022-12-14T11:11:20Z"}

ArturChe commented 1 year ago

I have faced the same issue after updating the external-dns version from 0.12.0 to 0.13.1. And instead of syncing with previously created TXT record graylog.<domain> it tries to create cname-graylog.<domain> and it fails with output below:

time="2022-12-14T11:49:54Z" level=error msg="InvalidChangeBatch: [The request contains an invalid set of changes for a resource record set 'TXT cname-graylog.<domain>.', The request contains an invalid set of changes for a resource record set 'TXT cname-mongodb.<domain>.', The request contains an invalid set of changes for a resource record set 'TXT cname-tcp.graylog.<domain>.']\n\tstatus code: 400, request id: <Id>"
time="2022-12-14T11:49:54Z" level=info msg="Desired change: CREATE cname-graylog.<domain> TXT [Id: /hostedzone/<hostedzone>]"
...

IKohli09 commented 1 year ago

I have faced the same issue. I got a new cluster up with external chart version 6.12.1 which is using image 0.13.1 But it errors out with InvalidChangeBatch when trying to create cname-<domain> entry.

Also, when I switch back to version 0.11.0, it keeps on deleting and creating the route53 records instead of updating them. here, I am using --upsert-policy.

Desired change: CREATE 123.dev.cloud A "Desired change: CREATE 123.dev.cloud TXT Applying provider record filter for domains Desired change: CREATE 123.dev.cloud A Desired change: CREATE 123.dev.cloud TXT

It's a huge blocker.

liad5h commented 1 year ago

We are experiencing the same issue with version 0.13.1 and kubernetes 1.21 or higher. In our case when the issue happens, external-dns stops processing requests until we go to AWS and manually remove the leftovers.

logs:

time="2023-02-01T12:55:01Z" level=error msg="Failure in zone qa.controlup.com. [Id: /hostedzone/XXXXXXXXXX]"
time="2023-02-01T12:55:01Z" level=error msg="InvalidChangeBatch: [Tried to create resource record set [name='cname-x.com.', type='TXT'] but it already exists]\n\tstatus code: 400, request id: 8b8e55e1-efe0-452d-96da-af65ff122fca"
time="2023-02-01T12:55:01Z" level=error msg="failed to submit all changes for the following zones: [/hostedzone/XXXXXXXXXX]"

msvticket commented 1 year ago

I'm having the same problem with version 0.13.1 and --aws-batch-change-size=100. I tried --aws-batch-change-size=1 and started to get warnings like

time="2023-02-08T15:32:34Z" level=warning msg="Total changes for xxx.yyy.zzz exceeds max batch size of 1, total changes: 2"

and the errors as described above kept coming.

So I tried --aws-batch-change-size=2 and that has actually resolved the problem for me.

jbilliau-rcd commented 1 year ago

Same problem as well. I wish there was a "force-overwrite" options where we could just tell external-dns to overwrite records; we have multiple clusters who have this error and are seemingly stuck. The worse part is good, new ingresses never have their DNS records created since they get batched up with these bogus retries.

martinohmann commented 1 year ago

We're facing the same issue with v0.13.2 and the suggested batch size changes do not work:

With --aws-batch-change-size=1: it tries to create the already existing TXT record, which fails. It does not even attempt to create the A record, presumably because the first batch change within the sync interval failed. This does not resolve itself eventually and continues like this in every sync interval.
With --aws-batch-change-size=2: it tries to create the A record and the already existing TXT record in a batch and this fails. Same behaviour as above, it's stuck.

The only option we have is to either manually create the A record, or to delete the existing TXT records so that external-dns can properly recreate everything.

The expected behaviour would be to not attempt to create the TXT records again (if anything, it should upsert existing records).

Update: from what I can see, there's already a change in master which might partially fix this (https://github.com/kubernetes-sigs/external-dns/commit/7dd84a589d4725ccf25d94f8d71b0146fee4bfcc), but it's still unreleased.

cyril94440 commented 1 year ago

Same problem here...

Kulagin-G commented 1 year ago

The same problem after updating external-dns from 0.10.2 to 0.13.4

There are some details about the environment:

Provider: aws
EKS: 1.24.0

There are details about the issue:

At the star we have 3 records:

(A) - alias for LB, host.example.com
(TXT) - old-style TXT for backward-compatibility, host.example.com
(TXT) - new-style TXT cname-host.example.com

Test - Removing new-style TXT cname-host.example.com Result: Looks ok, record was restored. time="2023-06-08T13:05:04Z" level=info msg="Desired change: CREATE cname-host.example.com. TXT [Id: /hostedzone/xxx]"
Test - Removing old-style TXT host.example.com Result: Looks ok, record was restored. time="2023-06-08T13:07:05Z" level=debug msg="Adding host.example.com. [Id: /hostedzone/xxx]"
Test - Removing old-style TXT and new-style TXT Result: records were not restored, and no issues or attempts in the logs.
Test - Removing alias host.example.com and both TXT Result: ok, all 3 records were restored. time="2023-06-08T13:18:18Z" level=debug msg="Adding host.example.com. to zone xxx. [Id: /hostedzone/xxx]" time="2023-06-08T13:18:18Z" level=debug msg="Adding host.example.com. to zone xxx. [Id: /hostedzone/xxx]" time="2023-06-08T13:18:18Z" level=debug msg="Adding cname-host.example.com. to zone xxx. [Id: /hostedzone/Z010946512D3RO332W8MB]" time="2023-06-08T13:18:19Z" level=info msg="Desired change: CREATE host.example.com TXT [Id: /hostedzone/xxx]" time="2023-06-08T13:18:19Z" level=info msg="Desired change: CREATE host.example.com A [Id: /hostedzone/xxx]" time="2023-06-08T13:18:19Z" level=info msg="Desired change: CREATE cname-host.example.com TXT [Id: /hostedzone/xxx]"
Test - Removing alias host.example.com only Result: failure, alias was not restored. time="2023-06-08T13:22:23Z" level=error msg="Failure in zone xxx. [Id: /hostedzone/Z010946512D3RO332W8MB] when submitting change batch: InvalidChangeBatch: [Tried to create resource record set [name='cname-host.example.com.', type='TXT'] but it already exists, Tried to create resource record set [name='host.example.com.', type='TXT'] but it already exists]\n\tstatus code: 400, request id: xxx" time="2023-06-08T13:22:24Z" level=error msg="failed to submit all changes for the following zones: [/hostedzone/xxx]"

I guess force override won't lead to issues with Rate exceeded from AWS API, because a case when we lost an alias record is very rare, for us at least. But still, the current behavior is pretty uncomfortable and non-expected, I want to be 100% sure that all our records will be restored automatically if any shit happens.

Additionally, it's weird I don't see any logs in case p.3

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

ddieulivol commented 7 months ago

/remove-lifecycle stale

k8s-triage-robot commented 4 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

CameronMackenzie99 commented 4 months ago

/remove-lifecycle stale

rookiehelm commented 4 months ago

I'm seeing this issue when installing v0.14.1 on a brand new EKS 1.25.

sileyang-sf commented 4 months ago

Same issue happened in our EKS cluster in version 1.26.

rookiehelm commented 4 months ago

Hi guys, I was able to resolve my errors. Couple of pointers that helped:

First thing is that the external-dns repo has various branches tagged with release versions. But the release versions don't correspond directly to the image version hosted on GCR.
My issue got resolved after I used the following image: registry.k8s.io/external-dns/external-dns:v0.14.1. Please also follow the instructions using the branch tagged v0.14.1 (and not master or some other branch)
In my case my cluster was setup using terraform scripts, as I needed to deploy kubeflow. I accidentally have configured the IRSA using 'eksctl' command which was incorrect. The docs suggest directly creating the serviceaccount via 'kubectl'. Please be careful here. I had to manually delete the previous SA and re-create my SA using the right commands. Post that everything worked fine.
I also needed to configure 'ingress-nginx' controller first (and not later) as 'external-dns' needs to work with the loadbalancer during the creation of the records (correct me if I'm wrong here)

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 day ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

kubernetes-sigs / external-dns