kubernetes-sigs / external-dns

Configure external DNS servers (AWS Route53, Google CloudDNS and others) for Kubernetes Ingresses and Services
Apache License 2.0
7.71k stars 2.57k forks

v0.13.5 trying to create already existing record #3754

Open dmitriishaburov opened 1 year ago

dmitriishaburov commented 1 year ago

What happened:

After updating to v0.13.5, external-dns enters CrashLoopBackOff while trying to create an already-existing DNS record in Route53 that it already manages.

What you expected to happen:

Not crashing.

How to reproduce it (as minimally and precisely as possible):

Probably:

Anything else we need to know?:

We have the following DNS records for the EKS LoadBalancer, which were created by external-dns v0.13.4:

| Name | Type | Routing | Alias | Value |
|------|------|---------|-------|-------|
| cname-tempo-distributor.prelive.domain | TXT | Simple | No | "heritage=external-dns,external-dns/owner=default,external-dns/resource=service/tempo/tempo-distributed-distributor" |
| tempo-distributor.prelive.domain | A | Simple | Yes | k8s-tempo-tempodis-xxxx.elb.eu-west-1.amazonaws.com. |
| tempo-distributor.prelive.domain | TXT | Simple | No | "heritage=external-dns,external-dns/owner=default,external-dns/resource=service/tempo/tempo-distributed-distributor" |
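For context, the TXT records above are ownership markers written by external-dns's txt registry. A minimal illustrative sketch (not part of external-dns itself) of how that heritage value breaks down into labels:

```python
# Illustrative only: parse the ownership TXT value that external-dns's
# txt registry stores next to each managed record, as seen above.

def parse_heritage(txt_value: str) -> dict:
    """Split a heritage TXT value into its key/value labels."""
    fields = {}
    for part in txt_value.strip('"').split(","):
        key, _, value = part.partition("=")
        fields[key] = value
    return fields

labels = parse_heritage(
    '"heritage=external-dns,external-dns/owner=default,'
    'external-dns/resource=service/tempo/tempo-distributed-distributor"'
)
print(labels["heritage"])            # external-dns
print(labels["external-dns/owner"])  # default
```

The `external-dns/owner` label is what the registry compares against its `--txt-owner-id` to decide whether it owns a record.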

After updating to v0.13.5, external-dns tries to recreate them and fails:

time="2023-06-29T11:27:47Z" level=info msg="Desired change: CREATE a-tempo-distributor.prelive.domain TXT [Id: /hostedzone/ID]"
time="2023-06-29T11:27:47Z" level=info msg="Desired change: CREATE tempo-distributor.prelive.domain  A [Id: /hostedzone/ID]"
time="2023-06-29T11:27:47Z" level=info msg="Desired change: CREATE tempo-distributor.prelive.domain  TXT [Id: /hostedzone/ID]"
time="2023-06-29T11:27:47Z" level=error msg="Failure in zone domain. [Id: /hostedzone/ID] when submitting change batch: InvalidChangeBatch: [Tried to create resource record set [name='tempo-distributor.prelive.domain.', type='A'] but it already exists, Tried to create resource record set [name='tempo-distributor.prelive.domain.', type='TXT'] but it already exists]\n\tstatus code: 400, request id: ID"
time="2023-06-29T11:27:48Z" level=fatal msg="failed to submit all changes for the following zones: [/hostedzone/ID]"

Command line args (for both versions):

    Args:
      --log-level=info
      --log-format=text
      --interval=1m
      --source=service
      --source=ingress
      --policy=upsert-only
      --registry=txt
      --provider=aws
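For anyone deploying via the Helm chart, the args above correspond roughly to the following values. The value names here are assumptions based on the kubernetes-sigs chart and are not taken from this thread; verify against the chart's documentation:

```yaml
# Hypothetical values.yaml sketch for the kubernetes-sigs external-dns
# chart; value names are assumptions, check the chart docs before use.
logLevel: info
logFormat: text
interval: 1m
policy: upsert-only
registry: txt
provider: aws
sources:
  - service
  - ingress
```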

Environment:

zeqk commented 1 year ago

This may be related; it looks like a similar problem: https://github.com/kubernetes-sigs/external-dns/issues/3484 https://github.com/kubernetes-sigs/external-dns/issues/3706

iodeslykos commented 1 year ago

We encountered the same issue described by the OP after upgrading the Helm Chart to version 1.13.0 (external-dns v0.13.5) from 1.12.2 (external-dns v0.13.4): Pods entered CrashLoopBackOff after repeated failures to create records that already existed in the target Route53 Hosted Zone.

The current working solution is to downgrade to Helm Chart v1.12.2 (external-dns v0.13.4).

aardbol commented 1 year ago

Same issue with Google DNS. Downgrading to Helm Chart v1.12.2 (v0.13.4) works.

jbilliau-rcd commented 1 year ago

We are having the same issue: the pod straight up crashes on a normal error we have seen all the time, through multiple version upgrades over the years.

alfredkrohmer commented 1 year ago

This seems to be caused by this change: https://github.com/kubernetes-sigs/external-dns/pull/3009

szuecs commented 1 year ago

> cname-tempo-distributor.prelive.domain TXT Simple - No "heritage=external-dns,external-dns/owner=default,external-dns/resource=service/tempo/tempo-distributed-distributor"
> tempo-distributor.prelive.domain A Simple - Yes k8s-tempo-tempodis-xxxx.elb.eu-west-1.amazonaws.com.
> tempo-distributor.prelive.domain TXT Simple - No "heritage=external-dns,external-dns/owner=default,external-dns/resource=service/tempo/tempo-distributed-distributor"

What do you mean by Yes/No?

iodeslykos commented 1 year ago

> cname-tempo-distributor.prelive.domain TXT Simple - No "heritage=external-dns,external-dns/owner=default,external-dns/resource=service/tempo/tempo-distributed-distributor"
> tempo-distributor.prelive.domain A Simple - Yes k8s-tempo-tempodis-xxxx.elb.eu-west-1.amazonaws.com.
> tempo-distributor.prelive.domain TXT Simple - No "heritage=external-dns,external-dns/owner=default,external-dns/resource=service/tempo/tempo-distributed-distributor"
>
> What do you mean by Yes/No?

Yes and No are part of the record when pulled via the AWS CLI utility from Route53 and represent whether or not the record is an Alias, which is a Route53-specific extension to DNS functionality.
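To illustrate the distinction: in the Route53 API response, alias records carry an `AliasTarget` object instead of a `ResourceRecords` list. A minimal sketch (the record contents below are hypothetical, modeled on the records in this issue):

```python
# Sketch: how Route53 API output distinguishes alias records from
# plain ones. Alias record sets have an "AliasTarget" key instead of
# "ResourceRecords"; the example records below are hypothetical.

def is_alias(record_set: dict) -> bool:
    """Return True if a Route53 resource record set is an Alias record."""
    return "AliasTarget" in record_set

alias_record = {
    "Name": "tempo-distributor.prelive.domain.",
    "Type": "A",
    "AliasTarget": {
        "HostedZoneId": "ZEXAMPLE123",  # hosted zone of the ELB target
        "DNSName": "k8s-tempo-tempodis-xxxx.elb.eu-west-1.amazonaws.com.",
        "EvaluateTargetHealth": True,
    },
}

plain_record = {
    "Name": "tempo-distributor.prelive.domain.",
    "Type": "TXT",
    "TTL": 300,
    "ResourceRecords": [{"Value": '"heritage=external-dns,..."'}],
}

print(is_alias(alias_record), is_alias(plain_record))  # True False
```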

CAR6807 commented 11 months ago

Any update on this? Is this fixed in 1.14.0?

k8s-triage-robot commented 8 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

krmichelos commented 8 months ago

/remove-lifecycle stale

CAR6807 commented 8 months ago

Bump

k8s-triage-robot commented 5 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

FernandoMiguel commented 5 months ago

/remove-lifecycle stale

iodeslykos commented 5 months ago

To ensure that the maintainers understand this is still an issue: it is.

We did upgrade past v1.12.x, but it required a significant number of record deletions to allow external-dns to recreate the records it had previously managed without issue. Now everything is working fine.

If there is a conflict in DNS we would expect an error message, but not for the entire external-dns deployment to enter CrashLoopBackOff.

CAR6807 commented 4 months ago

Bump. This is preventing us from addressing high vulnerabilities fixed in newer versions. We would like to avoid having to delete existing records, given the risk of an outage. Record conflicts should not cause the external-dns controller to crash outright.

mlazowik commented 2 months ago

I'm guessing this is fixed, at least for some providers, by https://github.com/kubernetes-sigs/external-dns/pull/4166?