Open sephethus opened 1 year ago
Update: we downgraded all the way back to the August version:

External DNS
Chart version: 1.7.1
App version: 0.10.2
Image tag: k8s.gcr.io/external-dns/external-dns:v0.10.2
Still happens in that version. What could have changed that would have caused this to spaz out and continuously try to recreate existing entries?
We figured out that there was another cluster with external-dns installed and it was going in and deleting all the entries that were created, despite having a completely different txt owner id. None of it makes any sense right now, it shouldn't do that either. All our clusters from dev to prod have external-dns installed and they don't do this, but that cluster with a totally different txtOwnerId was doing it. Something is still broken here but we narrowed down the culprit at least.
Getting this when upgrading the helm chart from 1.12.2 to 1.13.0. In my case, external-dns:0.13.4 runs a refresh with no changes, while external-dns:0.13.5 constantly removes and adds the DNS records.
Things I've noticed: the new version adds the record type into the TXT record name, resulting in {txtPrefix}a-mysubdomain, whereas the previous version just used {txtPrefix}mysubdomain.
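For context, newer external-dns releases maintain two ownership TXT records per endpoint: the legacy name, and a new-format name that embeds the record type. Assuming a txt-prefix of prefix- and an A record for mysubdomain.example.com, the registry entries would look roughly like this (names and owner id are illustrative):

```
; legacy-format ownership TXT record
prefix-mysubdomain.example.com.    TXT  "heritage=external-dns,external-dns/owner=my-owner-id"
; new-format ownership TXT record, with the record type ("a") in the name
prefix-a-mysubdomain.example.com.  TXT  "heritage=external-dns,external-dns/owner=my-owner-id"
```

A version or instance that only recognizes the legacy name may disagree with a newer one about who owns a record, which can lead to the two fighting over it.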
Partial log (this will loop continuously for all entries):
2023-06-09T20:10:08+01:00 time="2023-06-09T19:10:08Z" level=info msg="Changing record." action=CREATE record=madalien.com ttl=1 type=CNAME zone=232588efc58673f108728bb7c906c451
2023-06-09T20:10:10+01:00 time="2023-06-09T19:10:10Z" level=info msg="Changing record." action=CREATE record=k8s.madalien.com ttl=1 type=TXT zone=232588efc58673f108728bb7c906c451
2023-06-09T20:11:05+01:00 time="2023-06-09T19:11:05Z" level=info msg="Changing record." action=CREATE record=madalien.com ttl=1 type=A zone=232588efc58673f108728bb7c906c451
2023-06-09T20:11:06+01:00 time="2023-06-09T19:11:06Z" level=error msg="failed to create record: DNS Validation Error (1004)" action=CREATE record=madalien.com ttl=1 type=A zone=232588efc58673f108728bb7c906c451
2023-06-09T20:11:09+01:00 time="2023-06-09T19:11:09Z" level=info msg="Changing record." action=CREATE record=k8s.madalien.com ttl=1 type=TXT zone=232588efc58673f108728bb7c906c451
2023-06-09T20:11:10+01:00 time="2023-06-09T19:11:10Z" level=error msg="failed to create record: Record already exists. (81057)" action=CREATE record=k8s.madalien.com ttl=1 type=TXT zone=232588efc58673f108728bb7c906c451
2023-06-09T20:11:12+01:00 time="2023-06-09T19:11:12Z" level=info msg="Changing record." action=DELETE record=madalien.com ttl=1 type=CNAME zone=232588efc58673f108728bb7c906c451
2023-06-09T20:11:16+01:00 time="2023-06-09T19:11:16Z" level=info msg="Changing record." action=DELETE record=k8s.madalien.com ttl=1 type=TXT zone=232588efc58673f108728bb7c906c451
2023-06-09T20:12:05+01:00 time="2023-06-09T19:12:05Z" level=info msg="Changing record." action=CREATE record=madalien.com ttl=1 type=CNAME zone=232588efc58673f108728bb7c906c451
2023-06-09T20:12:09+01:00 time="2023-06-09T19:12:09Z" level=info msg="Changing record." action=CREATE record=k8s.madalien.com ttl=1 type=TXT zone=232588efc58673f108728bb7c906c451
I have the same problem too...
failed to create record: DNS Validation Error (1004)
time="2023-06-16T00:37:36Z" level=info msg="Changing record." action=DELETE record=hello-world-ingress.mydomain.com.ar ttl=1 type=A zone=c878d0959b80baf39244ea04f3bcecba
time="2023-06-16T00:37:37Z" level=info msg="Changing record." action=CREATE record=hello-world-ingress.mydomain.com.ar ttl=1 type=A zone=c878d0959b80baf39244ea04f3bcecba
time="2023-06-16T00:37:37Z" level=error msg="failed to create record: DNS Validation Error (1004)" action=CREATE record=hello-world-ingress.mydomain.com.ar ttl=1 type=A zone=c878d0959b80baf39244ea04f3bcecba
time="2023-06-16T00:37:37Z" level=info msg="Changing record." action=UPDATE record=hello-world-ingress.mydomain.com.ar ttl=1 type=TXT zone=c878d0959b80baf39244ea04f3bcecba
time="2023-06-16T00:37:37Z" level=info msg="Changing record." action=UPDATE record=a-hello-world-ingress.mydomain.com.ar ttl=1 type=TXT zone=c878d0959b80baf39244ea04f3bcecba
This is my manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mydomaincomar-external-dns
  namespace: external-dns
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mydomaincomar-external-dns
  template:
    metadata:
      labels:
        app: mydomaincomar-external-dns
    spec:
      containers:
        - name: external-dns
          image: registry.k8s.io/external-dns/external-dns:v0.13.5
          args:
            - '--source=ingress'
            - '--domain-filter=mydomain.com.ar'
            - '--provider=cloudflare'
            - '--cloudflare-proxied'
            - '--cloudflare-dns-records-per-page=5000'
            - '--log-level=debug'
            - '--txt-owner-id=aks-itools-iprd-ue'
          env:
            - name: CF_API_TOKEN
              valueFrom:
                secretKeyRef:
                  name: cloudflare
                  key: mydomaincomar-token
                  optional: false
          resources:
            limits:
              cpu: 10m
              memory: 32Mi
            requests:
              cpu: 5m
              memory: 16Mi
The same problem happens with v0.13.4 too.
I also encountered this problem and it was alleviated by downgrading helm chart to 1.12.2 (Thanks @dafzor).
I have two environments using a service annotation to configure domain names, both using the Cloudflare DNS provider and each configured with its own txt owner id (production and sandbox).
provider: cloudflare
policy: sync
txtOwnerId: production / sandbox
apiVersion: v1
kind: Service
metadata:
  name: ...
  annotations:
    external-dns.alpha.kubernetes.io/hostname: domain.com / sandbox.domain.com
    external-dns.alpha.kubernetes.io/cloudflare-proxied: "false"
With the hostname set to sandbox.domain.com, everything runs normally. However, if I change the domain to domain.com, it constantly updates the DNS records:

time="2023-11-14T12:37:55Z" level=info msg="Changing record." action=UPDATE record=domain.com ttl=1 type=A zone=...
time="2023-11-14T12:37:55Z" level=info msg="Changing record." action=UPDATE record=domain.com ttl=1 type=TXT zone=...
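For anyone comparing two environments like this: the ownership TXT value is what ties a DNS record to a specific external-dns instance via its owner id. With the values above, each environment should be writing something along these lines (contents illustrative, resource names assumed):

```
; written by the production instance
domain.com.          TXT  "heritage=external-dns,external-dns/owner=production,external-dns/resource=service/default/my-service"
; written by the sandbox instance
sandbox.domain.com.  TXT  "heritage=external-dns,external-dns/owner=sandbox,external-dns/resource=service/default/my-service"
```

If the planner keeps seeing a difference between the desired state and what it reads back from the provider (for example a TTL or proxied flag Cloudflare reports differently than it was set), it will re-issue the same UPDATE on every sync interval.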
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
/remove-lifecycle stale
@sephethus Just one thought. I too faced the same issue under similar conditions: two clusters with external-dns pods. Your comment made me run an experiment: scale down the external-dns pod in the old cluster. Subsequently, the record creation succeeded. Could you try the same?

Regarding your point about not seeing this behavior across multiple environments, there is a catch I would like you to think about. You're familiar with the --policy flag. The new releases suggest setting this flag to upsert-only. If your external-dns pod in the older cluster doesn't have this flag set, it has permission to delete records. And if --domain-filter points to the same subdomain for both the new and old external-dns pods, the older pod can delete records that the newer pod creates. Most likely the domain differed between your dev, stage, and prod environments, which may be why you didn't see this behaviour there.
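As a concrete sketch of that mitigation, the older cluster's deployment could run with upsert-only so it can never delete anything. The flag names below are the standard external-dns flags; the values are illustrative:

```yaml
args:
  - --source=ingress
  - --provider=cloudflare
  # upsert-only: this instance may create and update records,
  # but never delete them, so it cannot wipe records created
  # by an external-dns instance in another cluster
  - --policy=upsert-only
  # unique owner id per cluster, recorded in the ownership TXT records
  - --txt-owner-id=old-cluster
```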
I'm also facing this on version 0.14.2.
I can confirm this also exists on 0.15.0
Just noticed this issue as well, but only for a domain that has no subdomain (e.g. foo.com). external-dns works fine for all the other hosts it's syncing, which are all subdomains (e.g. sub.bar.com).
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
/remove-lifecycle rotten
What happened: In going through countless logs all the way back to the 7th of March, it appears that on that day external-dns began creating and deleting entries repeatedly. Something changed. Look here:
What you expected to happen: One creation and nothing else, if the application is deleted or hostname changed only then should the entry be deleted or updated.
How to reproduce it (as minimally and precisely as possible): Install external-dns on a cluster where deployments are configured to use it, with Cloudflare as the provider. It's an automatic system; I don't know the details. Up until now, external-dns has been a magical black box full of mysteries for me, and Go is not my language.
Anything else we need to know?:
Environment: GCP/GKE 1.21 and Cloudflare
Our values override configuration for helm install:
Everything broke when external-dns (bitnami/external-dns 6.14.3) was pulled down on the 7th of March; this is where we traced the logs back to. What changed?