Open sephethus opened 1 year ago
Update: we downgraded all the way back to the August version:

External DNS
Chart version: 1.7.1
App version: 0.10.2
Image tag: k8s.gcr.io/external-dns/external-dns:v0.10.2
Still happens in that version. What could have changed that would have caused this to spaz out and continuously try to recreate existing entries?
We figured out that there was another cluster with external-dns installed and it was going in and deleting all the entries that were created, despite having a completely different txt owner id. None of it makes any sense right now, it shouldn't do that either. All our clusters from dev to prod have external-dns installed and they don't do this, but that cluster with a totally different txtOwnerId was doing it. Something is still broken here but we narrowed down the culprit at least.
Getting this when upgrading the helm chart from 1.12.2 to 1.13.0. In my case, external-dns:0.13.4 runs a refresh with no changes, while external-dns:0.13.5 constantly removes and adds the DNS records.
Things I've noticed: the new version adds the record type into the TXT record name, resulting in {txtPrefix}a-mysubdomain, whereas the previous version just used {txtPrefix}mysubdomain.
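For context, newer external-dns releases maintain two ownership TXT records per endpoint: the legacy name, and a new-format name that embeds the record type. Assuming a txt-prefix of prefix- and an A record for mysubdomain.example.com, the registry entries would look roughly like this (names and owner id are illustrative):

```
; legacy-format ownership TXT record
prefix-mysubdomain.example.com.    TXT  "heritage=external-dns,external-dns/owner=my-owner-id"
; new-format ownership TXT record, with the record type ("a") in the name
prefix-a-mysubdomain.example.com.  TXT  "heritage=external-dns,external-dns/owner=my-owner-id"
```

A version or instance that only recognizes the legacy name may disagree with a newer one about who owns a record, which can lead to the two fighting over it.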
Partial log (this will loop continuously for all entries):
2023-06-09T20:10:08+01:00 time="2023-06-09T19:10:08Z" level=info msg="Changing record." action=CREATE record=madalien.com ttl=1 type=CNAME zone=232588efc58673f108728bb7c906c451
2023-06-09T20:10:10+01:00 time="2023-06-09T19:10:10Z" level=info msg="Changing record." action=CREATE record=k8s.madalien.com ttl=1 type=TXT zone=232588efc58673f108728bb7c906c451
2023-06-09T20:11:05+01:00 time="2023-06-09T19:11:05Z" level=info msg="Changing record." action=CREATE record=madalien.com ttl=1 type=A zone=232588efc58673f108728bb7c906c451
2023-06-09T20:11:06+01:00 time="2023-06-09T19:11:06Z" level=error msg="failed to create record: DNS Validation Error (1004)" action=CREATE record=madalien.com ttl=1 type=A zone=232588efc58673f108728bb7c906c451
2023-06-09T20:11:09+01:00 time="2023-06-09T19:11:09Z" level=info msg="Changing record." action=CREATE record=k8s.madalien.com ttl=1 type=TXT zone=232588efc58673f108728bb7c906c451
2023-06-09T20:11:10+01:00 time="2023-06-09T19:11:10Z" level=error msg="failed to create record: Record already exists. (81057)" action=CREATE record=k8s.madalien.com ttl=1 type=TXT zone=232588efc58673f108728bb7c906c451
2023-06-09T20:11:12+01:00 time="2023-06-09T19:11:12Z" level=info msg="Changing record." action=DELETE record=madalien.com ttl=1 type=CNAME zone=232588efc58673f108728bb7c906c451
2023-06-09T20:11:16+01:00 time="2023-06-09T19:11:16Z" level=info msg="Changing record." action=DELETE record=k8s.madalien.com ttl=1 type=TXT zone=232588efc58673f108728bb7c906c451
2023-06-09T20:12:05+01:00 time="2023-06-09T19:12:05Z" level=info msg="Changing record." action=CREATE record=madalien.com ttl=1 type=CNAME zone=232588efc58673f108728bb7c906c451
2023-06-09T20:12:09+01:00 time="2023-06-09T19:12:09Z" level=info msg="Changing record." action=CREATE record=k8s.madalien.com ttl=1 type=TXT zone=232588efc58673f108728bb7c906c451
I have the same problem too...
failed to create record: DNS Validation Error (1004)
time="2023-06-16T00:37:36Z" level=info msg="Changing record." action=DELETE record=hello-world-ingress.mydomain.com.ar ttl=1 type=A zone=c878d0959b80baf39244ea04f3bcecba
time="2023-06-16T00:37:37Z" level=info msg="Changing record." action=CREATE record=hello-world-ingress.mydomain.com.ar ttl=1 type=A zone=c878d0959b80baf39244ea04f3bcecba
time="2023-06-16T00:37:37Z" level=error msg="failed to create record: DNS Validation Error (1004)" action=CREATE record=hello-world-ingress.mydomain.com.ar ttl=1 type=A zone=c878d0959b80baf39244ea04f3bcecba
time="2023-06-16T00:37:37Z" level=info msg="Changing record." action=UPDATE record=hello-world-ingress.mydomain.com.ar ttl=1 type=TXT zone=c878d0959b80baf39244ea04f3bcecba
time="2023-06-16T00:37:37Z" level=info msg="Changing record." action=UPDATE record=a-hello-world-ingress.mydomain.com.ar ttl=1 type=TXT zone=c878d0959b80baf39244ea04f3bcecba
This is my manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mydomaincomar-external-dns
  namespace: external-dns
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mydomaincomar-external-dns
  template:
    metadata:
      labels:
        app: mydomaincomar-external-dns
    spec:
      containers:
        - name: external-dns
          image: registry.k8s.io/external-dns/external-dns:v0.13.5
          args:
            - '--source=ingress'
            - '--domain-filter=mydomain.com.ar'
            - '--provider=cloudflare'
            - '--cloudflare-proxied'
            - '--cloudflare-dns-records-per-page=5000'
            - '--log-level=debug'
            - '--txt-owner-id=aks-itools-iprd-ue'
          env:
            - name: CF_API_TOKEN
              valueFrom:
                secretKeyRef:
                  name: cloudflare
                  key: mydomaincomar-token
                  optional: false
          resources:
            limits:
              cpu: 10m
              memory: 32Mi
            requests:
              cpu: 5m
              memory: 16Mi
The same problem happens with v0.13.4 too.
I also encountered this problem and it was alleviated by downgrading helm chart to 1.12.2 (Thanks @dafzor).
I have two environments using a service annotation to configure domain names, both using the Cloudflare DNS provider and each configured with its own txt owner id (production and sandbox).
provider: cloudflare
policy: sync
txtOwnerId: production / sandbox
apiVersion: v1
kind: Service
metadata:
  name: ...
  annotations:
    external-dns.alpha.kubernetes.io/hostname: domain.com / sandbox.domain.com
    external-dns.alpha.kubernetes.io/cloudflare-proxied: "false"
With the hostname set to sandbox.domain.com, everything runs normally. However, if I change the domain to domain.com, it constantly updates the DNS records:

time="2023-11-14T12:37:55Z" level=info msg="Changing record." action=UPDATE record=domain.com ttl=1 type=A zone=...
time="2023-11-14T12:37:55Z" level=info msg="Changing record." action=UPDATE record=domain.com ttl=1 type=TXT zone=...
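For anyone comparing two environments like this: the ownership TXT value is what ties a DNS record to a specific external-dns instance via its owner id. With the values above, each environment should be writing something along these lines (contents illustrative, resource names assumed):

```
; written by the production instance
domain.com.          TXT  "heritage=external-dns,external-dns/owner=production,external-dns/resource=service/default/my-service"
; written by the sandbox instance
sandbox.domain.com.  TXT  "heritage=external-dns,external-dns/owner=sandbox,external-dns/resource=service/default/my-service"
```

If the planner keeps seeing a difference between the desired state and what it reads back from the provider (for example a TTL or proxied flag Cloudflare reports differently than it was set), it will re-issue the same UPDATE on every sync interval.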
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
/remove-lifecycle stale
@sephethus Just one thought. I too faced the same issue under similar conditions: two clusters with external-dns pods. Your comment made me run an experiment: scale down the external-dns pod in the old cluster. Subsequently, the record creation succeeded. Could you try the same?

Regarding your point about not seeing this behavior across multiple environments, there is a catch I would like you to think about. You're familiar with the --policy flag. The new releases suggest setting this flag to upsert-only. If your external-dns pod in the older cluster doesn't have this flag set, it has permission to delete records. And if --domain-filter points to the same subdomain for both the new and old external-dns pods, the older pod can delete records that the newer pod creates. Most likely the domain differed between your dev, stage, and prod environments, which may be why you didn't see this behaviour there.
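As a concrete sketch of that mitigation, the older cluster's deployment could run with upsert-only so it can never delete anything. The flag names below are the standard external-dns flags; the values are illustrative:

```yaml
args:
  - --source=ingress
  - --provider=cloudflare
  # upsert-only: this instance may create and update records,
  # but never delete them, so it cannot wipe records created
  # by an external-dns instance in another cluster
  - --policy=upsert-only
  # unique owner id per cluster, recorded in the ownership TXT records
  - --txt-owner-id=old-cluster
```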
I'm also facing this on version 0.14.2.
I can confirm this also exists on 0.15.0
Just noticed this issue as well, but only for a domain that has no subdomain (e.g. foo.com). external-dns works fine for all the other hosts it's syncing, which are all subdomains (e.g. sub.bar.com).
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
/remove-lifecycle rotten
What happened: In going through countless logs all the way back to the 7th of March, it appears that on that day external-dns began creating and deleting entries repeatedly. Something changed. Look here:
What you expected to happen: One creation and nothing else, if the application is deleted or hostname changed only then should the entry be deleted or updated.
How to reproduce it (as minimally and precisely as possible): Install external-dns on a cluster where deployments are configured to use it, with Cloudflare as the provider. It's an automatic system; I don't know the details. Up until now, external-dns has been a magical black box full of mysteries for me, and Go is not my language.
Anything else we need to know?:
Environment: GCP/GKE 1.21 and Cloudflare
Our values override configuration for helm install:
Everything broke when external-dns (bitnami/external-dns 6.14.3) was pulled down on the 7th of March; this is where we traced the logs back to. What changed?