kubernetes-sigs / external-dns

Configure external DNS servers (AWS Route53, Google CloudDNS and others) for Kubernetes Ingresses and Services
Apache License 2.0

ExternalDNS deleting and then creating records. Constantly. Azure. #883

Closed: PirateBread closed this issue 4 years ago

PirateBread commented 5 years ago

As you can see below, this is not ideal behaviour.

The logs from the pod just show records constantly being deleted and updated; they don't contain any information about why it's happening.

I've checked, and my ingress addresses are not disappearing, at least not that I can see.

[screenshot attached]

eyvind commented 5 years ago

We're seeing the same behaviour on GKE (Google).

njuettner commented 5 years ago

What version of external-dns are you currently running?

jhohertz commented 5 years ago

Seems related to #879

eyvind commented 5 years ago

v0.5.10 has the problem, we have reverted to v0.5.9 which does not.

gurumaia commented 5 years ago

Exactly the same here. v0.5.9 works fine, v0.5.10 does this constantly.

lucaghersi commented 5 years ago

We are having the same issue; I posted an example in #543. We will try to revert to v0.5.9 for now.

leonvandebroek commented 5 years ago

I had the same issue this morning. Thankfully you already reported this, as I was aware of the loop but did not know the cause. I've also reverted to v0.5.9 (running AKS 1.11.3 in Azure, by the way).

toutougabi commented 5 years ago

Yep, same issue here. We were saved by having a delete lock on our resource groups in Azure. :)
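
For anyone who wants the same safety net: this refers to a CanNotDelete lock on the resource group holding the DNS zone, which makes Azure refuse deletions of the resources in it. A minimal sketch with the Azure CLI (the lock and resource group names are placeholders):

    az lock create \
      --name protect-dns \
      --resource-group my-dns-resource-group \
      --lock-type CanNotDelete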

jonesbusy commented 5 years ago

We are facing the same issue starting with 0.5.10; 0.5.9 works fine.

uritau commented 5 years ago

Same issue, but only on 0.5.10; reverting to 0.5.9 works perfectly fine.

The following loop happens every minute. Logs from external-dns (debug level):

level=debug msg="Retrieving Azure DNS zones."
level=debug msg="Found 1 Azure DNS zone(s)."
level=debug msg="Retrieving Azure DNS records for zone 'fulldomain.com'."
level=debug msg="Found A record for 'test-app.fulldomain.com' with target 'XX.XX.XX.XX'."
level=debug msg="Found TXT record for 'test-app.fulldomain.com' with target '\"heritage=external-dns,external-dns/owner=prod,external-dns/resource=ingress/test-app/test-app\"'."
level=debug msg="Endpoints generated from ingress: test-app/test-app: [test-app.fulldomain.com 300 IN A XX.XX.XX.XX [] test-app.fulldomain.com 300 IN A XX.XX.XX.XX []]"
level=debug msg="Removing duplicate endpoint test-app.fulldomain.com 300 IN A XX.XX.XX.XX []"
level=debug msg="Retrieving Azure DNS zones."
level=debug msg="Found 1 Azure DNS zone(s)."
level=info msg="Deleting A record named 'test-app' for Azure DNS zone 'fulldomain.com'."
level=info msg="Deleting TXT record named 'test-app' for Azure DNS zone 'fulldomain.com'."
level=info msg="Updating A record named 'test-app' to 'XX.XX.XX.XX' for Azure DNS zone 'fulldomain.com'."
level=info msg="Updating TXT record named 'test-app' to '\"heritage=external-dns,external-dns/owner=prod,external-dns/resource=ingress/test-app/test-app\"' for Azure DNS zone 'fulldomain.com'."
PirateBread commented 5 years ago

Thanks for all the other reports. I tried to downgrade to 0.5.9, and in Azure I'm now getting an API version error.

I then tried 0.5.8, same problem. Went back to 0.5.10, same problem.

I'm really confused now, because up until 10 minutes ago my ExternalDNS was running the :latest tag and was constantly recycling DNS records.

I deleted that deployment (kubectl delete -f external-dns-manifest.yaml) and then recreated it, and now for some reason I'm getting API errors.
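
(Roughly the usual manifest round trip; whether apply or create was used for the re-create doesn't matter here:)

    kubectl delete -f external-dns-manifest.yaml
    kubectl apply -f external-dns-manifest.yaml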

I'm wondering if Azure is somehow rate limiting these requests and it just coincided with my attempt to downgrade?

level=error msg="dns.ZonesClient#ListByResourceGroup: Failure responding to request: StatusCode=400 -- Original Error: autorest/azure: Service returned an error. Status=400 Code=\"InvalidApiVersionParameter\" Message=\"The api-version '2016-04-01' is invalid. The supported versions are '2018-11-01,2018-09-01,2018-08-01,2018-07-01,2018-06-01,2018-05-01,2018-02-01,2018-01-01,2017-12-01,2017-08-01,2017-06-01,2017-05-10,2017-05-01,2017-03-01,2016-09-01,2016-07-01,2016-06-01,2016-02-01,2015-11-01,2015-01-01,2014-04-01-preview,2014-04-01,2014-01-01,2013-03-01,2014-02-26,2014-04'.\""

jhohertz commented 5 years ago

@PirateBread

Could you try this build for Azure to see if it addresses your issue?

registry.opensource.zalan.do/teapot/external-dns:v0.5.10-16-gfe39b46
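
One quick way to try it, assuming the Deployment and its container are both named external-dns (adjust to your manifest):

    kubectl set image deployment/external-dns \
      external-dns=registry.opensource.zalan.do/teapot/external-dns:v0.5.10-16-gfe39b46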

PirateBread commented 5 years ago

@jhohertz

Just deployed v0.5.10-16-gfe39b46 and I'm still seeing the following:

time="2019-02-08T16:05:52Z" level=info msg="Created Kubernetes client https://xxxxx-2b0c5b7a.hcp.uksouth.azmk8s.io:443" time="2019-02-08T16:05:52Z" level=info msg="Using client_id+client_secret to retrieve access token for Azure API." time="2019-02-08T16:05:52Z" level=error msg="dns.ZonesClient#time="2019-02-08T16:05:52Z" level=info msg="Created Kubernetes client https://xxxxxxx-2b0c5b7a.hcp.uksouth.azmk8s.io:443" time="2019-02-08T16:05:52Z" level=info msg="Using client_id+client_secret to retrieve access token for Azure API." time="2019-02-08T16:05:52Z" level=error msg="dns.ZonesClient#ListByResourceGroup: Failure responding to request: StatusCode=400 -- Original Error: autorest/azure: Service returned an error. Status=400 Code=\"InvalidApiVersionParameter\" Message=\"The api-version '2016-04-01' is invalid. The supported versions are '2018-11-01,2018-09-01,2018-08-01,2018-07-01,2018-06-01,2018-05-01,2018-02-01,2018-01-01,2017-12-01,2017-08-01,2017-06-01,2017-05-10,2017-05-01,2017-03-01,2016-09-01,2016-07-01,2016-06-01,2016-02-01,2015-11-01,2015-01-01,2014-04-01-preview,2014-04-01,2014-01-01,2013-03-01,2014-02-26,2014-04'.\"":

If I get a chance this weekend, I'm going to try to reproduce this in a completely fresh environment in my own subscription to rule out some kind of configuration issue, but at this point I can't see what would be wrong.

eyvind commented 5 years ago

I can confirm that v0.5.10-16-gfe39b46 solves the eternal delete/update loop of doom on GKE.

Raffo commented 5 years ago

Thanks for the feedback, we will work on an official release which will probably land tomorrow.

0megam commented 5 years ago

I have a similar problem, but on AWS with version 0.5.11. ExternalDNS is constantly updating the same record every two minutes (--interval=2m):

time="2019-02-19T14:21:45Z" level=error msg="getting records failed: Throttling: Rate exceeded\n\tstatus code: 400, request id: af6f41c7-3451-11e9-bb90-1939f5de72e5"
time="2019-02-19T14:21:52Z" level=error msg="getting records failed: Throttling: Rate exceeded\n\tstatus code: 400, request id: b3bb1bbc-3451-11e9-92a8-118f2457694e"
time="2019-02-19T14:22:10Z" level=info msg="Desired change: UPSERT *.mydomain.com A"
time="2019-02-19T14:22:10Z" level=info msg="Desired change: UPSERT *.mydomain.com TXT"
time="2019-02-19T14:22:10Z" level=info msg="2 record(s) in zone incapsula-qa.de. were successfully updated"
time="2019-02-19T14:24:06Z" level=info msg="Desired change: UPSERT *.mydomain.com A"
time="2019-02-19T14:24:06Z" level=info msg="Desired change: UPSERT *.mydomain.com TXT"
time="2019-02-19T14:24:06Z" level=info msg="2 record(s) in zone incapsula-qa.de. were successfully updated"
time="2019-02-19T14:26:25Z" level=error msg="getting records failed: Throttling: Rate exceeded\n\tstatus code: 400, request id: 5676a7c3-3452-11e9-b59c-ddd6f4af4826"
time="2019-02-19T14:26:25Z" level=info msg="Desired change: UPSERT *.mydomain.com A"
time="2019-02-19T14:26:25Z" level=info msg="Desired change: UPSERT *.mydomain.com TXT"
time="2019-02-19T14:26:25Z" level=info msg="2 record(s) in zone incapsula-qa.de. were successfully updated"

My arguments:

      --log-level=info
      --policy=upsert-only
      --provider=aws
      --registry=txt
      --interval=2m
      --source=service
0megam commented 5 years ago

Also same behavior on 0.5.9.

FridaGo commented 5 years ago

I have the same issue as @omegarus.

jhohertz commented 5 years ago

I'm not seeing the needless updates on AWS that others are experiencing, but one difference may be that I'm not publishing any wildcard DNS records, so I'm wondering if the issue is somewhat specific to wildcards.

FridaGo commented 5 years ago

@jhohertz The DNS records I'm trying to publish don't contain wildcards; they are configured for different ingresses with different service hostnames (e.g. service.internal.domain, app.internal.domain), and I'm still experiencing this issue (I've tried downgrading as far as v0.5.7 and it still happens).

jhohertz commented 5 years ago

I'm sorry @FridaGo, I'm not sure what you're experiencing. This issue and the ones I have recently posted about all relate to a problem that was introduced in v0.5.10.

All I can suggest is to watch the status field of the services you are attaching the DNS records to, to see if something is causing unexpected updates to that status which external-dns might be picking up on. I've seen some ingress configurations cause that kind of thing.
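
One simple way to watch for that (resource names here are placeholders): each change to the object, including its status, prints the full manifest again.

    kubectl get service my-service -o yaml --watch
    kubectl get ingress my-ingress -o yaml --watch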

hjacobs commented 5 years ago

Can we close this issue as v0.5.11 was released?

0megam commented 5 years ago

@jhohertz The status field is constant and not changing.

status:
  loadBalancer:
    ingress:
    - hostname: x8076o593986511e9b2dc86r8d247u18-9901230772.us-west-1.elb.amazonaws.com
aminGwork commented 5 years ago

I'm seeing this same behavior with Infoblox after upgrading from 0.5.9 to 0.5.11. I'm going to try downgrading to 0.5.9 to see if it resolves it. There was so much churn in the recycle bin that it blew up the Infoblox DB. Sample logs attached: dnslog.txt

aslimacc commented 5 years ago

I have the same issue with v0.5.11 on GKE.

Raffo commented 5 years ago

For me on AWS, running both v0.5.9 and v0.5.11, I haven't seen such a problem. Maybe it has something to do with what @jhohertz mentioned?

ghost commented 5 years ago

Found a solution to the problem. If you have another external-dns instance that writes the same TXT ownership records, the first external-dns will delete the records of the second and vice versa. You should change the value of "txtOwnerId" for each external-dns deployment.
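
In other words, every external-dns instance sharing a zone needs its own owner ID in the TXT registry. A minimal sketch of the relevant args for two instances (the instance names are illustrative):

    # instance A
    args:
    - --registry=txt
    - --txt-owner-id=instance-a

    # instance B
    args:
    - --registry=txt
    - --txt-owner-id=instance-b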

Raffo commented 5 years ago

@medanasslim great, thanks for posting an update.

Ping @PirateBread and @aslimacc: do you have additional info to share, and/or are you still experiencing this issue?

aslimacc commented 5 years ago

Works for me

jerome-lecorvaisier commented 5 years ago

Experiencing the same issue with Cloudflare and both registry.opensource.zalan.do/teapot/external-dns:v0.5.9 and registry.opensource.zalan.do/teapot/external-dns:v0.5.12.

...
    spec:
      containers:
      - args:
        - --source=ingress
        - --domain-filter=my-domain.com
        - --provider=cloudflare
        - --cloudflare-proxied
        env:
        - name: CF_API_KEY
          value: 
        - name: CF_API_EMAIL
          value: 
        image: registry.opensource.zalan.do/teapot/external-dns:v0.5.9
        imagePullPolicy: Always
...
ghost commented 5 years ago

I am on Cloudflare and, as I said above, you should add "txt-owner-id".

Example below:

- args:
  - --log-level=info
  - --registry=txt
  - --interval=1m
  - --txt-owner-id=instance1

jerome-lecorvaisier commented 5 years ago

> I am on Cloudflare and, as I said above, you should add "txt-owner-id".
>
> Example below:
>
> - args:
>   - --log-level=info
>   - --registry=txt
>   - --interval=1m
>   - --txt-owner-id=instance1

Thank you for the advice but this doesn't fix the issue. This is useful if you have multiple clusters using the same DNS zone.

aslimacc commented 5 years ago

Can you share your logs, please, so we can see the behavior of the app?

jerome-lecorvaisier commented 5 years ago

> Can you share your logs, please, so we can see the behavior of the app?

Sure, you can see logs here https://github.com/kubernetes-incubator/external-dns/issues/992

fejta-bot commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot commented 5 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

anguswilliams commented 5 years ago

> I'm seeing this same behavior with Infoblox after upgrading from 0.5.9 to 0.5.11. […]

I'm also seeing this with the Infoblox provider running v0.5.15. Removing my TTL annotations, as per a previous comment, resolved the issue.
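
For context, the TTL annotation being referred to is presumably the standard external-dns one set on the Service or Ingress, something like this (the value is illustrative):

      metadata:
        annotations:
          external-dns.alpha.kubernetes.io/ttl: "300"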

fejta-bot commented 5 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

k8s-ci-robot commented 5 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes-incubator/external-dns/issues/883#issuecomment-542455151):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> Send feedback to sig-testing, kubernetes/test-infra and/or [fejta](https://github.com/fejta).
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
heprotecbuthealsoattac commented 5 years ago

Hi, sorry to open up this ticket again, but I've faced the same issue. Once I removed all sources other than istio-gateway, the problem ~~disappeared~~.

Edit: actually it didn't. I'm investigating it further.

mlushpenko commented 4 years ago

Seeing this as well with Istio gateways and the TransIP provider. We do have two instances of external-dns for the same zone, but with different txt-owner-id values, so that shouldn't be a problem.

fejta-bot commented 4 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

k8s-ci-robot commented 4 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/external-dns/issues/883#issuecomment-558319643):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> Send feedback to sig-testing, kubernetes/test-infra and/or [fejta](https://github.com/fejta).
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
Xnyle commented 4 years ago

/remove-lifecycle rotten

Xnyle commented 4 years ago

/reopen

k8s-ci-robot commented 4 years ago

@Xnyle: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to [this](https://github.com/kubernetes-sigs/external-dns/issues/883#issuecomment-590223220):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
valery-zhurbenko commented 4 years ago

txt-owner-id

works for me

dewnull commented 3 years ago

FYI, I had the same problem, and as others suggested, the issue was that I had two different external-dns deployments with the same txt-owner-id. They were deleting each other's records. As a temporary fix I used --policy=upsert-only.

gennady-voronkov commented 3 years ago

This issue is reproducible with the Infoblox provider as well; it constantly does the same create-delete cycle every minute. Please advise on a solution.

Logs:

time="2021-08-05T08:42:31Z" level=debug msg="Endpoints generated from ingress: test/demo: [demo.test..com 0 IN A 10.10.10.10 [] demo.test..com 0 IN A 10.10.10.10 []]"
time="2021-08-05T08:42:31Z" level=debug msg="Removing duplicate endpoint demo.test.***.com 0 IN A 10.10.10.10 []"

Args:

  interval: "1m"
  logLevel: debug
  logFormat: text
  policy: upsert-only
  registry: "txt"
  txtPrefix: "ing"
  txtSuffix: ""
  txtOwnerId: "kcc-ing"