kubernetes-sigs / external-dns

Configure external DNS servers (AWS Route53, Google CloudDNS and others) for Kubernetes Ingresses and Services
Apache License 2.0
7.58k stars, 2.54k forks

external-dns stops updating route53 on error #421

Open lemaral opened 6 years ago

lemaral commented 6 years ago

Issue: an (obviously broken) service hostname annotation produces the following error from Route53: "InvalidChangeBatch: FATAL problem: DomainLabelEmpty (Domain label is empty) encountered with '.sudbdomain.domain.com' status code: 400, request id: 115b8642-e047-11e7-83a0-a9e81aaef2a3", and apparently this prevents any further modification of the Route53 zone by external-dns. If this is the expected behaviour (any error in the batch means no further modification can be processed), it does not seem very robust as it stands, and it is also difficult to diagnose. I'm interested in confirmation and workarounds :) Thanks
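
For context, Route53 rejects the whole change batch because the annotated hostname contains an empty DNS label (the leading dot in '.sudbdomain.domain.com'). A minimal, hypothetical Go check, not part of external-dns, illustrates what the API is objecting to:

```go
package main

import (
	"fmt"
	"strings"
)

// hasEmptyLabel reports whether a hostname contains an empty DNS label,
// e.g. the leading dot in ".sudbdomain.domain.com".
func hasEmptyLabel(hostname string) bool {
	for _, label := range strings.Split(strings.TrimSuffix(hostname, "."), ".") {
		if label == "" {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(hasEmptyLabel(".sudbdomain.domain.com")) // true: the leading dot is an empty label
	fmt.Println(hasEmptyLabel("valid.domain.com"))       // false
}
```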

ideahitme commented 6 years ago

I see it the opposite way: allowing only a partial update is a way to enter a scenario where the state of the records is unknown and hence unmanageable. External-DNS has to guarantee the safety of updates, e.g. not overwriting records belonging to other parties and created via other means. If we do not do batch/atomic updates, this is no longer guaranteed. We could introduce an insecure mode, but with that kind of mode you are on your own, and in the case of a misbehaving system, diagnosing would become really difficult (as opposed to checking the logs and deleting the resource with the apparently invalid hostname).

srflaxu40 commented 6 years ago

Yeah, I am seeing only the initially created records being updated, not the existing ones.

linki commented 6 years ago

@lemaral That is correct and thanks for bringing it up.

I kind of disagree with @ideahitme here: a single broken Endpoint should not lead to the overall malfunction of ExternalDNS for literally all other modifications. There's no inherent reason why one broken Endpoint should affect other, completely unrelated Endpoints.

However, there's a strong connection between an Endpoint's A/CNAME/ALIAS record and its corresponding TXT ownership record, so these two must be modified in a transaction.
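
To illustrate that pairing, here is a minimal sketch using hypothetical types (not external-dns's internal API) of keeping a record change and its TXT ownership change in the same group, so they always end up in the same atomic change batch:

```go
package main

import "fmt"

// Change is a simplified stand-in for one Route53 record change.
type Change struct {
	Action string // CREATE, UPSERT, DELETE
	Name   string // DNS name
	Type   string // A, CNAME, TXT, ...
}

// groupWithOwnership buckets changes by DNS name so that an endpoint's
// A/CNAME change and its TXT ownership change land in the same bucket.
// (In real external-dns the TXT record name can carry a prefix, so the
// grouping key would need to account for that.)
func groupWithOwnership(changes []Change) map[string][]Change {
	groups := make(map[string][]Change)
	for _, c := range changes {
		groups[c.Name] = append(groups[c.Name], c)
	}
	return groups
}

func main() {
	changes := []Change{
		{"CREATE", "app.example.com", "CNAME"},
		{"CREATE", "app.example.com", "TXT"}, // ownership record
		{"CREATE", "api.example.com", "A"},
		{"CREATE", "api.example.com", "TXT"},
	}
	for name, group := range groupWithOwnership(changes) {
		fmt.Printf("%s -> %v\n", name, group)
	}
}
```

Each group can then fail or succeed as a unit without touching unrelated Endpoints.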

lemaral commented 6 years ago

@linki, thanks for the follow-up; indeed, the A/CNAME/ALIAS record and the corresponding TXT record must stay in the same batch to keep operations atomic. My concern (as you highlighted) is specifically about one failing endpoint stopping all further modifications. I understand the rationale for stacking multiple endpoints in the same batch as an optimisation, but in case of an error the valid endpoints should still be processed correctly, IMHO. If there is agreement on this and no available bandwidth among the current contributors, I'm willing to take this on and see how it can be done.

linki commented 6 years ago

@szuecs @Raffo @ideahitme @mikkeloscar Have a look at @lemaral's comment and let's see if we can agree on this.

szuecs commented 6 years ago

@lemaral @linki the problem is more complicated. What if you want to switch the record from an old, working application to a new one that is somehow broken? And is that change within one transaction?

Then I would also like to discuss how the implementation should look, because we have to be efficient with AWS calls (the fewer AWS calls the better, because of rate limiting); I guess the same is true for other cloud providers. I thought about splitting the changeset in two and applying both halves (binary search), then iterating on the failing half and splitting it again until the broken record is found, so the resource in Kubernetes can be marked as "external-dns broken record".
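
A rough sketch of that split-and-retry idea, with hypothetical types and a stand-in submit function rather than real Route53 calls:

```go
package main

import "fmt"

// Change is a simplified stand-in for one Route53 record change. In a real
// implementation each element would be a whole ownership group (record plus
// its TXT record), so a split never separates the two.
type Change struct{ Name string }

// submit stands in for a single change-batch call to the provider.
func submit(batch []Change) error {
	for _, c := range batch {
		if c.Name == ".subdomain.example.com" { // empty label: the provider rejects the whole batch
			return fmt.Errorf("InvalidChangeBatch: DomainLabelEmpty: %q", c.Name)
		}
	}
	return nil
}

// applyWithSplit tries the whole batch first; on error it splits the batch in
// half and recurses, returning the changes that could not be applied.
func applyWithSplit(batch []Change) (failed []Change) {
	if len(batch) == 0 {
		return nil
	}
	if submit(batch) == nil {
		return nil
	}
	if len(batch) == 1 {
		// A single broken change: report it (e.g. mark the Kubernetes
		// resource) instead of blocking the rest of the zone.
		return batch
	}
	mid := len(batch) / 2
	failed = append(failed, applyWithSplit(batch[:mid])...)
	failed = append(failed, applyWithSplit(batch[mid:])...)
	return failed
}

func main() {
	batch := []Change{{"a.example.com"}, {".subdomain.example.com"}, {"b.example.com"}}
	fmt.Println("could not apply:", applyWithSplit(batch))
}
```

One caveat the following comments touch on: changes that belong together (a delete/create pair moving a record, or a record and its TXT ownership record) must stay in the same half, otherwise the split itself can cause inconsistency or downtime.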

@lemaral I would be happy to get PRs from you for this change, but please try to describe here how it would work, so that we can agree on it and spot design issues early.

mikkeloscar commented 6 years ago

> What if you want to switch the record from an old, working application to a new one that is somehow broken? And is that change within one transaction?

I might be misunderstanding the point you are making, but switching a record from one application to another would just mean changing the hostname of two resources (e.g. two Ingresses); external-dns would pick this up and eventually arrive at the new desired state, which in this case might be a broken state if the hostname defined by the user is invalid.

lemaral commented 6 years ago

@szuecs, indeed, binary search seems the most efficient approach with respect to API rate limiting; we only need to ensure we're not looping forever. As for the question you brought up, can you elaborate? I don't see why adding this error handling could make things worse.

szuecs commented 6 years ago

@lemaral Think about a changeset that moves a record: 1 delete, 1 create. If we apply the delete, then get rate limited, and only afterwards do the create, we will have created downtime for our users.

gtie commented 6 years ago

This issue has bitten us in an unexpected way. While a binary search would indeed be optimal in terms of API call usage, a simpler solution would be to implement it as a fallback strategy: try a batch update first, and switch to individual pairs of updates if the batch fails.
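
A compact sketch of that fallback, again with hypothetical types and a stand-in submit function: one attempt for the full batch, and per-pair calls only when it fails:

```go
package main

import "fmt"

// Change is a simplified stand-in for one Route53 record change.
type Change struct{ Name, Type string }

// submit stands in for one change-batch API call to the provider.
func submit(batch []Change) error {
	for _, c := range batch {
		if c.Name == ".subdomain.example.com" {
			return fmt.Errorf("InvalidChangeBatch: DomainLabelEmpty: %q", c.Name)
		}
	}
	return nil
}

// applyWithFallback submits everything in one batch and, only if that fails,
// falls back to one call per record/TXT ownership pair.
func applyWithFallback(pairs [][]Change) {
	var flat []Change
	for _, p := range pairs {
		flat = append(flat, p...)
	}
	if submit(flat) == nil {
		return // common case: a single API call applies the whole batch
	}
	for _, p := range pairs {
		if err := submit(p); err != nil {
			// Only the broken pair is skipped; the rest of the zone still updates.
			fmt.Println("skipping broken pair:", p, "error:", err)
		}
	}
}

func main() {
	applyWithFallback([][]Change{
		{{"app.example.com", "A"}, {"app.example.com", "TXT"}},
		{{".subdomain.example.com", "A"}, {".subdomain.example.com", "TXT"}},
	})
}
```

In the happy path this costs the same single API call as today; the extra calls only happen in the (rare) error case.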

zioalex commented 6 years ago

Hi guys, I have a similar problem:

time="2018-08-02T13:50:08Z" level=error msg="InvalidChangeBatch: The request contains an invalid set of changes for a resource record set 'A test.example..com.' status code: 400, request id: f7a6fff3-965a-11e8-8239-47abe2865262"

Though I do not see any real problem with the changeset, and no other update can be done. Any update here?

szuecs commented 6 years ago

@zioalex which version are you using, and can you post some logs from when external-dns starts?

zioalex commented 6 years ago

Hi @szuecs, here are some logs:

time="2018-08-02T13:15:19Z" level=info msg="config: &{Master: KubeConfig: Sources:[service ingress] Namespace: FQDNTemplate: Compatibility: Provider:aws GoogleProject: DomainFilter:[] AzureConfigFile:/etc/kubernetes/azure.json AzureResourceGroup: Policy:upsert-only Registry: txt TXTOwnerID:my_domain.com TXTPrefix: Interval:1m0s Once:false DryRun:false LogFormat:text MetricsAddress::7979 Debug:false}"
time="2018-08-02T13:15:19Z" level=info msg="Connected to cluster at https://100.64.0.1:443"
time="2018-08-02T13:15:21Z" level=info msg="Changing records: CREATE { Action: "CREATE",

The version in use is v0.4.3.

cristim commented 6 years ago

Since the error message returned by the AWS provider contains the offending DNS record, I think we could just parse it, drop that record from the batch, and run another upsert API call with the rest of the records from the initial batch.
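
A hedged sketch of that parsing idea; the message format is taken from the error quoted at the top of this issue and is not a stable AWS contract, so treat it as best-effort:

```go
package main

import (
	"fmt"
	"regexp"
)

// offenderRe matches the record name quoted in messages like
// "... DomainLabelEmpty (Domain label is empty) encountered with '<name>' ...".
var offenderRe = regexp.MustCompile(`encountered with '([^']+)'`)

// offendingRecord tries to pull the rejected record name out of an
// InvalidChangeBatch error message; ok is false if the format is unexpected.
func offendingRecord(errMsg string) (name string, ok bool) {
	m := offenderRe.FindStringSubmatch(errMsg)
	if len(m) != 2 {
		return "", false
	}
	return m[1], true
}

func main() {
	msg := "InvalidChangeBatch: FATAL problem: DomainLabelEmpty (Domain label is empty) encountered with '.sudbdomain.domain.com' status code: 400"
	fmt.Println(offendingRecord(msg))
}
```

Not every InvalidChangeBatch message names a single record (the "invalid set of changes" variant above does not), so this would need a fallback such as the batch splitting discussed earlier.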

cnuber commented 5 years ago

We are hitting this issue and currently have 57 Ingresses that use the FQDNs we are monitoring with external-dns. On smaller clusters we typically have 10-15 Ingresses monitored and do not hit this issue.

szuecs commented 5 years ago

@cnuber thanks for adding more information here!

@linki @njuettner we should work on a fix; the numbers are quite low, so I bet many people hit this issue.

fejta-bot commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

szuecs commented 5 years ago

/remove-lifecycle stale

fejta-bot commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

szuecs commented 5 years ago

/remove-lifecycle stale

alfredkrohmer commented 4 years ago

Hey all. See my pull request #1209 for a potential fix (I also included example logs of the error case described in this issue).

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

szuecs commented 4 years ago

/remove-lifecycle stale

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot commented 4 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot commented 4 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

k8s-ci-robot commented 4 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/external-dns/issues/421#issuecomment-632950881):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> Send feedback to sig-testing, kubernetes/test-infra and/or [fejta](https://github.com/fejta).
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

alfredkrohmer commented 3 years ago

/remove-lifecycle rotten

alfredkrohmer commented 3 years ago

/reopen

k8s-ci-robot commented 3 years ago

@devkid: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to [this](https://github.com/kubernetes-sigs/external-dns/issues/421#issuecomment-813619494):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

peimanja commented 2 years ago

Can we reopen this? It is still a valid issue.

szuecs commented 1 year ago

/reopen

k8s-ci-robot commented 1 year ago

@szuecs: Reopened this issue.

In response to [this](https://github.com/kubernetes-sigs/external-dns/issues/421#issuecomment-1324095379):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

szuecs commented 1 year ago

/important-soon

nitrocode commented 1 year ago

Can we reopen this issue please and exclude it from the bot?

It would be nice to have a flag such as --early-exit=true which defaults to true and can be overridden to false.

Edit: I see the implementation from https://github.com/kubernetes-sigs/external-dns/pull/1209 separates failed changes from successful changes and reruns the failed changes in a separate batch. Very clever! I'll try the latest version, 0.13.5, again and see if it resolves my issue. Thank you.

linki commented 1 year ago

/reopen

k8s-ci-robot commented 1 year ago

@linki: Reopened this issue.

In response to [this](https://github.com/kubernetes-sigs/external-dns/issues/421#issuecomment-1669162111):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 6 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

nitrocode commented 6 months ago

Is there a way to exclude this ticket from getting auto-closed? It's a notable bug.

nitrocode commented 6 months ago

/remove-lifecycle rotten

alfredkrohmer commented 6 months ago

I think this might be solved with #4166, no? At least it seems to be implemented already for AWS, OCI and Azure.

Liz4v commented 4 months ago

@alfredkrohmer Well, it's more of a workaround than a fix. Now it just loops over these spurious records, but still with no explanation as to why it decided we need them. I have no Ingress or Service with an empty name, so why is it trying to do that?

time="2024-05-06T20:33:16Z" level=info msg="Applying provider record filter for domains: [green.staging.example.com. .green.staging.example.com.]"
time="2024-05-06T20:33:16Z" level=info msg="Desired change: CREATE .green.staging.example.com A [Id: /hostedzone/ZABCABCABC]"
time="2024-05-06T20:33:16Z" level=info msg="Desired change: CREATE .green.staging.example.com TXT [Id: /hostedzone/ZABCABCABC]"
time="2024-05-06T20:33:16Z" level=info msg="Desired change: CREATE cname-.green.staging.example.com TXT [Id: /hostedzone/ZABCABCABC]"
time="2024-05-06T20:33:16Z" level=error msg="Failure in zone green.staging.example.com. [Id: /hostedzone/ZABCABCABC] when submitting change batch: InvalidChangeBatch: FATAL problem: DomainLabelEmpty (Domain label is empty) encountered with '.green.staging.example.com'\n\tstatus code: 400, request id: 3065fc1c-06cc-4550-b230-652e7934d00f"
time="2024-05-06T20:33:17Z" level=error msg="Failed to do run once: soft error\nfailed to submit all changes for the following zones: [/hostedzone/ZABCABCABC]"
k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

nitrocode commented 1 month ago

/remove-lifecycle rotten

k8s-triage-robot commented 1 week ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

nitrocode commented 1 week ago

/remove-lifecycle rotten