lucianjon closed this issue 1 month ago
The behavior changed here https://github.com/kubernetes/ingress-nginx/pull/4671
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-contributor-experience at kubernetes/community.
/close
@fejta-bot: Closing this issue.
Yep, this just got me too while working on a new cluster. NGINX Ingress essentially DoSed CoreDNS, which caused all kinds of weirdness in the cluster.
Edit: Running k8s.gcr.io/ingress-nginx/controller:v1.1.1
I am getting this issue too.
Running k8s.gcr.io/ingress-nginx/controller:v1.1.0
I'm also affected by this issue. Hoping for some activity on it. /reopen
@VsevolodSauta: You can't reopen an issue/PR unless you authored it or you are a collaborator.
I'm also getting this issue: k8s.gcr.io/ingress-nginx/controller:v1.2.0
Same issue.
Why is this closed?
I am also seeing the same issue - has anyone here resolved it or has a workaround?
Even without Kubernetes, if a process makes calls to an unresolvable hostname in an infinite loop, there will be an impact.
Thanks, Long
Same issue
dns.lua:152: dns_lookup(): failed to query the DNS server for
I am also having the same issue with v1.3.1 in some clusters
Same problem, keep watching
+1
We are experiencing identical issues on both GKE and AKS clusters while using ingress-nginx versions 1.9.1 and 1.9.3.
Occasionally, we encounter situations where the backend resides outside the cluster. The "ExternalName" record is dynamically resolved using endpoints controlled by Consul. However, if it happens to be a single backend service, or the last one, and it deregisters for a reason such as a reboot, the "ExternalName" encounters a non-existent CNAME record. This, in turn, causes ingress-nginx to go completely crazy with errors such as:
2023/10/26 18:16:18 [error] 432#432: *18134 [lua] dns.lua:152: dns_lookup(): failed to query the DNS server for my-not-existing-record.example.com:
server returned error code: 3: name error
server returned error code: 3: name error, context: ngx.timer
In situations where there are only a few occurrences, this behavior can sometimes be obscured by the sheer volume of logs. However, when a substantial number of endpoints become unreachable all at once, compounded by the current scale of Ingress-NGINX pods (which, in our scenario, includes both internal and external-facing ingress classes), the problem escalates significantly and places a severe burden on our CoreDNS servers, potentially overwhelming them.
What I would like to see is a restriction on the number of resolve attempts, a limit on the resolve-retry rate, or, even more desirable, a back-off mechanism.
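For illustration, a minimal sketch of the kind of ExternalName Service that ends up in this state once the last backend deregisters; the Service name and namespace are hypothetical, and the hostname is the unresolvable record from the log above:

apiVersion: v1
kind: Service
metadata:
  name: consul-backed-service    # hypothetical name
  namespace: default             # hypothetical namespace
spec:
  type: ExternalName
  # Once Consul deregisters the last backend, this CNAME no longer resolves
  # (NXDOMAIN / "name error"), and the controller keeps retrying the lookup.
  externalName: my-not-existing-record.example.com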
We're experiencing the same behavior. With a few 'invalid' or 'temporarily invalid' ExternalName Service backend configurations, we noticed tons of messages like this and a huge number of DNS calls.
We tested the same scenario with Traefik as an ingress controller: no issue at all, just a 502 response on the client call.
/reopen
@tao12345666333: Reopened this issue.
This issue is currently awaiting triage.
If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
/reopen
@neerfri: You can't reopen an issue/PR unless you authored it or you are a collaborator.
@tao12345666333, could you please reopen the ticket?
I have a PR to fix this that is almost ready to be merged, waiting for approval from @grounded042: https://github.com/kubernetes/ingress-nginx/pull/10989
Thanks!
/reopen
@longwuyuan: Reopened this issue.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
@longwuyuan / @tao12345666333, could you please reopen the ticket? The related PR is still waiting for final review.
Thanks
We're having this issue too. Besides the DNS load (which stays reasonable here), it also has the side effect of overloading the nginx processes themselves, resulting in very high CPU usage for very low traffic.
Unfortunately, we don't have anyone on board with the necessary C/C++ skills to try and look into the issue, and we rely on the volunteer maintainers of this project to assist.
We stand available for testing if needed.
NGINX Ingress controller version: v0.41.2
Kubernetes version (use kubectl version):
Environment:
uname -a: Linux ip-10-60-10-234 5.4.0-1024-aws #24-Ubuntu SMP Sat Sep 5 06:19:55 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
What happened:
If an ingress definition is created that points to an ExternalName service, which in turn produces a DNS lookup error, an endless loop of DNS requests is created that can bring the system down.
We noticed this when migrating from v0.19.0 -> v0.41.2; we have both controllers running in parallel. One of our teams was prepping for this and creating routes that pointed to yet-to-be-created DNS records. It appears the old controllers were unaffected, but there were huge amounts of DNS lookups generated by the routes on the new controller. It doesn't require actual requests to the routes; just creating the Ingress and Service definitions is enough.
Eventually this overwhelmed dnsmasq and brought down our cluster's DNS; the concurrent requests were limited by dnsmasq, but we were looking at thousands of requests per second. Was there some behaviour change between the two versions that could introduce this, and is it expected? My naive guess is that there would typically be some kind of exponential backoff on a DNS lookup error.
This is the error produced by the controller:
What you expected to happen:
DNS lookup failures to be handled with some form of backoff.
How to reproduce it:
These two definitions should be enough to reproduce the issue, assuming a proper class and namespace:
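The original manifests were not preserved in this thread, but a minimal sketch of the two definitions described might look like the following, using the current networking.k8s.io/v1 Ingress API. All names are hypothetical, and the externalName is simply any hostname that does not resolve; adjust the ingress class and namespace to match the cluster:

apiVersion: v1
kind: Service
metadata:
  name: missing-backend                          # hypothetical name
spec:
  type: ExternalName
  externalName: not-yet-created.example.com      # unresolvable DNS record
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: missing-backend                          # hypothetical name
spec:
  ingressClassName: nginx                        # adjust to the controller's ingress class
  rules:
  - host: app.example.com                        # hypothetical host
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: missing-backend
            port:
              number: 80

With these applied, the controller's Lua resolver should start logging dns_lookup() failures like the ones quoted above, without any client traffic hitting the route.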
/kind bug