Not sure if it was caused by the base image upgrade (Alpine 3.13), but we have seen the same issue in our environment after upgrading to 0.44.0. It happened four times in two days before we downgraded to 0.43.0; we haven't seen a single instance since the downgrade.
All four incidents started with an nginx config reload.
First we saw very high CPU loads on nodes hosting Nginx pods. Those nodes have 128 cores and the loads got up to 1k-2k:
According to the nginx logs, every one of those load spikes followed a config reload:
I0225 17:59:31.825591 6 controller.go:146] "Configuration changes detected, backend reload required"
I0225 17:59:32.429276 6 controller.go:163] "Backend successfully reloaded"
I0225 17:59:32.429503 6 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"ingress-nginx", Name:"ingress-nginx-controller-68974744d7-7v54w", UID:"668e9e5b-37e7-4493-ba37-51b10522c7b0", APIVersion:"v1", ResourceVersion:"1342977486", FieldPath:""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration
2021/02/25 17:59:36 [alert] 63#63: worker process 35497 exited on signal 11 (core dumped)
2021/02/25 17:59:36 [alert] 63#63: worker process 34639 exited on signal 11 (core dumped)
2021/02/25 17:59:36 [alert] 63#63: worker process 34938 exited on signal 11 (core dumped)
2021/02/25 17:59:36 [alert] 63#63: worker process 34249 exited on signal 11 (core dumped)
2021/02/25 17:59:36 [alert] 63#63: worker process 35369 exited on signal 11 (core dumped)
2021/02/25 17:59:36 [alert] 63#63: worker process 34430 exited on signal 11 (core dumped)
2021/02/25 17:59:36 [alert] 63#63: worker process 35181 exited on signal 11 (core dumped)
2021/02/25 17:59:36 [alert] 63#63: worker process 34138 exited on signal 11 (core dumped)
At this point, readiness/liveness probes became hit or miss. If the pod was lucky enough to hit the liveness failure threshold, kubelet would restart it and the system load would recover to normal. Otherwise, it could affect other pods on the same node, or even knock the whole node down.
FWIW, whilst we haven't seen segfaults, we have noted quite a CPU jump from 0.42 -> 0.44:
You can see it here with our deployment at 11:10:
Interestingly, our lower-traffic qualifying environments observed no increase in CPU, so this only seems to be an "under load" thing.
Will switch to 0.43 and report back.
Yup can confirm 0.43 is back to expected levels.
I also have a similar issue. Recently several nodes got NodePressure status and pods were evicted because the ingress controller generated so many core dumps, almost 200 GB in total.
Using gdb, here is what I got:
bash-5.1# gdb /sbin/nginx core.904
GNU gdb (GDB) 10.1
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-alpine-linux-musl".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /sbin/nginx...
warning: Can't open file /dev/zero (deleted) during file-backed mapping note processing
warning: Can't open file /[aio] (deleted) during file-backed mapping note processing
[New LWP 904]
[New LWP 913]
[New LWP 936]
[New LWP 907]
[New LWP 963]
[New LWP 915]
[New LWP 947]
[New LWP 934]
[New LWP 917]
[New LWP 941]
[New LWP 928]
[New LWP 951]
[New LWP 932]
[New LWP 953]
[New LWP 939]
[New LWP 955]
[New LWP 943]
[New LWP 909]
[New LWP 945]
[New LWP 961]
[New LWP 949]
[New LWP 957]
[New LWP 938]
[New LWP 959]
[New LWP 964]
[New LWP 930]
[New LWP 926]
[New LWP 925]
[New LWP 919]
[New LWP 921]
[New LWP 923]
[New LWP 905]
[New LWP 911]
--Type <RET> for more, q to quit, c to continue without paging--
Core was generated by `nginx: master process /usr/local/nginx/sbin/nginx -c'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007fa2ed778c59 in ?? () from /lib/ld-musl-x86_64.so.1
[Current thread is 1 (LWP 904)]
(gdb) backtrace
#0 0x00007fa2ed778c59 in ?? () from /lib/ld-musl-x86_64.so.1
#1 0x00000000000000a0 in ?? ()
#2 0x00007ffd3bb46230 in ?? ()
#3 0x00007ffd3bb461c0 in ?? ()
#4 0x0000000000000000 in ?? ()
After cleaning up, it only takes a few hours for the node to get DiskPressure again. Could you help me figure out how to troubleshoot further and find the cause?
@alfianabdi Well, first of all, the ?? entries in your backtrace indicate that you don't have all of the debugging symbols installed.
Run the following to install the musl debug symbols before running that gdb command:
apk add musl-dbg
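For anyone else hitting this, a minimal sketch of the whole workflow (the pod name is just the one from the logs earlier in this thread, and the core file name and its location inside the container are assumptions; note that apk needs root, so depending on how your image runs you may have to do this on the node or in a throwaway copy of the image instead):

# exec into the controller container that produced the core (names are examples)
kubectl -n ingress-nginx exec -it ingress-nginx-controller-68974744d7-7v54w -- sh
# inside the container: install gdb and the musl debug symbols
apk add gdb musl-dbg
# load the core against the nginx binary and print the backtrace non-interactively
gdb -batch -ex bt /sbin/nginx core.904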
@ReillyTevera
Thanks, I finally got it; somehow it does not work on the arm64 node. Here is what I got from the backtrace:
#0 a_crash () at ./arch/x86_64/atomic_arch.h:108
#1 get_nominal_size (end=0x7f2d94edb90c "", p=0x7f2d94eda790 "") at src/malloc/mallocng/meta.h:169
#2 __libc_free (p=0x7f2d94eda790) at src/malloc/mallocng/free.c:110
#3 0x00007f2d991acf7b in lj_vm_ffi_call () from /usr/local/lib/libluajit-5.1.so.2
#4 0x00007f2d991f3077 in lj_ccall_func (L=<optimized out>, cd=<optimized out>) at lj_ccall.c:1382
#5 0x00007f2d9920938d in lj_cf_ffi_meta___call (L=0x7f2d951c9380) at lib_ffi.c:230
#6 0x00007f2d991aab45 in lj_BC_FUNCC () from /usr/local/lib/libluajit-5.1.so.2
#7 0x00007f2d991bd8ff in lua_pcall (L=L@entry=0x7f2d951c9380, nargs=nargs@entry=0, nresults=nresults@entry=0, errfunc=errfunc@entry=10) at lj_api.c:1140
#8 0x00005587e51718aa in ngx_http_lua_do_call (log=log@entry=0x7f2d98da2568, L=L@entry=0x7f2d951c9380)
at /tmp/build/lua-nginx-module-138c1b96423aa26defe00fe64dd5760ef17e5ad8/src/ngx_http_lua_util.c:4233
#9 0x00005587e51888ce in ngx_http_lua_init_worker_by_inline (log=0x7f2d98da2568, lmcf=<optimized out>, L=0x7f2d951c9380)
at /tmp/build/lua-nginx-module-138c1b96423aa26defe00fe64dd5760ef17e5ad8/src/ngx_http_lua_initworkerby.c:323
#10 0x00005587e5188786 in ngx_http_lua_init_worker (cycle=0x7f2d98da2550)
at /tmp/build/lua-nginx-module-138c1b96423aa26defe00fe64dd5760ef17e5ad8/src/ngx_http_lua_initworkerby.c:296
#11 0x00005587e50b20ab in ngx_worker_process_init (cycle=cycle@entry=0x7f2d98da2550, worker=<optimized out>) at src/os/unix/ngx_process_cycle.c:955
#12 0x00005587e50b26d0 in ngx_worker_process_cycle (cycle=0x7f2d98da2550, data=<optimized out>) at src/os/unix/ngx_process_cycle.c:759
#13 0x00005587e50b06f1 in ngx_spawn_process (cycle=cycle@entry=0x7f2d98da2550, proc=0x5587e50b26af <ngx_worker_process_cycle>, data=0x4,
name=0x5587e51d7f4f "worker process", respawn=respawn@entry=5) at src/os/unix/ngx_process.c:199
#14 0x00005587e50b1885 in ngx_reap_children (cycle=cycle@entry=0x7f2d98da2550) at src/os/unix/ngx_process_cycle.c:641
#15 0x00005587e50b34fe in ngx_master_process_cycle (cycle=0x7f2d98da2550, cycle@entry=0x7f2d98da7210) at src/os/unix/ngx_process_cycle.c:174
#16 0x00005587e5085ad9 in main (argc=<optimized out>, argv=<optimized out>) at src/core/nginx.c:386
@alfianabdi That's the same stacktrace I posted in my original comment and unfortunately it doesn't give us (or the ingress-nginx maintainers) any new information to help debug the issue. It does confirm that you are experiencing the same issue as I was at least (and presumably the same as the rest of the people in this issue even though they haven't posted backtraces to confirm).
I just checked the latest changelogs for Alpine 3.13.x, musl, and the newest nginx version and nothing in them looks like it could be helpful. I would not expect this to be resolved with an upcoming ingress-nginx image (unless the issue was caused by something transient in the build).
Pinging the following (mentioned in the owners file) for visibility.
Same issue here with 0.44.0. Higher-loaded clusters are affected more often. Is this resolved with 0.45.0? (I don't think so, according to the changelog.)
Yeah we're quite loaded. Each instance doing around 400 ops/sec? We never saw the seg fault but observed almost double the load (cpu) for the same ops, until we rolled back.
Maybe slightly off topic, as I don't know if the CPU spikes we saw were 100% caused by something in Alpine, but would it be worth providing a Debian-based image as an option? From what I gathered from #6527, the motivations were:
But would the same goals still be achievable with a trimmed-down version of Debian, like distroless?
Although I still love Alpine for a lot of things, I have also moved away from it for many projects due to some well-known issues, like the performance hit (mainly because of musl libc, I think) and networking/DNS problems (e.g. https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/#known-issues)
I haven't tried 0.45.0, but I just compared its image with the 0.44.0 one, and the musl library, the nginx binary, and the libluajit library all hash to the same values as in 0.44.0. I would be very surprised if the issue were resolved, given that it is most likely in one of those.
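If anyone wants to repeat that comparison, a quick sketch of how I'd do it (the image references are just how I'd pull the published controller images, and the three paths are the ones that appear in the backtraces in this thread; adjust both to your setup):

# hash the nginx binary, the musl loader, and the LuaJIT library in each image
for tag in v0.44.0 v0.45.0; do
  echo "== $tag =="
  docker run --rm --entrypoint sha256sum k8s.gcr.io/ingress-nginx/controller:$tag \
    /sbin/nginx /lib/ld-musl-x86_64.so.1 /usr/local/lib/libluajit-5.1.so.2
done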
Unfortunately, I don't think I can be of much more assistance in debugging this issue. We ended up switching to Traefik as our ingress controller because of it (and also because Traefik doesn't close TCP connections when it reloads its config). We no longer have any ingress-nginx deployments running at all and have no plans to switch back even if this issue is fixed.
Can the coredump be reproduced at will? (For example, by sending traffic to the controller on a kinD or minikube cluster on a laptop.)
Follow-up: I got coredumps with both v0.44.0 and v0.45.0 on CentOS 7, Docker 19.03, kernel 4.20.13 from elrepo.
-------------------------------------------------------------------------------
NGINX Ingress controller
Release: v0.45.0
Build: 7365e9eeb2f4961ef94e4ce5eb2b6e1bdb55ce5c
Repository: https://github.com/kubernetes/ingress-nginx
I0420 15:56:09.765547 6 flags.go:208] "Watching for Ingress" class="nginx"
W0420 15:56:09.765622 6 flags.go:213] Ingresses with an empty class will also be processed by this Ingress controller
nginx version: nginx/1.19.6
-------------------------------------------------------------------------------
I0420 15:56:15.300180 6 controller.go:163] "Backend successfully reloaded"
I0420 15:56:15.300272 6 controller.go:174] "Initial sync, sleeping for 1 second"
I0420 15:56:15.300626 6 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"ingress-nginx", Name:"ingress-nginx-controller-2g4fc", UID:"80460d24-5a6c-4524-8c21-ab08140e6efb", APIVersion:"v1", ResourceVersion:"19639245", FieldPath:""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration
2021/04/20 15:56:17 [alert] 26#26: worker process 39 exited on signal 11 (core dumped)
2021/04/20 15:56:17 [alert] 26#26: worker process 40 exited on signal 11 (core dumped)
2021/04/20 15:56:17 [alert] 26#26: worker process 38 exited on signal 11 (core dumped)
2021/04/20 15:56:18 [alert] 26#26: worker process 139 exited on signal 11 (core dumped)
and so on...
@longwuyuan I don't know that anyone has reproduced this in kinD or minikube. If this issue is happening for someone, though, it is fairly consistent. We have multiple k8s clusters that are fairly identical and the issue was present in all of them (the segfaults just happen at a reduced rate on clusters that don't process as much traffic).
@LuckySB Can I ask why you're using that kernel? 4.20 has been end-of-life since March of 2019, it is very insecure to be using it now. I see that 5.4 is in elrepo, if I were you I'd just use that as it's a LTS kernel and is supported until Dec 2025.
Any chance of anyone posting a reasonably detailed step-by-step process to reproduce this problem? In particular, I'm hoping for the OS, networking, and similar specifications nailed down for reproducing, because I am not able to.
/remove-kind bug
/triage needs-information /triage not-reproducible
@longwuyuan Multiple people have provided stacktraces, and additionally have full nginx worker coredumps that they can provide to the ingress-nginx core developers (obviously they are sensitive files). I suppose I'm curious as to why that is not sufficient?
Hi @ReillyBrogan sorry to hear that you are no longer using ingress-nginx.
There is a topic in the upcoming sig-network to figure out how this project can get the appropriate amount of bandwidth from the community.
@ReillyBrogan I am not able to reproduce. The info available hints at a combination of the node's kernel version and a certain volume of traffic, so I am guessing available CPU/memory, etc.
I think the issue is related to the amount of traffic combined with frequent nginx reloads, not necessarily to the available CPU/memory.
The cluster where we saw the issue very frequently (~50 restarts in 3 days across all 4 ingress pods) was a dev cluster with very frequent config changes and quite a lot of traffic through the ingress controller. Our kubelets in this cluster (55 in total) all have 24 CPUs and 64 GB of memory, with an average usage of around 60% (CPU/mem). The OS is RHEL 7.9 with kernel 5.10.15-1.el7.elrepo.x86_64.
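In case it helps with reproduction, the "frequent config changes" part can be simulated with something like the rough sketch below (it assumes an existing Ingress named demo in the default namespace; it toggles a documented ingress-nginx annotation so the generated nginx.conf actually changes and forces a reload each time):

# alternate an annotation value so the controller regenerates its config and reloads
while true; do
  kubectl annotate ingress demo nginx.ingress.kubernetes.io/proxy-body-size=8m --overwrite
  sleep 10
  kubectl annotate ingress demo nginx.ingress.kubernetes.io/proxy-body-size=16m --overwrite
  sleep 10
done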
Do you have metrics for inodes, file handles, conntrack, and similar resources on the node where the pod was running at the time of the segfault?
Hey; as the segfaults are relatively infrequent and difficult to reproduce, shouldn't we be working with data that's more readily accessible? As I demonstrated above, we observe roughly double the CPU usage between 0.43 and 0.44, and it's not a huge leap to say that whatever is causing that additional load is only going to exacerbate config reloads (already a high-CPU event).
The CPU increase should be relatively trivial to reproduce. In the above example we're running 6 pods with 500m CPU (no limits), with each pod doing around 250-300 ops/sec.
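For the load part, any HTTP load generator should do; a sketch with wrk (the hostname, load-balancer address and path are placeholders for an Ingress that routes to a real backend):

# sustained load through the ingress; run it once per controller version and compare
# CPU usage of the controller pods (e.g. with kubectl top pods -n ingress-nginx)
wrk -t8 -c200 -d5m -H "Host: demo.example.com" http://<ingress-lb-address>/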
I can confirm we saw the same issue on AKS with version 0.45.0. Issue went away when we downgraded to 0.43.0.
I can confirm we saw the same issue on GCP/GKE with version 0.45.0. Issue also went away with 0.43.0.
From our Compute Engine nodes, we also found that:
[39968.569424] traps: nginx[1006038] general protection fault ip:7f6a54085c59 sp:7ffc2d3b6230 error:0 in ld-musl-x86_64.so.1[7f6a54077000+48000]
On this cluster, we have a lot of ingresses (~200). We didn't see this issue on a similar cluster with a comparable number of ingresses.
Just learned about this issue the hard way! I confirm the issue is present on 0.46.0 as well. Planning a downgrade till the issue is fixed.
If I were someone interested in working on this (I don't really have time at the moment for something we no longer use), my next step would be to build a new Alpine source container using a musl library compiled from git source, then use that source image to build an nginx-controller image and test it. There are a number of commits since the musl 1.2.2 release that mention changes to malloc in some way, and there is a decent chance that one of them fixes this issue.
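To make the musl part concrete, a very rough sketch (purely illustrative: the repo URL, configure flags, and the idea of copying the rebuilt loader into a test copy of the controller image are my assumptions, not the project's build process):

# inside an alpine:3.13 container with git and build-base installed
git clone git://git.musl-libc.org/musl && cd musl
./configure --prefix=/usr --syslibdir=/lib
make -j"$(nproc)"
make DESTDIR=/out install
# /out/lib/ld-musl-x86_64.so.1 (musl's combined libc/loader) can then be copied
# into a test copy of the controller image to see whether the segfaults go away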
Just wanted to comment here that we are also experiencing this issue. Using v0.46.0, kernel 4.19.0-12-amd64, Debian 10.6.
Hey folks, we've just released controller v0.47.0 -> https://github.com/kubernetes/ingress-nginx/releases/tag/controller-v0.47.0
Can you please confirm whether this still happens? It now uses nginx v1.20.1.
We are discussing again whether we should start using Debian slim instead of Alpine (Ingress NGINX already did that in the past).
Thanks
/remove-triage needs-information /remove-triage not-reproducible /triage accepted
We see this issue as well on 0.45. On Kubernetes clusters that frequently add ingresses (e.g. build systems) we see this multiple times per hour. We have never seen it on production clusters with infrequent new ingresses. Will try 0.47 over the next few days.
I will do the same on a test cluster and report back.
Still happens on 0.47 for us. We got two crashes within a few hours. Error:
I0608 05:45:39.848955 6 controller.go:163] "Backend successfully reloaded"
I0608 05:45:39.849571 6 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"xxx", Name:"nginx-ingress-controller-57fb9bd94d-54545", UID:"20a138f0-3f03-40aa-b0ae-d6fb45ed92b5", APIVersion:"v1", ResourceVersion:"796859441", FieldPath:""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration
W0608 05:45:39.858318 6 nginx.go:1182] the server xxx has SSL configured but the SSL certificate does not contains a CN for xxx. Redirects will not work for HTTPS to HTTPS
W0608 05:45:39.858334 6 nginx.go:1182] the server xxx has SSL configured but the SSL certificate does not contains a CN for xxx. Redirects will not work for HTTPS to HTTPS
W0608 05:45:39.858393 6 nginx.go:1182] the server xxx has SSL configured but the SSL certificate does not contains a CN for xxx. Redirects will not work for HTTPS to HTTPS
2021/06/08 05:45:41 [alert] 28#28: worker process 38939 exited on signal 11 (core dumped)
2021/06/08 05:45:41 [alert] 28#28: worker process 38938 exited on signal 11 (core dumped)
2021/06/08 05:45:42 [alert] 28#28: worker process 39006 exited on signal 11 (core dumped)
We also still have the issue on 0.47. Multiple crashes on ingress patch/create:
2021/06/07 12:27:41 [alert] 59#59: worker process 700 exited on signal 11 (core dumped)
156095:Jun 7 12:27:09 XXXXXXXXX kernel: [14957845.655837] traps: nginx[15990] general protection ip:7f19693e2c59 sp:7fff5cf74be0 error:0 in ld-musl-x86_64.so.1[7f19693d4000+48000]
We have a lot of ingresses (280) and frequent patch/create (development server). Not a single crash since rollback to 0.43.
Is there anything we can do to help here? I'm thinking of building my own ingress-nginx image based on debian-slim and doing some tests with that image, because I think this issue may be related to musl libc in Alpine 🤷🏻♂️
It could also be a good idea to try custom builds with a newer LuaJIT or Alpine version. Or try making images from commits between the 0.43.0 and 0.44.0 releases to verify that it really was the Alpine update that made the segfaults appear. These could be quick things to try before making the Debian image.
We're running 0.46.0 in a single dev environment, and have only seen one segfault in about a month. All our other environments are still on 0.43.0. I haven't been able to reproduce the segfaults more frequently, so I do not have a way to test any custom builds.
In Alpine's 3.14 release they changed to OpenResty's LuaJIT fork (https://gitlab.alpinelinux.org/alpine/aports/-/commit/c12fb28e6d794fa1cb9ceda035d06edbff432c29), noting that the previous LuaJIT implementation caused segfaults with some Lua modules. As ingress-nginx already uses the OpenResty LuaJIT fork, this is likely a different issue, but it might be worth trying to upgrade LuaJIT anyway.
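If someone wants to double-check what a given controller image actually ships, a small sketch (the tag is an example, and whether a standalone luajit binary is on PATH in the image is an assumption, hence the fallback):

# list the bundled LuaJIT library and try to print a version string
docker run --rm --entrypoint sh k8s.gcr.io/ingress-nginx/controller:v0.46.0 -c \
  'ls -l /usr/local/lib/libluajit-5.1.so*; luajit -v 2>/dev/null || echo "no luajit binary on PATH"'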
Hi, we have seen this problem in our development cluster with 0.46 quite often, so we downgraded the image to 0.43. We are still seeing this error in the logs from the 0.43 ingress after downgrading. Our ingress controller serves about 500+ ingress resources. If we can add anything to help trace the cause of this issue, please let us know.
hey folks,
Thanks for your help and patience.
Yes, maybe testing with the new Alpine might be something. Alpine 3.14 got some changes in syscalls that might lead to other problems (we discussed that in #ingress-nginx-dev on Slack), so we are really considering moving to Debian slim in future releases as well.
I'm pretty low on bandwidth today, but I can release an ingress-nginx image using Alpine 3.14 in my personal repo if someone wants to test, and maybe create an alternative Debian release so you can test that too; sounds good?
About LuaJIT I'll defer to @moonming and @ElvinEfendi
/priority critical-urgent Let's check this next and maybe try to create two kinds of images (Debian and Alpine) to check the differences :)
Just leaving a comment. Although I can't offer much useful debugging information, some details may be helpful for reference.
I checked other nginx-based projects (APISIX / APISIX Ingress Controller) that also use Alpine 3.13 as the base image and ran some stress tests to simulate the environment, but unfortunately no similar signal 11 errors were found.
If there are other specific scenarios that make it easier to reproduce the problem, please let me know.
LuaJIT version: 2.1, base image: Alpine 3.13
My company is using 0.46 on the production cluster, which has high usage but no issues at all.
The development cluster does have the issue, though; the only differences are that the dev cluster has many more ingresses and reloads much more frequently.
It seems this has nothing to do with the base image but rather with usage patterns: the problem shows up with frequent configuration updates and reloads.
I suspect the same, so a stack trace of the dumped core comes to mind.
@longwuyuan There's a backtrace in the first comment
We encountered the same issue on EKS with 0.44.0. We thought that downgrading to 0.43.0 would fix the issue, but it didn't. The issue doesn't seem to happen in our preprod environment, where the only difference is the load.
Hi folks.
We are planning some actions on this:
To help me reproduce this:
FYI, this is the PR updating LuaJIT and everything else: https://github.com/kubernetes/ingress-nginx/pull/7411
We have the issue with our big cluster (633 ingresses). It segfaults at startup, without load (traffic was not transferred yet). Other clusters are fine (<200 ingresses). Same version of K8s (EKS 1.19).
@laghoule thanks! This helps me to generate some scenarios and simulate!
@rikatz as I mentioned above, we can observe notably high CPU under load (almost double) so you should be able to reproduce that relatively easily with some load testing and then switching versions (https://github.com/kubernetes/ingress-nginx/issues/6896#issuecomment-788886910).
NGINX Ingress controller version: 0.44.0
Kubernetes version (use kubectl version): 1.18.3
Environment:
What happened: We encountered a major production outage a few days ago that was traced back to ingress-nginx. The ingress-nginx pod logs were filled with messages like the following:
We discovered that the following message was being printed to the system log as well (timed with the worker exits):
I ultimately identified that whatever was occurring was linked to the version of ingress-nginx we were using and reverted production to 0.43.0 until we could identify the underlying issue.
We have a few other lower-load ingress-nginx deployments that have remained at 0.44.0 and have observed apparently random worker crashes; however, there are always enough running workers, and the crashes are infrequent enough, that things seemingly remain stable.
I was able to get a worker coredump from one of those infrequent crashes and the backtrace is as follows:
One of the major differences between 0.43.0 and 0.44.0 is the update to Alpine 3.13. Perhaps the version of musl in use is the issue, and it would be appropriate to revert that change until Alpine has released a fixed version?
/kind bug