Closed AmitBenAmi closed 3 years ago
Seems that we don't have a mechanism to change the worker_rlimit_core
directive, we may add this feature, what's your idea?
That sounds good on its own; however, it won't make NGINX create fewer core dumps. If I restrict the size to 2MB, it can still create thousands of these dumps and still explode my filesystem (unless I set it to 0, meaning that no core dumps will be created, in which case I'm ignoring the problem rather than noticing it exists).
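For reference, the directive under discussion would look something like this in the main nginx.conf context (a sketch only; ingress-nginx does not currently expose it through its ConfigMap):

```nginx
# Sketch: cap each worker's core dump size with worker_rlimit_core.
# Setting it to 0 disables dumps entirely, which hides the crash
# rather than surfacing it.
worker_rlimit_core 2m;
working_directory  /tmp;   # where dumps are written, if enabled
```

Note that, as pointed out above, a size cap does not bound the *number* of dumps, only the size of each one.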
That looks like an internal bug in OpenSSL; it's difficult to troubleshoot because the debug symbols were stripped. We may wait a while and see whether somebody has a similar experience, which might be useful.
Would you be interested in showing what your cluster looks like with:
- cat /proc/cpuinfo ; cat /proc/meminfo # from nodes
- helm ls -A
- kubectl get all,nodes,ing -A -o wide
- kubectl -n ingresscontrollernamespace describe po ingresscontrollerpod
- Get the nginx.conf from inside the pod and paste it here
- CPU/Memory/Inodes/Disk related status from your monitoring
@longwuyuan I don't want to expose that kind of information about my environment. If there is something more specific I can maybe share it, but this is a lot of information. I can say that my ingress pods didn't terminate; they only created a significant number of core dumps.
Maybe write very clear details about hardware, software, config and the list of commands etc that someone can execute, for example on minikube, to be able to reproduce this problem
I have no idea how to reproduce this.
My hardware is EKS (AWS EC2).
NGINX docker image is: v0.41.2
About configuration, I have thousands of Ingresses that populate nginx.conf automatically with hundreds of locations and other nginx configuration.
Any idea how I can export a full dump interpretation of this to maybe help understand the problem?
Not every AWS EKS user is reporting the same behaviour. There was one other issue reported stating core dumps; the best thought there was to spread the load. Any chance the problem is being caused by your use case only?
/remove-kind bug
/triage needs-information
I double-checked and the load isn't different or suddenly too immense.
I guess it is probably an error with something in my environment and not necessarily a bug in NGINX, but my nginx.conf consists of thousands of lines. @longwuyuan, do you have any idea where I should look in the configuration itself?
You could be hitting a limit or a memory violation; hard to tell which until the core backtrace is explicit. Your earlier post shows '?' symbols in gdb, and then it shows crypto and then libssl. I am no developer so I can't help much, but as someone said elsewhere, '?' means you are missing symbols. And crypto/ssl could mean all your TLS config was coming into play and nginx could not handle the size, since, as you say, you have thousands.
You can upgrade to the most recent release of the ingress controller, check how to run gdb on nginx core dumps, and post another backtrace that shows the size or any other details of the data structure it's complaining about:
Unexpected size of section `.reg-xstate/4969' in core file
Also you can try to replicate the size of objects in another cluster but try spreading the load.
@tokers has the option to set worker_rlimit_core ever been added? We're now facing this issue and more or less know the root cause for us (it's a chain of user error in configuring a number of certificates, which cert-manager seems to endlessly retry to validate via the http solver but fails because they're not set up properly, which leads to SSL errors in ingress-nginx, which seems to lead to core dumps, which fills the disk, and in the end everything is dead).
I realise ignoring the core dumps is hiding the issue, but in our scenario that would be much preferred to taking out the entire ingress over some misconfigured certs.
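One OS-level way to suppress the dumps (again, hiding rather than fixing) is to zero the core-size rlimit for the process tree; a minimal sketch:

```shell
# Sketch: disable core dumps for this shell and everything it spawns.
# In Kubernetes this would have to happen in the container entrypoint
# (or via some image-level mechanism), not in a one-off exec session.
ulimit -c 0          # core size limit of 0 blocks dump creation
ulimit -c            # verify: prints 0
```

A zero rlimit means the kernel never writes the dump, so the disk never fills, at the cost of losing all crash evidence.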
@mitom how did you come up with finding that the chain is the root cause? Is it something you found out from the core dumps themselves?
No, the core dumps only contained:
#0 0x00007fa81c6d9c59 in ?? () from /lib/ld-musl-x86_64.so.1
#1 0x00000000000000a0 in ?? ()
#2 0x00007fff2b051e20 in ?? ()
#3 0x00007fff2b051db0 in ?? ()
#4 0x0000000000000000 in ?? ()
which doesn't really mean anything to me. It is more or less an educated guess based on the fact that, around the time we had this issue, the controller logs were spammed with invalid-certificate errors.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
We have the same issue in the core dump: https://github.com/kubernetes/ingress-nginx/issues/6896#issuecomment-929951627 even with newer nginx and debian images.
more or less know the root cause for us (it's a chain of user error in configuring a number of certificates, which cert-manager seems to endlessly re-try to validate via the http solver but fails because they're not set up properly, which leads to ssl errors in nginx ingress which seems to lead to core dumps which fills disk and in the end everything is dead).
We also use cert-manager, but have no errors or unvalidated certs. There are no errors in either the cert-manager or ingress-nginx logs, but the worker still dies with worker process 931 exited on signal 11.
Another thing I've noticed is that nginx_ingress_controller_nginx_process_connections only grows and never shrinks:
And each of these small steps up is a worker-death event. So per nginx stats there should currently be 30k active connections.
But if I login to this exact pod - there is only 2k:
$ k -n ingress-nginx exec -it ingress-nginx-controller-5cf78859f4-7l9cc -- bash
bash-5.1$ netstat -tn | wc -l
2351
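That drift is what you'd expect if workers increment the connection stat on accept but never decrement it when they die on SIGSEGV. A toy illustration of the accounting (pure illustration, not ingress-nginx code):

```shell
gauge=0
gauge=$((gauge + 100))   # a worker accepts 100 connections
gauge=$((gauge - 100))   # clean closes decrement the stat
gauge=$((gauge + 50))    # the worker accepts 50 more...
# ...and then segfaults: nothing ever decrements these 50,
# so the reported figure stays inflated after every crash.
echo "gauge=$gauge"      # prints gauge=50
```

Repeated over thousands of worker deaths, the metric ends up at 30k while the kernel (via netstat) only sees the ~2k connections that actually exist.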
We cannot submit this issue to nginx upstream, because ingress-nginx compiles nginx from source with additional plugins and patches. Also, I have pretty limited knowledge of gdb and debug symbols, so I was unable to find the symbols for libssl on either alpine or debian to resolve this part of the core dump:
#2 0x00007f5bfdc7dd0d in OPENSSL_LH_doall_arg () from /lib/libcrypto.so.1.1
#3 0x00007f5bfddeb6d0 in SSL_CTX_flush_sessions () from /lib/libssl.so.1.1
#4 0x00007f5bfde01ad3 in ?? () from /lib/libssl.so.1.1
#5 0x00007f5bfddf5fb4 in ?? () from /lib/libssl.so.1.1
Any help would be greatly appreciated.
Hey @sepich, thanks. I will start digging into OpenSSL problems now, since we can rule out the OpenResty bug.
Are you using NGINX v1.0.2?
Can you provide me some further information about the size of your environment, the number of Ingresses, and the number of different SSL certificates?
Thanks
Are you using NGINX v1.0.2?
No, we are still on k8s 1.19 and so on ingress-nginx v0.49.2.
size of your environment
From https://github.com/kubernetes/ingress-nginx/issues/6896#issuecomment-928138962
It is just 215 Ingress objects / 80 rps, 3 ingress-nginx pods with 5% CPU load each.
99% of the Ingresses use SSL, so I would say there are 215 certs as well. This number is pretty stable; it's not like ingresses are created and deleted every 5 minutes, more like once per week.
Ok, thanks! Will check ASAP :)
I'm wondering if this patch (https://github.com/openresty/openresty/blob/master/patches/openssl-1.1.1f-sess_set_get_cb_yield.patch) which is applied by Openresty shouldn't be applied in OpenSSL as well.
@sepich in case I generate an image of 0.49.3 (to be released) with Openresty OpenSSL patch applied, are you able to test and provide some feedback on that?
Hi @sepich , I have sent you an email to arrange a call with an interactive gdb session as said here. https://github.com/kubernetes/ingress-nginx/issues/6896#issuecomment-929951627
Thanks very much!
@rikatz, great finding! This patch was originally created as two parts, for both nginx and openssl: https://github.com/openresty/openresty/commit/97901f335709e8f3b2dec1c368bba20f3894fccc
https://github.com/openresty/lua-resty-core/blob/master/lib/ngx/ssl/session.md#description
This Lua API can be used to implement distributed SSL session caching for downstream SSL connections, thus saving a lot of full SSL handshakes which are very expensive.
I've checked that none of ngx.ssl.session, ssl_session_fetch_by_lua* and ssl_session_store_by_lua* are being used in ingress-nginx. We also do not use any Lua code in ingress snippets. So I deleted the images/nginx/rootfs/patches/nginx-1.19.9-ssl_sess_cb_yield.patch file (to avoid rebuilding openssl), then rebuilt nginx and v0.49.2. But the issue and the coredump backtrace are the same:
#5 0x00007fdb5619efb4 in ?? () from /lib/libssl.so.1.1
#6 0x0000562d5b8b8c68 in ngx_ssl_handshake (c=c@entry=0x7fdb55a2fa20) at src/event/ngx_event_openssl.c:1720
#7 0x0000562d5b8b9081 in ngx_ssl_handshake_handler (ev=0x7fdb5588a0c0) at src/event/ngx_event_openssl.c:2069
But there is one more patch for ngx_event_openssl.c, nginx-1.19.9-ssl_cert_cb_yield.patch:
https://github.com/openresty/lua-nginx-module#ssl_certificate_by_lua_block
I checked that the ingress-nginx Lua code does not use this either, and rebuilt the image without this patch too. But the issue still remains.
It looks like I misunderstood something; maybe you can build a test image with only the minimum set of patches needed to make ingress-nginx-controller work?
@doujiang24, got it!
Yeap, I can.
Actually I already have a base image with the right patches, and proper linking:
$ ldd /sbin/nginx | grep ssl
	libssl.so.1.1 => /usr/local/openresty/openssl111/lib/libssl.so.1.1 (0x7f7be72b7000)
	libcrypto.so.1.1 => /usr/local/openresty/openssl111/lib/libcrypto.so.1.1 (0x7f7be6fc1000)
I have published this base image as rpkatz/nginx:patchedopenresty so you can build your own controller using it, for example on the "legacy/0.49x" branch.
/remove-lifecycle stale
you can build your own controller using it
Thank you, unfortunately it still fails (v0.49.2 on top of it):
Core was generated by `nginx: worker process '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f1b4834bb25 in ?? () from /usr/local/openresty/openssl111/lib/libssl.so.1.1
[Current thread is 1 (LWP 69)]
(gdb) bt
#0 0x00007f1b4834bb25 in ?? () from /usr/local/openresty/openssl111/lib/libssl.so.1.1
#1 0x00007f1b4819b86f in OPENSSL_LH_doall_arg () from /usr/local/openresty/openssl111/lib/libcrypto.so.1.1
#2 0x00007f1b4834cdb7 in SSL_CTX_flush_sessions () from /usr/local/openresty/openssl111/lib/libssl.so.1.1
#3 0x00007f1b48367d55 in ?? () from /usr/local/openresty/openssl111/lib/libssl.so.1.1
#4 0x00007f1b4835923d in ?? () from /usr/local/openresty/openssl111/lib/libssl.so.1.1
#5 0x0000561c111a7cda in ngx_ssl_handshake (c=c@entry=0x7f1b47b58b78) at src/event/ngx_event_openssl.c:1720
#6 0x0000561c111a80fe in ngx_ssl_handshake_handler (ev=0x7f1b479bc100) at src/event/ngx_event_openssl.c:2091
#7 0x0000561c111a26d3 in ngx_epoll_process_events (cycle=0x7f1b46488770, timer=<optimized out>, flags=<optimized out>) at src/event/modules/ngx_epoll_module.c:901
#8 0x0000561c111956b0 in ngx_process_events_and_timers (cycle=cycle@entry=0x7f1b46488770) at src/event/ngx_event.c:257
#9 0x0000561c1119fd7f in ngx_worker_process_cycle (cycle=0x7f1b46488770, data=<optimized out>) at src/os/unix/ngx_process_cycle.c:753
#10 0x0000561c1119ddc0 in ngx_spawn_process (cycle=cycle@entry=0x7f1b46488770, proc=proc@entry=0x561c1119fc76 <ngx_worker_process_cycle>, data=data@entry=0x0,
name=name@entry=0x561c112c8037 "worker process", respawn=respawn@entry=-4) at src/os/unix/ngx_process.c:199
#11 0x0000561c1119ea55 in ngx_start_worker_processes (cycle=cycle@entry=0x7f1b46488770, n=1, type=type@entry=-4) at src/os/unix/ngx_process_cycle.c:373
#12 0x0000561c111a0939 in ngx_master_process_cycle (cycle=0x7f1b46488770, cycle@entry=0x7f1b47f761a0) at src/os/unix/ngx_process_cycle.c:234
#13 0x0000561c11172c17 in main (argc=<optimized out>, argv=<optimized out>) at src/core/nginx.c:386
Is it now possible to load openssl debug symbols somehow?
@sepich The debug symbol package for openresty-openssl111 is openresty-openssl111-dbg. You can try to install it with apk add openresty-openssl111-dbg.
Thanks, it is:
echo https://openresty.org/package/alpine/v3.14/main >> /etc/apk/repositories
apk --allow-untrusted add openresty-openssl111-dbg
seems to be working:
(gdb) bt
#0 0x00007f1b4834bb25 in timeout_cb (s=0x7f1b471dc770, p=0x7ffd17e35850) at ssl/ssl_sess.c:1067
#1 0x00007f1b4819b86f in doall_util_fn (arg=0x7ffd17e35850, arg@entry=0x7ffd17e35810, func_arg=func_arg@entry=0x7f1b4834bb10 <timeout_cb>, func=0x0, use_arg=1, lh=0x7f1b45a1ab50)
at crypto/lhash/lhash.c:196
#2 OPENSSL_LH_doall_arg (lh=0x7f1b45a1ab50, func=func@entry=0x7f1b4834bb10 <timeout_cb>, arg=arg@entry=0x7ffd17e35850) at crypto/lhash/lhash.c:211
#3 0x00007f1b4834cdb7 in lh_SSL_SESSION_doall_TIMEOUT_PARAM (arg=0x7ffd17e35850, fn=0x7f1b4834bb10 <timeout_cb>, lh=<optimized out>) at ssl/ssl_sess.c:1081
#4 SSL_CTX_flush_sessions (s=0x7f1b45a11390, t=<optimized out>) at ssl/ssl_sess.c:1096
#5 0x00007f1b48343e98 in ssl_update_cache (s=s@entry=0x7f1b391c3980, mode=mode@entry=2) at ssl/ssl_lib.c:3562
#6 0x00007f1b48367d55 in tls_construct_new_session_ticket (s=0x7f1b391c3980, pkt=<optimized out>) at ssl/statem/statem_srvr.c:4192
#7 0x00007f1b4835923d in write_state_machine (s=0x7f1b391c3980) at ssl/statem/statem.c:843
#8 state_machine (s=0x7f1b391c3980, server=1) at ssl/statem/statem.c:443
#9 0x0000561c111a7cda in ngx_ssl_handshake (c=c@entry=0x7f1b47b58b78) at src/event/ngx_event_openssl.c:1720
#10 0x0000561c111a80fe in ngx_ssl_handshake_handler (ev=0x7f1b479bc100) at src/event/ngx_event_openssl.c:2091
#11 0x0000561c111a26d3 in ngx_epoll_process_events (cycle=0x7f1b46488770, timer=<optimized out>, flags=<optimized out>) at src/event/modules/ngx_epoll_module.c:901
#12 0x0000561c111956b0 in ngx_process_events_and_timers (cycle=cycle@entry=0x7f1b46488770) at src/event/ngx_event.c:257
#13 0x0000561c1119fd7f in ngx_worker_process_cycle (cycle=0x7f1b46488770, data=<optimized out>) at src/os/unix/ngx_process_cycle.c:753
#14 0x0000561c1119ddc0 in ngx_spawn_process (cycle=cycle@entry=0x7f1b46488770, proc=proc@entry=0x561c1119fc76 <ngx_worker_process_cycle>, data=data@entry=0x0,
name=name@entry=0x561c112c8037 "worker process", respawn=respawn@entry=-4) at src/os/unix/ngx_process.c:199
#15 0x0000561c1119ea55 in ngx_start_worker_processes (cycle=cycle@entry=0x7f1b46488770, n=1, type=type@entry=-4) at src/os/unix/ngx_process_cycle.c:373
#16 0x0000561c111a0939 in ngx_master_process_cycle (cycle=0x7f1b46488770, cycle@entry=0x7f1b47f761a0) at src/os/unix/ngx_process_cycle.c:234
#17 0x0000561c11172c17 in main (argc=<optimized out>, argv=<optimized out>) at src/core/nginx.c:386
(gdb) bt full
#0 0x00007f1b4834bb25 in timeout_cb (s=0x7f1b471dc770, p=0x7ffd17e35850) at ssl/ssl_sess.c:1067
No locals.
#1 0x00007f1b4819b86f in doall_util_fn (arg=0x7ffd17e35850, arg@entry=0x7ffd17e35810, func_arg=func_arg@entry=0x7f1b4834bb10 <timeout_cb>, func=0x0, use_arg=1, lh=0x7f1b45a1ab50)
at crypto/lhash/lhash.c:196
i = 1781
a = <optimized out>
n = 0x0
i = <optimized out>
a = <optimized out>
n = <optimized out>
#2 OPENSSL_LH_doall_arg (lh=0x7f1b45a1ab50, func=func@entry=0x7f1b4834bb10 <timeout_cb>, arg=arg@entry=0x7ffd17e35850) at crypto/lhash/lhash.c:211
No locals.
#3 0x00007f1b4834cdb7 in lh_SSL_SESSION_doall_TIMEOUT_PARAM (arg=0x7ffd17e35850, fn=0x7f1b4834bb10 <timeout_cb>, lh=<optimized out>) at ssl/ssl_sess.c:1081
No locals.
#4 SSL_CTX_flush_sessions (s=0x7f1b45a11390, t=<optimized out>) at ssl/ssl_sess.c:1096
i = 256
tp = {ctx = 0x7f1b45a11390, time = 1633097512, cache = 0x7f1b45a1ab50}
#5 0x00007f1b48343e98 in ssl_update_cache (s=s@entry=0x7f1b391c3980, mode=mode@entry=2) at ssl/ssl_lib.c:3562
stat = <optimized out>
i = <optimized out>
#6 0x00007f1b48367d55 in tls_construct_new_session_ticket (s=0x7f1b391c3980, pkt=<optimized out>) at ssl/statem/statem_srvr.c:4192
tctx = <optimized out>
tick_nonce = "\000\000\000\000\000\000\000\001"
age_add_u = {age_add_c = "Ɩ\240", <incomplete sequence \350>, age_add = 3902838470}
err = <optimized out>
#7 0x00007f1b4835923d in write_state_machine (s=0x7f1b391c3980) at ssl/statem/statem.c:843
post_work = 0x7f1b48368c10 <ossl_statem_server_post_work>
mt = 4
pkt = {buf = 0x7f1b399499b0, staticbuf = 0x0, curr = 57, written = 57, maxsize = 18446744073709551615, subs = 0x7f1b371d0b20}
ret = <optimized out>
pre_work = 0x7f1b483689e0 <ossl_statem_server_pre_work>
get_construct_message_f = 0x7f1b48368fd0 <ossl_statem_server_construct_message>
confunc = 0x7f1b483675f0 <tls_construct_new_session_ticket>
st = 0x7f1b391c39c8
transition = 0x7f1b48368580 <ossl_statem_server_write_transition>
cb = 0x561c111a4491 <ngx_ssl_info_callback>
st = <optimized out>
ret = <optimized out>
transition = <optimized out>
pre_work = <optimized out>
post_work = <optimized out>
get_construct_message_f = <optimized out>
cb = <optimized out>
confunc = <optimized out>
mt = <optimized out>
pkt = {buf = <optimized out>, staticbuf = <optimized out>, curr = <optimized out>, written = <optimized out>, maxsize = <optimized out>, subs = <optimized out>}
#8 state_machine (s=0x7f1b391c3980, server=1) at ssl/statem/statem.c:443
buf = 0x0
cb = 0x561c111a4491 <ngx_ssl_info_callback>
st = <optimized out>
ret = <optimized out>
ssret = <optimized out>
#9 0x0000561c111a7cda in ngx_ssl_handshake (c=c@entry=0x7f1b47b58b78) at src/event/ngx_event_openssl.c:1720
n = <optimized out>
sslerr = <optimized out>
err = <optimized out>
rc = <optimized out>
#10 0x0000561c111a80fe in ngx_ssl_handshake_handler (ev=0x7f1b479bc100) at src/event/ngx_event_openssl.c:2091
c = 0x7f1b47b58b78
@sepich Great, timeout_cb is the first key function we want to verify.
@rikatz, could you please share how to build an image like rpkatz/nginx:patchedopenresty?
(@doujiang24 asks to add some debugging to openssl and test)
While recompiling openssl I found a workaround for this issue: edit the nginx.conf line
ssl_session_cache builtin:1000 shared:SSL:10m;
and drop builtin:1000. From the docs:
builtin a cache built in OpenSSL; Use of the built-in cache can cause memory fragmentation.
using only shared cache without the built-in cache should be more efficient.
Unfortunately it is not exposed via an annotation, so I had to edit the template. There is even an SO article about this.
It would be interesting to know why builtin:1000 is hardcoded. I understand this is not a fix for the openssl issue, but maybe drop builtin from the template for everybody, as the docs suggest?
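For anyone wanting to apply the same workaround in their own template, the change amounts to the following (a sketch of the relevant nginx.conf line; the shared-zone size is whatever your template already uses):

```nginx
# Before (ingress-nginx template default):
#   ssl_session_cache builtin:1000 shared:SSL:10m;
# After: keep only the shared cache, drop the builtin OpenSSL cache.
ssl_session_cache shared:SSL:10m;
```

Per the nginx docs quoted above, the shared cache alone is the recommended configuration anyway.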
Hello @sepich, glad you found a way to avoid the segfault in #7777. But I think it may be a workaround rather than a proper fix. According to the Nginx docs, using "builtin" and "shared" at the same time should be supported: http://nginx.org/en/docs/http/ngx_http_ssl_module.html#ssl_session_cache
Unfortunately, however, after talking to the OpenSSL and Nginx teams, I still cannot find where the bug is. https://github.com/openssl/openssl/issues/16733#issue-1014329932 http://mailman.nginx.org/pipermail/nginx-devel/2021-October/014372.html
Hello @rikatz, maybe you can help expose an annotation to enable "builtin", disabled by default? That way one could still reproduce it if anyone is interested in fixing the bug. I'm not sure it is worth it; using "builtin" together with "shared" may not be a good choice usually.
yeah sure, I will open a new PR and add that as a configuration :)
NGINX Ingress controller version: v0.41.2
Kubernetes version (use kubectl version):
Environment:
What happened: My NGINX ingress controllers started creating endless core dump files. This started to fill up some of my nodes' filesystems, creating disk pressure on them, and other pods started being evicted. I do not have any debug log set up, nor have I intentionally configured NGINX to create core dumps.
What you expected to happen: Not sure if preventing core dumps is the right way; gdb output is at the bottom.
How to reproduce it: Not sure I understand why it happens now. We do have autoscaling enabled and I don't think we reach the resource limits, so I'm not sure why it happens.
Anything else we need to know: I managed to copy the core dump, and tried to investigate it, but couldn't find anything verbose about it:
In the meantime, I added a LimitRange with a default ephemeral-storage limit of 10Gi to prevent pods from reaching the max node storage (my pods reached ~60Gi of storage usage from core dumps alone).
/kind bug
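The LimitRange mentioned above would look roughly like this (a sketch; the name and namespace are placeholders):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: ephemeral-storage-default   # placeholder name
  namespace: my-apps                # placeholder namespace
spec:
  limits:
  - type: Container
    default:
      ephemeral-storage: 10Gi      # default limit for containers without one
```

With this in place, a pod that exceeds 10Gi of ephemeral storage is evicted before it can fill the node's disk.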