kubernetes / ingress-nginx

Ingress NGINX Controller for Kubernetes
https://kubernetes.github.io/ingress-nginx/
Apache License 2.0

NGINX ingress creating endless core dumps #7080

Closed AmitBenAmi closed 3 years ago

AmitBenAmi commented 3 years ago

NGINX Ingress controller version: v0.41.2

Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.9-eks-d1db3c", GitCommit:"d1db3c46e55f95d6a7d3e5578689371318f95ff9", GitTreeState:"clean", BuildDate:"2020-10-20T22:18:07Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

Environment:

What happened: My NGINX ingress controllers started to create endless core dump files. These started to fill up some of my nodes' filesystems, creating disk pressure on them and causing other pods to be evicted. I do not have any debug logging set up, nor have I intentionally configured NGINX to create core dumps.

What you expected to happen: Not sure if preventing core dumps is the right approach; the gdb output is at the bottom.

How to reproduce it: I'm not sure why it started happening now. We do have autoscaling enabled and I don't think we reach the resource limits, so I cannot explain the trigger.

Anything else we need to know: I managed to copy a core dump and tried to investigate it, but couldn't find anything conclusive in it:

GNU gdb (GDB) 7.11
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from nginx...done.
[New LWP 4969]
[New LWP 4981]
[New LWP 4971]
[New LWP 4979]
[New LWP 4974]
[New LWP 4982]
[New LWP 4983]
[New LWP 4977]
[New LWP 4970]
[New LWP 4972]
[New LWP 4988]
[New LWP 4986]
[New LWP 4973]
[New LWP 4980]
[New LWP 4978]
[New LWP 4976]
[New LWP 4975]
[New LWP 4984]
[New LWP 4989]
[New LWP 4987]
[New LWP 4985]
[New LWP 4990]
[New LWP 5001]
[New LWP 4994]
[New LWP 4995]
[New LWP 4991]
[New LWP 5000]
[New LWP 4992]
[New LWP 4999]
[New LWP 4996]
[New LWP 4993]
[New LWP 4997]
[New LWP 4998]

warning: Unexpected size of section `.reg-xstate/4969' in core file.

warning: Can't read pathname for load map: No error information.
Core was generated by `nginx: worker process                               '.
Program terminated with signal SIGSEGV, Segmentation fault.

warning: Unexpected size of section `.reg-xstate/4969' in core file.
#0  0x00007fd38a72d3ab in ?? () from /lib/libcrypto.so.1.1
[Current thread is 1 (LWP 4969)]
(gdb) backtrace
#0  0x00007fd38a72d3ab in ?? () from /lib/libcrypto.so.1.1
#1  0x00007fd38a72be21 in ?? () from /lib/libcrypto.so.1.1
#2  0x00007fd38a72bf24 in ASN1_item_free () from /lib/libcrypto.so.1.1
#3  0x00007fd38a94e62b in SSL_SESSION_free () from /lib/libssl.so.1.1
#4  0x00007fd38a7e2cdc in OPENSSL_LH_doall_arg () from /lib/libcrypto.so.1.1
#5  0x00007fd38a94f76c in SSL_CTX_flush_sessions () from /lib/libssl.so.1.1
#6  0x00007fd38a965896 in ?? () from /lib/libssl.so.1.1
#7  0x00007fd38a959f48 in ?? () from /lib/libssl.so.1.1
#8  0x00007fd38a948ec2 in SSL_do_handshake () from /lib/libssl.so.1.1
#9  0x000055614f8c0174 in ngx_ssl_handshake (c=c@entry=0x7fd38a2c4418) at src/event/ngx_event_openssl.c:1694
#10 0x000055614f8c058d in ngx_ssl_handshake_handler (ev=0x7fd38a0ebc40) at src/event/ngx_event_openssl.c:2061
#11 0x000055614f8bac1f in ngx_epoll_process_events (cycle=0x55615199b2f0, timer=<optimized out>, flags=<optimized out>) at src/event/modules/ngx_epoll_module.c:901
#12 0x000055614f8adc62 in ngx_process_events_and_timers (cycle=cycle@entry=0x55615199b2f0) at src/event/ngx_event.c:257
#13 0x000055614f8b82fc in ngx_worker_process_cycle (cycle=0x55615199b2f0, data=<optimized out>) at src/os/unix/ngx_process_cycle.c:774
#14 0x000055614f8b6233 in ngx_spawn_process (cycle=cycle@entry=0x55615199b2f0, proc=0x55614f8b81d2 <ngx_worker_process_cycle>, data=0x0, name=0x55614f9dae3f "worker process", respawn=respawn@entry=0) at src/os/unix/ngx_process.c:199
#15 0x000055614f8b73aa in ngx_reap_children (cycle=cycle@entry=0x55615199b2f0) at src/os/unix/ngx_process_cycle.c:641
#16 0x000055614f8b9036 in ngx_master_process_cycle (cycle=0x55615199b2f0) at src/os/unix/ngx_process_cycle.c:174
#17 0x000055614f88ba00 in main (argc=<optimized out>, argv=<optimized out>) at src/core/nginx.c:385

In the meantime, I added a LimitRange with a default ephemeral-storage limit of 10Gi to keep the pods from exhausting node storage (my pods reached ~60Gi of storage usage from core dumps alone).
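For reference, a minimal sketch of such a LimitRange applied to the controller namespace (the name, namespace, and default request value are illustrative):

# namespace and values are illustrative
kubectl apply -n ingress-nginx -f - <<'EOF'
apiVersion: v1
kind: LimitRange
metadata:
  name: ephemeral-storage-limits
spec:
  limits:
  - type: Container
    default:
      ephemeral-storage: 10Gi   # default limit, as described above
    defaultRequest:
      ephemeral-storage: 1Gi    # illustrative default request
EOF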

/kind bug

tokers commented 3 years ago

It seems we don't have a mechanism to change the worker_rlimit_core directive. We may add this feature; what do you think?
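For reference, the underlying nginx directives live in the main (top-level) context of nginx.conf; they are not exposed by the controller today, so the values below are only illustrative:

# main context of nginx.conf (not exposed by ingress-nginx at the time of this issue)
worker_rlimit_core 2m;    # cap each core file at 2 MB; 0 disables core files entirely
working_directory  /tmp;  # directory where worker processes write core files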

AmitBenAmi commented 3 years ago

That sounds good on its own; however, it won't make NGINX create fewer core dumps. If I restrict the size to 2MB, it can still create thousands of dumps and still fill up my filesystem (unless I set it to 0, so that no core dumps are created at all, in which case I'm ignoring the problem rather than noticing that it exists).

tokers commented 3 years ago

That looks like an internal bug in OpenSSL; it's difficult to troubleshoot since the debug symbols were stripped. We may wait a while and see whether somebody reports a similar experience, which might be useful.

longwuyuan commented 3 years ago

Would you be interested in showing what your cluster looks like with:


- cat /proc/cpuinfo ; cat /proc/meminfo # from nodes
- helm ls -A
- kubectl get all,nodes,ing -A -o wide
- kubectl -n ingresscontrollernamespace describe po ingresscontrollerpod
- Get the nginx.conf from inside the pod and paste it here
- CPU/Memory/Inodes/Disk related status from your monitoring

AmitBenAmi commented 3 years ago

@longwuyuan I don't want to expose that kind of information about my environment. If there is something more specific I can maybe share it, but this is a lot of information. I can say that my ingress pods didn't terminate, they only created a significant amount of core dumps.

longwuyuan commented 3 years ago

Maybe write very clear details about the hardware, software, and config, plus the list of commands someone can execute (for example on minikube) to reproduce this problem.

AmitBenAmi commented 3 years ago

I have no idea how to reproduce this. My hardware is EKS (AWS EC2). The NGINX Docker image is v0.41.2.

As for configuration, I have thousands of Ingresses that automatically populate nginx.conf with hundreds of location blocks and other nginx directives.

Any idea how I can export a fuller interpretation of the dump to help understand the problem?

longwuyuan commented 3 years ago

Not every AWS EKS user is reporting the same behaviour. There was one other issue reporting core dumps; the best suggestion there was to spread the load. Any chance the problem is being caused by your use case only?

/remove-kind bug
/triage needs-information

AmitBenAmi commented 3 years ago

I double-checked and the load isn't different or suddenly too heavy. (Screenshot of the load metrics: Screen Shot 2021-04-29 at 10:52:10.)

I guess it is probably an error with something in my environment and not necessarily a bug in NGINX, but my nginx.conf consists of thousands of lines. @longwuyuan, do you have any idea where I should look in the configuration itself?

longwuyuan commented 3 years ago

You could be hitting a limit or a memory violation; it's hard to tell which until the core backtrace is explicit. Your earlier post shows '?' symbols in gdb, then frames in libcrypto and then libssl. I am no developer so I can't help much, but as someone said elsewhere, '?' means you are missing debug symbols. The crypto/ssl frames could mean all your TLS config is coming into play and nginx could not handle the size; as you say, you have thousands of Ingresses.

You could upgrade to the most recent release of the ingress controller, check and verify how to run gdb against the nginx core dumps, and post another backtrace that shows the size or any other details of the data structure it is complaining about:

Unexpected size of section `.reg-xstate/4969' in core file

Also, you could try to replicate the same number of objects in another cluster while spreading the load.
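For reference, loading one of these cores in gdb from inside the controller pod looks roughly like this (the binary and core paths are illustrative; check the kernel's core_pattern for where the dumps actually land):

# inside the ingress-nginx controller pod (paths illustrative)
cat /proc/sys/kernel/core_pattern          # where the kernel writes core files
gdb /sbin/nginx /tmp/core.nginx.4969       # open the core against the nginx binary
(gdb) bt                                   # backtrace of the crashing thread
(gdb) bt full                              # backtrace including local variables
(gdb) info threads                         # all threads captured in the core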

mitom commented 3 years ago

@tokers has the option to set worker_rlimit_core ever been added? We're now facing this issue and more or less know the root cause in our case: a chain of user error in configuring a number of certificates, which cert-manager endlessly retries to validate via the HTTP solver but fails because they're not set up properly, which leads to SSL errors in ingress-nginx, which seems to lead to core dumps, which fill the disk, and in the end everything is dead.

I realise that ignoring the core dumps is hiding the issue, but in our scenario that would be much preferable to taking out the entire ingress over some misconfigured certs.

AmitBenAmi commented 3 years ago

@mitom how did you determine that this chain is the root cause? Is it something you found in the core dumps themselves?

mitom commented 3 years ago

No, the core dumps only contained:

#0  0x00007fa81c6d9c59 in ?? () from /lib/ld-musl-x86_64.so.1
#1  0x00000000000000a0 in ?? ()
#2  0x00007fff2b051e20 in ?? ()
#3  0x00007fff2b051db0 in ?? ()
#4  0x0000000000000000 in ?? ()

which doesn't mean anything to me really.

It is more or less an educated guess, based on the fact that around the time we had this issue the controller logs were spammed with invalid-certificate errors.
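For reference, we spotted this by simply grepping the controller logs for certificate errors, e.g. (the namespace and deployment name are illustrative):

kubectl -n ingress-nginx logs deploy/ingress-nginx-controller | grep -i certificate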

k8s-triage-robot commented 3 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

sepich commented 3 years ago

We have the same issue in the coredump: https://github.com/kubernetes/ingress-nginx/issues/6896#issuecomment-929951627, even with a newer nginx and the Debian image.

more or less know the root cause in our case: a chain of user error in configuring a number of certificates, which cert-manager endlessly retries to validate via the HTTP solver but fails because they're not set up properly, which leads to SSL errors in ingress-nginx, which seems to lead to core dumps, which fill the disk, and in the end everything is dead.

We also use cert-manager, but we have no errors or unvalidated certs. There are no errors in either the cert-manager logs or ingress-nginx, but a worker still dies with "worker process 931 exited on signal 11". Another thing I've noticed is that nginx_ingress_controller_nginx_process_connections only grows and never decreases (graph omitted); each of those small steps up is a worker-death event. So per nginx stats there should currently be ~30k active connections, but if I log in to that exact pod there are only ~2k:

$ k -n ingress-nginx exec -it ingress-nginx-controller-5cf78859f4-7l9cc -- bash
bash-5.1$ netstat -tn | wc -l
2351

We cannot submit this issue to nginx upstream, because ingress-nginx compiles nginx from source with additional modules and patches. Also, I have pretty limited knowledge of gdb and debug symbols, so I was unable to find symbols for libssl on either Alpine or Debian to resolve this part of the coredump:

#2  0x00007f5bfdc7dd0d in OPENSSL_LH_doall_arg () from /lib/libcrypto.so.1.1
#3  0x00007f5bfddeb6d0 in SSL_CTX_flush_sessions () from /lib/libssl.so.1.1
#4  0x00007f5bfde01ad3 in ?? () from /lib/libssl.so.1.1
#5  0x00007f5bfddf5fb4 in ?? () from /lib/libssl.so.1.1

Any help would be greatly appreciated.

rikatz commented 3 years ago

Hey @sepich, thanks. I will start digging into OpenSSL problems now, since we could rule out the openresty bug.

Are you using NGINX v1.0.2?

Can you provide some further information about the size of your environment, the number of Ingresses, and the number of distinct SSL certificates?

Thanks

sepich commented 3 years ago

Are you using NGINX v1.0.2?

No, we are still on k8s 1.19 and therefore on ingress-nginx v0.49.2.

size of your environment

From https://github.com/kubernetes/ingress-nginx/issues/6896#issuecomment-928138962

It is just 215 Ingress objects / 80 rps, 3 ingress-nginx pods with 5% cpu load each.

99% of the Ingresses use SSL, so I would say it is about 215 certs as well. This number is pretty stable; Ingresses are not created and deleted every 5 minutes, more like once per week.

rikatz commented 3 years ago

Ok, thanks! Will check ASAP :)

rikatz commented 3 years ago

I'm wondering if this patch (https://github.com/openresty/openresty/blob/master/patches/openssl-1.1.1f-sess_set_get_cb_yield.patch), which Openresty applies, shouldn't be applied to our OpenSSL as well.

rikatz commented 3 years ago

@sepich if I generate an image of 0.49.3 (to be released) with the Openresty OpenSSL patch applied, would you be able to test it and provide some feedback?

doujiang24 commented 3 years ago

Hi @sepich, I have sent you an email to arrange a call with an interactive gdb session, as mentioned here: https://github.com/kubernetes/ingress-nginx/issues/6896#issuecomment-929951627

Thanks very much!

sepich commented 3 years ago

@rikatz, great finding! This patch was originally created in two parts, for both nginx and openssl: https://github.com/openresty/openresty/commit/97901f335709e8f3b2dec1c368bba20f3894fccc

https://github.com/openresty/lua-resty-core/blob/master/lib/ngx/ssl/session.md#description

This Lua API can be used to implement distributed SSL session caching for downstream SSL connections, thus saving a lot of full SSL handshakes which are very expensive.

I've checked that none of ngx.ssl.session, ssl_session_fetch_by_lua* and ssl_session_store_by_lua* is used in ingress-nginx. We also do not use any Lua code in ingress snippets. So I deleted the images/nginx/rootfs/patches/nginx-1.19.9-ssl_sess_cb_yield.patch file (to avoid rebuilding openssl), then rebuilt nginx and v0.49.2. But the issue and the coredump backtrace are the same:

#5  0x00007fdb5619efb4 in ?? () from /lib/libssl.so.1.1
#6  0x0000562d5b8b8c68 in ngx_ssl_handshake (c=c@entry=0x7fdb55a2fa20) at src/event/ngx_event_openssl.c:1720
#7  0x0000562d5b8b9081 in ngx_ssl_handshake_handler (ev=0x7fdb5588a0c0) at src/event/ngx_event_openssl.c:2069

But there is one more patch for ngx_event_openssl.c, nginx-1.19.9-ssl_cert_cb_yield.patch (see https://github.com/openresty/lua-nginx-module#ssl_certificate_by_lua_block). I checked that the ingress-nginx Lua code does not use this either, and rebuilt the image without this patch too, but the issue still remains. It looks like I misunderstood something; maybe you can build a test image with only the minimum set of patches needed to make ingress-nginx-controller work?

@doujiang24, got it!

rikatz commented 3 years ago

Yeap, I can.

Actually I already have a base image with the right patches, and proper linking:

ldd /sbin/nginx | grep ssl
        libssl.so.1.1 => /usr/local/openresty/openssl111/lib/libssl.so.1.1 (0x7f7be72b7000)
        libcrypto.so.1.1 => /usr/local/openresty/openssl111/lib/libcrypto.so.1.1 (0x7f7be6fc1000)

I have published this base image as rpkatz/nginx:patchedopenresty so you can build your own controller using it, for example from the "legacy/0.49x" branch.


rikatz commented 3 years ago

/remove-lifecycle stale

sepich commented 3 years ago

you can build your own controller using it

Thank you; unfortunately it still fails (v0.49.2 built on top of it):

Core was generated by `nginx: worker process                               '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f1b4834bb25 in ?? () from /usr/local/openresty/openssl111/lib/libssl.so.1.1
[Current thread is 1 (LWP 69)]
(gdb) bt
#0  0x00007f1b4834bb25 in ?? () from /usr/local/openresty/openssl111/lib/libssl.so.1.1
#1  0x00007f1b4819b86f in OPENSSL_LH_doall_arg () from /usr/local/openresty/openssl111/lib/libcrypto.so.1.1
#2  0x00007f1b4834cdb7 in SSL_CTX_flush_sessions () from /usr/local/openresty/openssl111/lib/libssl.so.1.1
#3  0x00007f1b48367d55 in ?? () from /usr/local/openresty/openssl111/lib/libssl.so.1.1
#4  0x00007f1b4835923d in ?? () from /usr/local/openresty/openssl111/lib/libssl.so.1.1
#5  0x0000561c111a7cda in ngx_ssl_handshake (c=c@entry=0x7f1b47b58b78) at src/event/ngx_event_openssl.c:1720
#6  0x0000561c111a80fe in ngx_ssl_handshake_handler (ev=0x7f1b479bc100) at src/event/ngx_event_openssl.c:2091
#7  0x0000561c111a26d3 in ngx_epoll_process_events (cycle=0x7f1b46488770, timer=<optimized out>, flags=<optimized out>) at src/event/modules/ngx_epoll_module.c:901
#8  0x0000561c111956b0 in ngx_process_events_and_timers (cycle=cycle@entry=0x7f1b46488770) at src/event/ngx_event.c:257
#9  0x0000561c1119fd7f in ngx_worker_process_cycle (cycle=0x7f1b46488770, data=<optimized out>) at src/os/unix/ngx_process_cycle.c:753
#10 0x0000561c1119ddc0 in ngx_spawn_process (cycle=cycle@entry=0x7f1b46488770, proc=proc@entry=0x561c1119fc76 <ngx_worker_process_cycle>, data=data@entry=0x0,
    name=name@entry=0x561c112c8037 "worker process", respawn=respawn@entry=-4) at src/os/unix/ngx_process.c:199
#11 0x0000561c1119ea55 in ngx_start_worker_processes (cycle=cycle@entry=0x7f1b46488770, n=1, type=type@entry=-4) at src/os/unix/ngx_process_cycle.c:373
#12 0x0000561c111a0939 in ngx_master_process_cycle (cycle=0x7f1b46488770, cycle@entry=0x7f1b47f761a0) at src/os/unix/ngx_process_cycle.c:234
#13 0x0000561c11172c17 in main (argc=<optimized out>, argv=<optimized out>) at src/core/nginx.c:386

Is it now possible to load openssl debug symbols somehow?

doujiang24 commented 3 years ago

@sepich The debug symbol package for openresty-openssl111 is openresty-openssl111-dbg. You can try to install it by apk add openresty-openssl111-dbg.

sepich commented 3 years ago

Thanks, it is:

echo https://openresty.org/package/alpine/v3.14/main >> /etc/apk/repositories
apk --allow-untrusted add openresty-openssl111-dbg

seems to be working:

(gdb) bt
#0  0x00007f1b4834bb25 in timeout_cb (s=0x7f1b471dc770, p=0x7ffd17e35850) at ssl/ssl_sess.c:1067
#1  0x00007f1b4819b86f in doall_util_fn (arg=0x7ffd17e35850, arg@entry=0x7ffd17e35810, func_arg=func_arg@entry=0x7f1b4834bb10 <timeout_cb>, func=0x0, use_arg=1, lh=0x7f1b45a1ab50)
    at crypto/lhash/lhash.c:196
#2  OPENSSL_LH_doall_arg (lh=0x7f1b45a1ab50, func=func@entry=0x7f1b4834bb10 <timeout_cb>, arg=arg@entry=0x7ffd17e35850) at crypto/lhash/lhash.c:211
#3  0x00007f1b4834cdb7 in lh_SSL_SESSION_doall_TIMEOUT_PARAM (arg=0x7ffd17e35850, fn=0x7f1b4834bb10 <timeout_cb>, lh=<optimized out>) at ssl/ssl_sess.c:1081
#4  SSL_CTX_flush_sessions (s=0x7f1b45a11390, t=<optimized out>) at ssl/ssl_sess.c:1096
#5  0x00007f1b48343e98 in ssl_update_cache (s=s@entry=0x7f1b391c3980, mode=mode@entry=2) at ssl/ssl_lib.c:3562
#6  0x00007f1b48367d55 in tls_construct_new_session_ticket (s=0x7f1b391c3980, pkt=<optimized out>) at ssl/statem/statem_srvr.c:4192
#7  0x00007f1b4835923d in write_state_machine (s=0x7f1b391c3980) at ssl/statem/statem.c:843
#8  state_machine (s=0x7f1b391c3980, server=1) at ssl/statem/statem.c:443
#9  0x0000561c111a7cda in ngx_ssl_handshake (c=c@entry=0x7f1b47b58b78) at src/event/ngx_event_openssl.c:1720
#10 0x0000561c111a80fe in ngx_ssl_handshake_handler (ev=0x7f1b479bc100) at src/event/ngx_event_openssl.c:2091
#11 0x0000561c111a26d3 in ngx_epoll_process_events (cycle=0x7f1b46488770, timer=<optimized out>, flags=<optimized out>) at src/event/modules/ngx_epoll_module.c:901
#12 0x0000561c111956b0 in ngx_process_events_and_timers (cycle=cycle@entry=0x7f1b46488770) at src/event/ngx_event.c:257
#13 0x0000561c1119fd7f in ngx_worker_process_cycle (cycle=0x7f1b46488770, data=<optimized out>) at src/os/unix/ngx_process_cycle.c:753
#14 0x0000561c1119ddc0 in ngx_spawn_process (cycle=cycle@entry=0x7f1b46488770, proc=proc@entry=0x561c1119fc76 <ngx_worker_process_cycle>, data=data@entry=0x0,
    name=name@entry=0x561c112c8037 "worker process", respawn=respawn@entry=-4) at src/os/unix/ngx_process.c:199
#15 0x0000561c1119ea55 in ngx_start_worker_processes (cycle=cycle@entry=0x7f1b46488770, n=1, type=type@entry=-4) at src/os/unix/ngx_process_cycle.c:373
#16 0x0000561c111a0939 in ngx_master_process_cycle (cycle=0x7f1b46488770, cycle@entry=0x7f1b47f761a0) at src/os/unix/ngx_process_cycle.c:234
#17 0x0000561c11172c17 in main (argc=<optimized out>, argv=<optimized out>) at src/core/nginx.c:386

(gdb) bt full
#0  0x00007f1b4834bb25 in timeout_cb (s=0x7f1b471dc770, p=0x7ffd17e35850) at ssl/ssl_sess.c:1067
No locals.
#1  0x00007f1b4819b86f in doall_util_fn (arg=0x7ffd17e35850, arg@entry=0x7ffd17e35810, func_arg=func_arg@entry=0x7f1b4834bb10 <timeout_cb>, func=0x0, use_arg=1, lh=0x7f1b45a1ab50)
    at crypto/lhash/lhash.c:196
        i = 1781
        a = <optimized out>
        n = 0x0
        i = <optimized out>
        a = <optimized out>
        n = <optimized out>
#2  OPENSSL_LH_doall_arg (lh=0x7f1b45a1ab50, func=func@entry=0x7f1b4834bb10 <timeout_cb>, arg=arg@entry=0x7ffd17e35850) at crypto/lhash/lhash.c:211
No locals.
#3  0x00007f1b4834cdb7 in lh_SSL_SESSION_doall_TIMEOUT_PARAM (arg=0x7ffd17e35850, fn=0x7f1b4834bb10 <timeout_cb>, lh=<optimized out>) at ssl/ssl_sess.c:1081
No locals.
#4  SSL_CTX_flush_sessions (s=0x7f1b45a11390, t=<optimized out>) at ssl/ssl_sess.c:1096
        i = 256
        tp = {ctx = 0x7f1b45a11390, time = 1633097512, cache = 0x7f1b45a1ab50}
#5  0x00007f1b48343e98 in ssl_update_cache (s=s@entry=0x7f1b391c3980, mode=mode@entry=2) at ssl/ssl_lib.c:3562
        stat = <optimized out>
        i = <optimized out>
#6  0x00007f1b48367d55 in tls_construct_new_session_ticket (s=0x7f1b391c3980, pkt=<optimized out>) at ssl/statem/statem_srvr.c:4192
        tctx = <optimized out>
        tick_nonce = "\000\000\000\000\000\000\000\001"
        age_add_u = {age_add_c = "Ɩ\240", <incomplete sequence \350>, age_add = 3902838470}
        err = <optimized out>
#7  0x00007f1b4835923d in write_state_machine (s=0x7f1b391c3980) at ssl/statem/statem.c:843
        post_work = 0x7f1b48368c10 <ossl_statem_server_post_work>
        mt = 4
        pkt = {buf = 0x7f1b399499b0, staticbuf = 0x0, curr = 57, written = 57, maxsize = 18446744073709551615, subs = 0x7f1b371d0b20}
        ret = <optimized out>
        pre_work = 0x7f1b483689e0 <ossl_statem_server_pre_work>
        get_construct_message_f = 0x7f1b48368fd0 <ossl_statem_server_construct_message>
        confunc = 0x7f1b483675f0 <tls_construct_new_session_ticket>
        st = 0x7f1b391c39c8
        transition = 0x7f1b48368580 <ossl_statem_server_write_transition>
        cb = 0x561c111a4491 <ngx_ssl_info_callback>
        st = <optimized out>
        ret = <optimized out>
        transition = <optimized out>
        pre_work = <optimized out>
        post_work = <optimized out>
        get_construct_message_f = <optimized out>
        cb = <optimized out>
        confunc = <optimized out>
        mt = <optimized out>
        pkt = {buf = <optimized out>, staticbuf = <optimized out>, curr = <optimized out>, written = <optimized out>, maxsize = <optimized out>, subs = <optimized out>}
#8  state_machine (s=0x7f1b391c3980, server=1) at ssl/statem/statem.c:443
        buf = 0x0
        cb = 0x561c111a4491 <ngx_ssl_info_callback>
        st = <optimized out>
        ret = <optimized out>
        ssret = <optimized out>
#9  0x0000561c111a7cda in ngx_ssl_handshake (c=c@entry=0x7f1b47b58b78) at src/event/ngx_event_openssl.c:1720
        n = <optimized out>
        sslerr = <optimized out>
        err = <optimized out>
        rc = <optimized out>
#10 0x0000561c111a80fe in ngx_ssl_handshake_handler (ev=0x7f1b479bc100) at src/event/ngx_event_openssl.c:2091
        c = 0x7f1b47b58b78

doujiang24 commented 3 years ago

@sepich Great, timeout_cb is the first key function we want to verify.

sepich commented 3 years ago

@rikatz, could you please share how to build an image like rpkatz/nginx:patchedopenresty? (@doujiang24 asked me to add some debugging to openssl and test.)

rikatz commented 3 years ago

https://github.com/kubernetes/ingress-nginx/pull/7732 This way :)

sepich commented 3 years ago

While recompiling openssl I found a workaround for this issue: edit the nginx.conf line ssl_session_cache builtin:1000 shared:SSL:10m; and drop builtin:1000. From the docs:

builtin: a cache built in OpenSSL; use of the built-in cache can cause memory fragmentation.

using only shared cache without the built-in cache should be more efficient.

Unfortunately this is not exposed via an annotation, so I had to edit the template. There is even a Stack Overflow article about this. It would be interesting to know why builtin:1000 is hardcoded. I understand that this is not a fix for the openssl issue, but maybe drop builtin from the template for everybody, as the docs suggest?
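For reference, the change in the rendered nginx.conf (or the nginx.tmpl template) is just this one directive:

# before (as hardcoded in the template)
ssl_session_cache builtin:1000 shared:SSL:10m;
# workaround: use only the shared cache
ssl_session_cache shared:SSL:10m;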

doujiang24 commented 3 years ago

Hello @sepich, glad you found a way to avoid the segfault in #7777. But I think it is a workaround, not a proper fix. According to the Nginx docs, using "builtin" and "shared" at the same time should be supported: http://nginx.org/en/docs/http/ngx_http_ssl_module.html#ssl_session_cache

Unfortunately, after talking to the OpenSSL and Nginx teams, I still cannot find where the bug is: https://github.com/openssl/openssl/issues/16733#issue-1014329932 http://mailman.nginx.org/pipermail/nginx-devel/2021-October/014372.html

Hello @rikatz, maybe you can help expose an annotation to enable "builtin", disabled by default? That way someone could still try to reproduce it if they are interested in fixing the bug. I'm not sure it is worth it; using "builtin" and "shared" together may not usually be a good choice.

rikatz commented 3 years ago

yeah sure, I will open a new PR and add that as a configuration :)