We have released v0.49 with all the Lua stuff updated.
I will test here, but if someone is willing to help and provide some tests as well, please be my guest :)
OK, so far, from both ingresses (the new one and v0.45) I can see some race conditions while writing the file:
CPU usage in v0.45 is also much higher than in v1.0.0-beta3.
Will leave the environment running for a bit and check later.
There are 600 Ingress objects, all pointing to the same Service. Maybe I need to add some more spice to this?
E0822 20:21:59.148743 7 controller.go:158] Unexpected failure reloading the backend:
exit status 1
2021/08/22 20:21:59 [emerg] 16153#16153: unexpected end of file, expecting ";" or "}" in /etc/nginx/nginx.conf:84190
nginx: [emerg] unexpected end of file, expecting ";" or "}" in /etc/nginx/nginx.conf:84190
E0822 20:21:59.148786 7 queue.go:130] "requeuing" err="exit status 1\n2021/08/22 20:21:59 [emerg] 16153#16153: unexpected end of file, expecting \";\" or \"}\" in /etc/nginx/nginx.conf:84190\nnginx: [emerg] unexpected end of file, expecting \";\" or \"}\" in /etc/nginx/nginx.conf:84190\n" key="default/test-294"
I0822 20:21:59.148827 7 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"namespace-old", Name:"ingress-nginx-controller-5cf8cb98c5-j8zzp", UID:"195cd571-cb4b-4869-b055-6bdf9a92625b", APIVersion:"v1", ResourceVersion:"11172", FieldPath:""}): type: 'Warning' reason: 'RELOAD' Error reloading NGINX: exit status 1
2021/08/22 20:21:59 [emerg] 16153#16153: unexpected end of file, expecting ";" or "}" in /etc/nginx/nginx.conf:84190
nginx: [emerg] unexpected end of file, expecting ";" or "}" in /etc/nginx/nginx.conf:84190
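(If anyone wants to reproduce a similar setup, a rough sketch along these lines could create hundreds of Ingress objects pointing at a single Service; all names, the namespace, hosts, and the backend here are placeholders, and older controllers would need the networking.k8s.io/v1beta1 API instead.)
```sh
# Sketch only: generate many Ingress objects routing to one hypothetical "echo" Service,
# to force frequent full config reloads on the controller.
for i in $(seq 1 600); do
  cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: test-$i
  namespace: default
spec:
  ingressClassName: nginx
  rules:
  - host: test-$i.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: echo
            port:
              number: 80
EOF
done
```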
Deployed on one of our non-critical ingress classes. It started without a segfault this time. I will test on another, higher-load ingress sometime this week and report back.
Thanks @laghoule and please keep me posted!
OK, on startup of one of my ingress controllers with 405 Ingresses, on v0.49.0:
I0825 13:14:40.853435 6 controller.go:148] "Configuration changes detected, backend reload required"
I0825 13:14:42.135939 6 controller.go:165] "Backend successfully reloaded"
I0825 13:14:42.136319 6 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"ingress", Name:"ingress-nginx-internal-controller-7ff4c6cdcc-qbhjt", UID:"09cb3f86-1187-4565-86a9-fe483c505e4c", APIVersion:"v1", ResourceVersion:"587431195", FieldPath:""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration
2021/08/25 13:14:43 [alert] 26#26: worker process 101 exited on signal 11 (core dumped)
2021/08/25 13:14:43 [alert] 26#26: worker process 102 exited on signal 11 (core dumped)
2021/08/25 13:14:43 [alert] 26#26: worker process 31 exited on signal 11 (core dumped)
2021/08/25 13:14:43 [alert] 26#26: worker process 169 exited on signal 11 (core dumped)
2021/08/25 13:14:43 [alert] 26#26: worker process 202 exited on signal 11 (core dumped)
2021/08/25 13:14:43 [alert] 26#26: worker process 235 exited on signal 11 (core dumped)
2021/08/25 13:14:43 [alert] 26#26: worker process 268 exited on signal 11 (core dumped)
2021/08/25 13:14:43 [alert] 26#26: worker process 301 exited on signal 11 (core dumped)
2021/08/25 13:14:44 [alert] 26#26: worker process 334 exited on signal 11 (core dumped)
2021/08/25 13:14:44 [alert] 26#26: worker process 400 exited on signal 11 (core dumped)
2021/08/25 13:14:44 [alert] 26#26: worker process 367 exited on signal 11 (core dumped)
2021/08/25 13:14:46 [alert] 26#26: worker process 433 exited on signal 11 (core dumped)
2021/08/25 13:14:46 [alert] 26#26: worker process 466 exited on signal 11 (core dumped)
2021/08/25 13:14:46 [alert] 26#26: worker process 499 exited on signal 11 (core dumped)
2021/08/25 13:14:48 [alert] 26#26: worker process 532 exited on signal 11 (core dumped)
2021/08/25 13:14:48 [alert] 26#26: worker process 565 exited on signal 11 (core dumped)
2021/08/25 13:14:51 [alert] 26#26: worker process 598 exited on signal 11 (core dumped)
2021/08/25 13:14:54 [alert] 26#26: worker process 631 exited on signal 11 (core dumped)
2021/08/25 13:14:54 [alert] 26#26: worker process 664 exited on signal 11 (core dumped)
2021/08/25 13:14:57 [alert] 26#26: worker process 697 exited on signal 11 (core dumped)
2021/08/25 13:14:57 [alert] 26#26: worker process 730 exited on signal 11 (core dumped)
Reverting to v0.43.0 fixes the segfault.
From what I'm reading, it is possible that there's some issue with musl and memory allocation. Has anyone tried using mimalloc instead? You can do it by adding this to your ingress-nginx container:
env:
- name: LD_PRELOAD
value: /usr/local/lib/libmimalloc.so
Note that I'm not suggesting this as a permanent solution. But if it solves the problem, we can confirm that the issue is with Alpine/musl and then look into a permanent solution (e.g. changing the Alpine version or using a Debian-based base image).
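(A hedged sketch of one way to apply this without editing manifests by hand; the deployment and namespace names below are the common chart defaults and may differ in your setup, and it assumes the container already defines an env list.)
```sh
# Append LD_PRELOAD to the first container of the controller Deployment.
kubectl -n ingress-nginx patch deployment ingress-nginx-controller --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/env/-","value":{"name":"LD_PRELOAD","value":"/usr/local/lib/libmimalloc.so"}}]'
```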
@ElvinEfendi we already use mimalloc in our deployment. I think I have not tried without it, humm...
My company uses 0.46 on the production cluster, which has high usage, with no issues at all.
The development cluster does have the issue, though; the only difference is that the dev cluster has many more Ingresses and reloads more frequently.
No more restarts for 2 days with v0.49 on the dev cluster.
Both 0.46 and 0.49 with
env:
- name: LD_PRELOAD
value: /usr/local/lib/libmimalloc.so
We have a couple of k8s clusters with a stable rate of signal 11 errors on v0.43.0.
I've tried switching to libmimalloc.so, then to v0.49.0 + libmimalloc.so, but it had no effect.
So I had to try building a Debian-based image for v0.49.0, based on the work in https://github.com/kubernetes/ingress-nginx/pull/7593:
# build image with nginx 1.20
git checkout rikatz/new-base-image
docker build -t nginx-debian images/nginx-debian/rootfs/
# use controller go binaries from original image
git checkout controller-v0.49.0
# quote the delimiter so the ${...} build args and shell variables stay literal in the generated Dockerfile
cat << 'EOF' > rootfs/Dockerfile
ARG BASE_IMAGE
FROM k8s.gcr.io/ingress-nginx/controller:v0.49.0 as orig
FROM ${BASE_IMAGE}
ARG TARGETARCH
ARG VERSION
ARG COMMIT_SHA
ARG BUILD_ID=UNSET
LABEL org.opencontainers.image.title="NGINX Ingress Controller for Kubernetes"
LABEL org.opencontainers.image.documentation="https://kubernetes.github.io/ingress-nginx/"
LABEL org.opencontainers.image.source="https://github.com/kubernetes/ingress-nginx"
LABEL org.opencontainers.image.vendor="The Kubernetes Authors"
LABEL org.opencontainers.image.licenses="Apache-2.0"
LABEL org.opencontainers.image.version="${VERSION}"
LABEL org.opencontainers.image.revision="${COMMIT_SHA}"
LABEL build_id="${BUILD_ID}"
WORKDIR /etc/nginx
COPY --chown=www-data:www-data --from=orig /etc/nginx /etc/nginx
COPY --chown=www-data:www-data --from=orig /dbg /
COPY --chown=www-data:www-data --from=orig /nginx-ingress-controller /
COPY --chown=www-data:www-data --from=orig /wait-shutdown /
# Fix permission during the build to avoid issues at runtime
# with volumes (custom templates)
RUN bash -xeu -c ' \
writeDirs=( \
/etc/ingress-controller \
/etc/ingress-controller/ssl \
/etc/ingress-controller/auth \
/var/log \
/var/log/nginx \
); \
for dir in "${writeDirs[@]}"; do \
mkdir -p ${dir}; \
chown -R www-data.www-data ${dir}; \
done'
RUN apt-get update -qq && apt-get install -y libcap2-bin \
&& setcap cap_net_bind_service=+ep /nginx-ingress-controller \
&& setcap -v cap_net_bind_service=+ep /nginx-ingress-controller \
&& setcap cap_net_bind_service=+ep /usr/local/nginx/sbin/nginx \
&& setcap -v cap_net_bind_service=+ep /usr/local/nginx/sbin/nginx \
&& setcap cap_net_bind_service=+ep /usr/bin/dumb-init \
&& setcap -v cap_net_bind_service=+ep /usr/bin/dumb-init \
&& apt-get purge -y libcap2 libcap2-bin libpam-cap
USER www-data
# Create symlinks to redirect nginx logs to stdout and stderr docker log collector
RUN ln -sf /dev/stdout /var/log/nginx/access.log \
&& ln -sf /dev/stderr /var/log/nginx/error.log
ENTRYPOINT ["/usr/bin/dumb-init", "--"]
CMD ["/nginx-ingress-controller"]
EOF
# build image with controller ontop of nginx-debian
ARCH=amd64 BASE_IMAGE=nginx-debian REGISTRY=sepa make image
For those who want to test, the image is available as sepa/ingress-nginx-debian:v0.49.0.
Unfortunately, the error now looks like this:
Sep 10, 2021 @ 17:26:47.971 | 2021/09/10 14:26:47 [alert] 26#26: worker process 69 exited on signal 6
Sep 10, 2021 @ 17:26:47.904 | double free or corruption (!prev)
From the metrics it looks like some kind of memory leak: each of those dips in avg(nginx_ingress_controller_nginx_process_resident_memory_bytes) is a signal 11 error.
At 17:00 I switched 3 pods to the Debian-based image; then at 17:24 and 17:26 there are dips from signal 6 errors.
So a similar crash occurs both under musl and glibc. It seems that there actually is a double free / memory corruption bug somewhere. All the stack traces posted here indicate that they happen in some Lua module (using FFI I guess?). I wonder if some more information about what actually is being freed from Lua could be found in the core dumps?
I've tried reproducing this in a test cluster a couple of times now with no success. I haven't been able to capture a core dump from any of our production clusters either, so I can't investigate this further myself.
Can someone who is frequently seeing these crashes perhaps share if they are using any of the more special features in ingress-nginx, especially ones that are implemented in Lua (e.g. global rate limiting or non-default load balancing options)? Or is someone even able to share the full nginx config file from an ingress controller where the crashes occur?
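(For anyone willing to share their configuration, a minimal sketch of pulling the rendered config out of a running controller pod; the namespace, label selector, and binary path are the usual ingress-nginx defaults and are assumptions about your setup.)
```sh
POD=$(kubectl -n ingress-nginx get pods \
  -l app.kubernetes.io/name=ingress-nginx -o jsonpath='{.items[0].metadata.name}')
# the config file the controller rendered:
kubectl -n ingress-nginx exec "$POD" -- cat /etc/nginx/nginx.conf > nginx.conf
# or the full configuration nginx actually parsed, including included files:
kubectl -n ingress-nginx exec "$POD" -- /usr/local/nginx/sbin/nginx -T > nginx-full.conf
```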
Thank you all for the help on the investigation.
So, following @luryus, I've seen that between 0.43 and 0.44 the only Lua feature implemented was Global Rate Limiting -> https://github.com/kubernetes/ingress-nginx/pull/6673
I'm wondering about generating a version on my repo that reverts only this specific PR and asking some of the folks here to test it, to check whether it is responsible for the segfault.
I'll also investigate the Lua module and see if I can figure out whether something weird is happening there.
Hi folks,
In an attempt to identify which module is causing this, and based on all your feedback (thanks!!), I've published a new version of ingress-nginx without lua-resty-ipmatcher.
Can you please run some tests and give some feedback?
Here are the images:
- Legacy/v0.49 -> rpkatz/nginx-ingress:v0.49-nomatcher
- Main/v1.0.0 -> rpkatz/nginx-ingress:v1.0.0-nomatcher
Please note that if you use the global rate limiter feature, this REMOVES THE ALLOW LIST.
Thanks!!
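(To try these images without a full redeploy, something like the following could work; the deployment, namespace, and container name are the usual chart defaults and only assumptions here.)
```sh
# Point the existing controller Deployment at the test image and wait for the rollout.
kubectl -n ingress-nginx set image deployment/ingress-nginx-controller \
  controller=rpkatz/nginx-ingress:v0.49-nomatcher
kubectl -n ingress-nginx rollout status deployment/ingress-nginx-controller
```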
Why not just disable the feature that depends on the ipmatcher? Preparing two images seems like a lot of work.
> Why not just disable the feature that depends on the ipmatcher? Preparing two images seems like a lot of work.
@tokers it was mostly about not making a new release just to confirm that the problem IS the ipmatcher.
We (at least I) are not sure yet whether this is the problem or not.
Hi folks, I'm from the OpenResty community, one of the OpenResty core developers. I think I can help to debug this segmentation fault issue. Could someone help to:
- provide a docker image of the crash scene that contains a core file and all binary files in the running container, so that we can use gdb to debug it locally. If you do not want to provide it in public, you can send it to my personal email (doujiang24@gmail.com);
- or, install OpenResty XRay in the affected container and share the XRay console URL with me: https://openresty.com/en/xray/;
- or, use openresty-gdb-utils to get the Lua backtrace yourself: https://github.com/openresty/openresty-gdb-utils
Here is my debug plan for now:
- frame 3
- info symbol *(long *)$rbx
We may dig more after getting the results from the previous steps.
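(For anyone attempting the first option, a rough sketch of how core files might be captured from a controller pod; the core_pattern value, ConfigMap name, and paths are assumptions about a typical setup, not instructions from this thread.)
```sh
# On the node (kernel.core_pattern is node-wide): write cores to a path the worker can reach.
echo '/tmp/core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern

# Allow nginx workers to write large cores by injecting directives into the main
# nginx context via the controller ConfigMap (main-snippet):
kubectl -n ingress-nginx patch configmap ingress-nginx-controller --type merge \
  -p '{"data":{"main-snippet":"worker_rlimit_core 2048m; working_directory /tmp;"}}'

# After the next crash, copy the core file(s) out of the pod for analysis:
kubectl -n ingress-nginx cp "$POD:/tmp" ./cores
```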
Thanks @doujiang24 I have notified @rikatz on slack.
So I did the following changes in our ingress-nginx fork and since then I don't see the crashes (Mind you, our clusters do not remain static either, so what I'm about to say might not necessarily be what "solved" the problem for us. It could be just incidental timing):
Here's the patch file including everything I described above: https://gist.github.com/ElvinEfendi/c962b4fc5b73bea6ddaea02c6c49718b
thanks @doujiang24 and @ElvinEfendi
Questions:
@rikatz yeah, downgrading to v1.19.9 with the resolver patch should be a better choice.
Using the OpenResty release is always the recommended way. We usually need some fixes when upgrading the NGINX core for a new OpenResty release in my experience.
By the way, the docker image with the core file is still welcome.
My gut feeling is that this segmentation fault has nothing to do with the NGINX core (just a guess, not sure yet).
@ReillyTevera @yuha0 @alfianabdi @laghoule thanks for your patience with this. As you can see from the most recent comments above, it would really add a lot of value to this effort if you were able to provide one or more core dump files to @doujiang24.
Hoping for the best
I applied the patches from the MR to the controller-v0.49.0 tag and nothing changed; we still have errors on production workloads. Are there any patches in main that fix this for the 0.49.0 version?
@longwuyuan I will try to have some time next week to test & provide core dump
Thanks @laghoule
@rvadim we are going to release a new version containing the downgrade to 1.19 (and some other fixes) probably this weekend, and I count on you to test this for us!
Thanks!
@rvadim Do you mean the segmentation fault errors? If yes, do you have core files? It will be very helpful if you could provide a core file as said in https://github.com/kubernetes/ingress-nginx/issues/6896#issuecomment-918845073. Thanks!
> Do you mean the segmentation fault errors?
Yes, [alert] 28#28: worker process 107 exited on signal 11
> If yes, do you have core files?
Unfortunately, no :(
> It will be very helpful if you could provide a core file as said in #6896 (comment).
Will try today.
Thank you.
Hi folks. In the latest releases (0.49.1 and 1.0.1), nginx has been downgraded to v1.19.9 and Alpine was upgraded to 3.14.2.
Can you please test them and give us some feedback on whether the problem persists?
Thank you so much!
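(A quick, non-authoritative way to double-check the nginx version baked into the new image locally; the tag is taken from the announcement above and the binary path matches the one seen in the backtraces, but you may need to pin the digest as in the comments below.)
```sh
# Print the nginx version string from the image without starting the controller.
docker run --rm --entrypoint /usr/local/nginx/sbin/nginx \
  k8s.gcr.io/ingress-nginx/controller:v0.49.1 -V 2>&1 | head -n 1
```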
Hi! Thank you for the update.
Tested with k8s.gcr.io/ingress-nginx/controller:v0.49.1@sha256:4080849490fd4f61416ad7bdba6fdd0ee58164f2e13df1128b407ce82d5f3986
No luck. Same error, but crashes are rarer: once every 30 minutes instead of once every 10 minutes.
@rvadim do you mind posting the coredumps requested on previous comments, so openresty developers can take a look?
Thanks!!
Yeah, core files are welcome! Thanks!
> @rvadim do you mind posting the coredumps requested on previous comments, so openresty developers can take a look?
Sorry, I cannot. The data is under NDA :(
Maybe this will help:
[Fri Sep 24 06:21:59 2021] nginx[438455]: segfault at 7f696171df08 ip 00007f696e131c0c sp 00007ffd10c81b90 error 4 in libssl.so.1.1[7f696e114000+40000]
[Fri Sep 24 06:41:38 2021] nginx[457573]: segfault at 7f696099b178 ip 00007f696e131c0c sp 00007ffd10c81b90 error 4 in libssl.so.1.1[7f696e114000+40000]
[Fri Sep 24 06:57:41 2021] nginx[466361]: segfault at 10 ip 00007f696e27d896 sp 00007ffd10c81b78 error 4 in ld-musl-x86_64.so.1[7f696e26f000+48000]
[Fri Sep 24 07:34:05 2021] nginx[473578]: segfault at 10 ip 00007f696e27d896 sp 00007ffd10c81b78 error 4 in ld-musl-x86_64.so.1[7f696e26f000+48000]
@rvadim Thanks anyway.
I'm currently experimenting with v0.49.1, but so far there are no crashes (100m running). As soon as I get a crash with a dump I can send it to you @doujiang24 by e-mail, but I'm not allowed to share it with the wider public.
Same experience. Much less frequent, but more core dumps at once. Do you have any idea what type of traffic causes it? I have lots of different traffic flowing through it and could try to reproduce it if there is any indication of the type of file involved.
I have core dumps but cannot provide at this time. If you can provide details you need from the core dumps I can run them.
Thank you for your help running this down.
Hi @derek-burdick Please follow these steps, Thanks!
- Use gdb: bt (bt full is also helpful), like the backtrace in https://github.com/kubernetes/ingress-nginx/issues/6896#issue-813960604
- Go to the lj_vm_ffi_call frame by using frame N, where N is the frame number in the backtrace; N is 3 for the backtrace sample in https://github.com/kubernetes/ingress-nginx/issues/6896#issue-813960604
- Run info symbol *(long *)$rbx
- Use the lbt command in openresty-gdb-utils (https://github.com/openresty/openresty-gdb-utils#lbt) to get the Lua backtrace.
Also, there are other ways that could be easier, as said in https://github.com/kubernetes/ingress-nginx/issues/6896#issuecomment-918845073, like using OpenResty XRay.
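(If it's easier, the same commands could be run non-interactively against a core file with gdb's batch mode; the binary and core paths are placeholders, and the frame 3 step only applies if the backtrace is shaped like the referenced sample.)
```sh
# Run this inside (a copy of) the exact controller image so gdb resolves the
# same libraries the crashed worker used.
gdb -batch \
    -ex 'bt' \
    -ex 'bt full' \
    -ex 'frame 3' \
    -ex 'info symbol *(long *)$rbx' \
    /usr/local/nginx/sbin/nginx /tmp/core.nginx > gdb-output.txt 2>&1
```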
Core was generated by `nginx: master process /usr/local/nginx/sbin/nginx -c'. Program terminated with signal SIGSEGV, Segmentation fault.
108 ./arch/x86_64/atomic_arch.h: No such file or directory. [Current thread is 1 (LWP 1209)]
(gdb) bt
(gdb) frame 3
(gdb) info symbol *(long *)$rbx
chash_point_sort in section .text of /usr/local/lib/lua/librestychash.so
(gdb) bt full
No locals.
reserved = 1100
reserved = <optimized out>
g = 0x55e4601a85b8
idx = 2
stride = 9344
start = 0x7f260a4de910 ""
end = 0x7f260a4e0d8c ""
self = <optimized out>
all = <optimized out>
mi = {base = <optimized out>, len = <optimized out>}
No symbol table info available.
cc = {func = 0x7f26096d2290 <chash_point_sort>, spadj = 8, nsp = 0 '\000', retref = 0 '\000', ngpr = 0 '\000', nfpr = 0 '\000', fpr = {{d = {0, 0}, f = {0, 0, 0, 0}, b = '\000' <repeats 15 times>, s = {0, 0, 0, 0, 0, 0, 0, 0}, i = {0, 0, 0, 0},
l = {0, 0}}, {d = {0, 0}, f = {0, 0, 0, 0}, b = '\000' <repeats 15 times>, s = {0, 0, 0, 0, 0, 0, 0, 0}, i = {0, 0, 0, 0}, l = {0, 0}}, {d = {0, 0}, f = {0, 0, 0, 0}, b = '\000' <repeats 15 times>, s = {0, 0, 0, 0, 0, 0, 0, 0}, i = {0, 0,
0, 0}, l = {0, 0}}, {d = {0, 0}, f = {0, 0, 0, 0}, b = '\000' <repeats 15 times>, s = {0, 0, 0, 0, 0, 0, 0, 0}, i = {0, 0, 0, 0}, l = {0, 0}}, {d = {0, 0}, f = {0, 0, 0, 0}, b = '\000' <repeats 15 times>, s = {0, 0, 0, 0, 0, 0, 0, 0},
i = {0, 0, 0, 0}, l = {0, 0}}, {d = {0, 0}, f = {0, 0, 0, 0}, b = '\000' <repeats 15 times>, s = {0, 0, 0, 0, 0, 0, 0, 0}, i = {0, 0, 0, 0}, l = {0, 0}}, {d = {0, 0}, f = {0, 0, 0, 0}, b = '\000' <repeats 15 times>, s = {0, 0, 0, 0, 0, 0,
0, 0}, i = {0, 0, 0, 0}, l = {0, 0}}, {d = {0, 0}, f = {0, 0, 0, 0}, b = '\000' <repeats 15 times>, s = {0, 0, 0, 0, 0, 0, 0, 0}, i = {0, 0, 0, 0}, l = {0, 0}}}, gpr = {139801194063304, 480, 0, 0, 0, 0}, stack = {139801194062632,
-563886247713936, 139801194062760, 139801390819424, 139801345901440, 6874693240, 96, 514, 0, 139801194062760, 139801194062632, 139801343553632, 139801343538536, 139801345975368, 140727183179372, 139801343455376, 4647151865492930560,
139801391083316, 0, 3864, 0, 139801345901552, 139801345901440, 139801390807232, 0, 139801345901440, 3840, 638, 24, 139801343553632, 139801345901440, 139801345975368}}
gcsteps = 0
ret = <optimized out>
cts = 0x7f260990e448
ct = 0x39f0
sz = <optimized out>
ret = <optimized out>
cts = 0x7f260990e448
cd = <optimized out>
id = 618
ct = <optimized out>
tv = <optimized out>
mm = MM_call
No symbol table info available.
g = 0x7f26098fc3f0
oldh = 0 '\000'
ef = 88
status = <optimized out>
status = <optimized out>
base = 10
old_pool = 0x0
status = <optimized out>
rv = <optimized out>
cur = <optimized out>
prev = <optimized out>
i = <optimized out>
conf = {name = 0x0, args = 0x0, cycle = 0x7f2609d9fa90, pool = 0x7f2609d4ec90, temp_pool = 0x0, conf_file = 0x7ffd99c38060, log = 0x7f2609d4ecf8, ctx = 0x7ffd99c37fe0, module_type = 0, cmd_type = 0, handler = 0x0, handler_conf = 0x0}
cf_file = {file = {fd = 0, name = {len = 21, data = 0x7f2609d4ef9d "/etc/nginx/nginx.conf"}, info = {st_dev = 0, st_ino = 0, st_nlink = 0, st_mode = 0, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 0, st_blksize = 0, st_blocks = 0,
st_atim = {tv_sec = 0, tv_nsec = 0}, st_mtim = {tv_sec = 0, tv_nsec = 0}, st_ctim = {tv_sec = 0, tv_nsec = 0}, __unused = {0, 0, 0}}, offset = 0, sys_offset = 0, log = 0x0, thread_handler = 0x0, thread_ctx = 0x0, thread_task = 0x0,
aio = 0x0, valid_info = 0, directio = 0}, buffer = 0x0, dump = 0x0, line = 0}
fake_cycle = <optimized out>
modules = <optimized out>
set = {__bits = {0, 94439351196560, 402745863, 0, 0, 0, 140727183180352, 140727183180416, 1, 139801391508357, 6, 16, 1, 139801391412005, 0, 139801185484816}}
n = <optimized out>
tp = <optimized out>
i = 67
cpu_affinity = <optimized out>
rlmt = {rlim_cur = 523264, rlim_max = 523264}
ccf = <optimized out>
ls = <optimized out>
worker = <optimized out>
on = 1
pid = 0
s = 1
i = 1
n = <optimized out>
live = 1
ch = {command = 2, pid = 1143, slot = 1, fd = -1}
ccf = <optimized out>
title = <optimized out>
p = <optimized out>
size = <optimized out>
i = <optimized out>
sigio = 0
set = {__bits = {0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 139801386582552, 0, 0, 0, 139801391630023}}
itv = {it_interval = {tv_sec = 0, tv_usec = 0}, it_value = {tv_sec = 0, tv_usec = 0}}
live = 1
delay = 0
ccf = 0x7f2609d50218
b = <optimized out>
log = 0x55e45ff8d860 <ngx_log>
i = <optimized out>
cycle = 0x7f260bfc8200
init_cycle = {conf_ctx = 0x0, pool = 0x7f260bfed600, log = 0x55e45ff8d860 <ngx_log>, new_log = {log_level = 0, file = 0x0, connection = 0, disk_full_time = 0, handler = 0x0, data = 0x0, writer = 0x0, wdata = 0x0, action = 0x0, next = 0x0},
log_use_stderr = 0, files = 0x0, free_connections = 0x0, free_connection_n = 0, modules = 0x0, modules_n = 0, modules_used = 0, reusable_connections_queue = {prev = 0x0, next = 0x0}, reusable_connections_n = 0, connections_reuse_time = 0,
listening = {elts = 0x0, nelts = 0, size = 0, nalloc = 0, pool = 0x0}, paths = {elts = 0x0, nelts = 0, size = 0, nalloc = 0, pool = 0x0}, config_dump = {elts = 0x0, nelts = 0, size = 0, nalloc = 0, pool = 0x0}, config_dump_rbtree = {
root = 0x0, sentinel = 0x0, insert = 0x0}, config_dump_sentinel = {key = 0, left = 0x0, right = 0x0, parent = 0x0, color = 0 '\000', data = 0 '\000'}, open_files = {last = 0x0, part = {elts = 0x0, nelts = 0, next = 0x0}, size = 0,
nalloc = 0, pool = 0x0}, shared_memory = {last = 0x0, part = {elts = 0x0, nelts = 0, next = 0x0}, size = 0, nalloc = 0, pool = 0x0}, connection_n = 0, files_n = 0, connections = 0x0, read_events = 0x0, write_events = 0x0, old_cycle = 0x0,
conf_file = {len = 21, data = 0x7ffd99c3985c "l/nginx/sbin/nginx -c /etc/nginx/nginx.conf"}, conf_param = {len = 0, data = 0x0}, conf_prefix = {len = 11, data = 0x7ffd99c3985c "l/nginx/sbin/nginx -c /etc/nginx/nginx.conf"}, prefix = {
len = 17, data = 0x55e45ff0e19f "/usr/local/nginx/"}, error_log = {len = 24, data = 0x55e45ff0e1c7 "/var/log/nginx/error.log"}, lock_file = {len = 0, data = 0x0}, hostname = {len = 0, data = 0x0}}
cd = <optimized out>
ccf = <optimized out>
@derek-burdick Thanks!
This shows it's crashing in the resty.balancer module: https://github.com/openresty/lua-resty-balancer/blob/master/chash.c#L146
I'll provide more steps soon.
(gdb) info symbol *(long *)$rbx
chash_point_sort in section .text of /usr/local/lib/lua/librestychash.so
@derek-burdick Please follow these steps, thanks!
Get the arguments for the chash_point_sort function call:
(gdb) frame 3
(gdb) p ((CCallState *)$rbx)->gpr[1]
(gdb) p (chash_point_t *)((CCallState *)$rbx)->gpr[0]
(gdb) p (*(chash_point_t *)((CCallState *)$rbx)->gpr[0])@(((CCallState *)$rbx)->gpr[1])
Get the assert failure reason in the get_nominal_size function call, according to musl-libc: https://git.musl-libc.org/cgit/musl/tree/src/malloc/mallocng/meta.h#n159
(gdb) frame 1
(gdb) p ((unsigned char *) p) [-3]
(gdb) p ((unsigned char *) p) [-3] >> 5
(gdb) p *(const uint32_t *)(end-4)
(gdb) p end[-5]
(gdb) p *end
Also, please check more of the other core files to see whether they all crashed in the same way. It may be memory corruption caused elsewhere that just happens to crash in chash_point_sort.
Thanks!
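(If it helps, the same expressions could be collected in one gdb batch run against the core; the paths are placeholders and the frame numbers only apply to a backtrace shaped like the sample above, and the casts need debug symbols to resolve.)
```sh
gdb -batch \
    -ex 'frame 3' \
    -ex 'p ((CCallState *)$rbx)->gpr[1]' \
    -ex 'p (chash_point_t *)((CCallState *)$rbx)->gpr[0]' \
    -ex 'frame 1' \
    -ex 'p ((unsigned char *) p) [-3]' \
    -ex 'p ((unsigned char *) p) [-3] >> 5' \
    -ex 'p *(const uint32_t *)(end-4)' \
    -ex 'p end[-5]' \
    -ex 'p *end' \
    /usr/local/nginx/sbin/nginx /tmp/core.nginx
```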
Oh, it seems there is a bug in chash_point_sort; I will verify it soon. The information from the previous steps is still welcome.
OK, I think I figured out the bug. I will fix it and cut a new lua-resty-balancer release soon.
In short, there is a memory corruption bug in resty.balancer, but we usually do not hit it.
It surfaces because musl libc changed its malloc implementation in musl 1.2.1 (https://git.musl-libc.org/cgit/musl/commit/?h=v1.2.1&id=73cc775bee53300c7cf759f37580220b18ac13d3), which was first included in Alpine 3.13.0 (https://www.alpinelinux.org/posts/Alpine-3.13.0-released.html), which was first included in ingress-nginx v0.44.0.
The information from the previous steps is still welcome, just in case I'm wrong.
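(For context, the Alpine and musl versions shipped in a given controller image can be checked directly; the tags below are just examples, and the musl dynamic loader prints its own version when invoked with no arguments.)
```sh
for tag in v0.43.0 v0.44.0; do
  echo "== controller $tag =="
  docker run --rm --entrypoint sh \
    k8s.gcr.io/ingress-nginx/controller:$tag -c \
    'cat /etc/alpine-release; /lib/ld-musl-x86_64.so.1 2>&1 | head -n 2'
done
```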
@doujiang24 I'm really glad and thankful for all your help and time on this issue!!
Btw, do you think that for this and future cases moving to a glibc-based distro would be better, also in terms of OpenResty compatibility?
Thanks once more!
@rikatz
I think using musl-libc should be fine. OpenResty will keep compatibility with musl-libc as always.
It's a bug in resty-balancer, and the new malloc implementation in musl-libc just makes it easier to reproduce. It may happen with other malloc implementations in theory (just with a very, very low probability).
@rikatz I have fixed the bug and tagged a new release, v0.04: https://github.com/openresty/lua-resty-balancer/releases/tag/v0.04
I think it's good to push a new ingress-nginx image.
Glad this issue can be fixed. Also, please feel free to contact me if the issue is not fixed or there are any other bugs. Thank you all!
Thank you. Let me upgrade it.
/reopen
@rikatz: Reopened this issue.
We have released v0.49.2 and v1.0.2 with the fix.
Can you all please test and give us some feedback?
Thank you so much, especially @doujiang24 and @tao12345666333, for the help!
Thank you for the prompt fix! I've tested k8s.gcr.io/ingress-nginx/controller:v0.49.2@sha256:84e351228337bb7b09f0e90e6b6f5e2f8f4cf7d618c1ddc1e966f23902d273db
and unfortunately worker process 71 exited on signal 11 continues to happen, though now about half as often.
Also, it does not seem to be Lua-related now:
Core was generated by `nginx: worker process '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 get_meta (p=p@entry=0x7f5bfc59b550 "") at src/malloc/mallocng/meta.h:141
141 src/malloc/mallocng/meta.h: No such file or directory.
[Current thread is 1 (LWP 71)]
(gdb) backtrace
#0 get_meta (p=p@entry=0x7f5bfc59b550 "") at src/malloc/mallocng/meta.h:141
#1 0x00007f5bfdf36c2f in __libc_free (p=0x7f5bfc59b550) at src/malloc/mallocng/free.c:105
#2 0x00007f5bfdc7dd0d in OPENSSL_LH_doall_arg () from /lib/libcrypto.so.1.1
#3 0x00007f5bfddeb6d0 in SSL_CTX_flush_sessions () from /lib/libssl.so.1.1
#4 0x00007f5bfde01ad3 in ?? () from /lib/libssl.so.1.1
#5 0x00007f5bfddf5fb4 in ?? () from /lib/libssl.so.1.1
#6 0x000055a5517ca678 in ngx_ssl_handshake (c=c@entry=0x7f5bfd670728) at src/event/ngx_event_openssl.c:1720
#7 0x000055a5517caa91 in ngx_ssl_handshake_handler (ev=0x7f5bfd4d7e90) at src/event/ngx_event_openssl.c:2091
#8 0x000055a5517c5103 in ngx_epoll_process_events (cycle=0x7f5bfbf927d0, timer=<optimized out>, flags=<optimized out>) at src/event/modules/ngx_epoll_module.c:901
#9 0x000055a5517b8140 in ngx_process_events_and_timers (cycle=cycle@entry=0x7f5bfbf927d0) at src/event/ngx_event.c:257
#10 0x000055a5517c27c8 in ngx_worker_process_cycle (cycle=0x7f5bfbf927d0, data=<optimized out>) at src/os/unix/ngx_process_cycle.c:753
#11 0x000055a5517c0850 in ngx_spawn_process (cycle=cycle@entry=0x7f5bfbf927d0, proc=proc@entry=0x55a5517c26bf <ngx_worker_process_cycle>, data=data@entry=0x0,
name=name@entry=0x55a5518e7f97 "worker process", respawn=respawn@entry=-4) at src/os/unix/ngx_process.c:199
#12 0x000055a5517c149e in ngx_start_worker_processes (cycle=cycle@entry=0x7f5bfbf927d0, n=1, type=type@entry=-4) at src/os/unix/ngx_process_cycle.c:373
#13 0x000055a5517c3369 in ngx_master_process_cycle (cycle=0x7f5bfbf927d0, cycle@entry=0x7f5bfda95200) at src/os/unix/ngx_process_cycle.c:234
#14 0x000055a551795af9 in main (argc=<optimized out>, argv=<optimized out>) at src/core/nginx.c:386
This is without libmimalloc.so
Hi @sepich, questions:
Are you using TLS/SSL for mutual auth, or just for a secure endpoint?
What's your load currently, and how many Ingress objects do you have?
Thanks!
NGINX Ingress controller version: 0.44.0
Kubernetes version (use kubectl version): 1.18.3
Environment:
What happened: We encountered a major production outage a few days ago that was traced back to ingress-nginx. ingress-nginx pod logs were filled with messages like the following:
We discovered that the following message was being printed to the system log as well (timed with the worker exits):
I ultimately identified that whatever was occurring was linked to the version of ingress-nginx we were using and reverted production to 0.43.0 until we could identify the underlying issue.
We have a few other lower-load ingress-nginx deployments that have remained at 0.44.0 and have observed apparently random worker crashes; however, there are always enough running workers, and the crashes are infrequent enough that things seemingly remain stable.
I was able to get a worker coredump from one of those infrequent crashes and the backtrace is as follows:
One of the major differences between 0.43.0 and 0.44.0 is the update to Alpine 3.13; perhaps the version of musl in use is the issue, and it would be appropriate to revert that change until Alpine has released a fixed version?
/kind bug