kubernetes / ingress-nginx

Ingress NGINX Controller for Kubernetes
https://kubernetes.github.io/ingress-nginx/
Apache License 2.0

segfault due to opentracing #2222

Closed. trnl closed this issue 6 years ago.

trnl commented 6 years ago

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/.): no

What keywords did you search in NGINX Ingress controller issues before filing this one? (If you have found any duplicates, you should instead reply there.): nginx, opentracing, zipkin, segfault

Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT

NGINX Ingress controller version: 0.10.2

Kubernetes version (use kubectl version):

> k version
Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.6", GitCommit:"4bc5e7f9a6c25dc4c03d4d656f2cefd21540e28c", GitTreeState:"clean", BuildDate:"2017-09-15T22:14:38Z", GoVersion:"go1.9", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.12", GitCommit:"3bda299a6414b4866f179921610d6738206a18fe", GitTreeState:"clean", BuildDate:"2017-12-29T08:39:49Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

Environment:

What happened: worker process exiting due to segfault

What you expected to happen: no segfaults, no worker restarts

How to reproduce it (as minimally and precisely as possible): Happens almost every 2 seconds

{"log":"2018/03/20 18:34:54 [alert] 23#23: worker process 11075 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:34:54.142872722Z"}
{"log":"2018/03/20 18:34:54 [alert] 23#23: worker process 11109 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:34:54.517182591Z"}
{"log":"2018/03/20 18:34:57 [alert] 23#23: worker process 11143 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:34:57.923391335Z"}
{"log":"2018/03/20 18:34:58 [alert] 23#23: worker process 11178 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:34:58.194986234Z"}
{"log":"2018/03/20 18:34:58 [alert] 23#23: worker process 11212 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:34:58.557907914Z"}
{"log":"2018/03/20 18:34:58 [alert] 23#23: worker process 11246 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:34:58.816366263Z"}
{"log":"2018/03/20 18:34:59 [alert] 23#23: worker process 11280 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:34:59.19434845Z"}
{"log":"2018/03/20 18:35:02 [alert] 23#23: worker process 11314 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:35:02.943432758Z"}
{"log":"2018/03/20 18:35:03 [alert] 23#23: worker process 11349 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:35:03.206377447Z"}
{"log":"2018/03/20 18:35:03 [alert] 23#23: worker process 11383 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:35:03.56327926Z"}
{"log":"2018/03/20 18:35:03 [alert] 23#23: worker process 11417 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:35:03.81832701Z"}
{"log":"2018/03/20 18:35:08 [alert] 23#23: worker process 11451 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:35:08.92230984Z"}
{"log":"2018/03/20 18:35:08 [alert] 23#23: worker process 11486 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:35:08.979320448Z"}
{"log":"2018/03/20 18:35:13 [alert] 23#23: worker process 11520 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:35:13.231301619Z"}
{"log":"2018/03/20 18:35:13 [alert] 23#23: worker process 11555 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:35:13.791207004Z"}
{"log":"2018/03/20 18:35:14 [alert] 23#23: worker process 11589 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:35:14.156415018Z"}
{"log":"2018/03/20 18:35:14 [alert] 23#23: worker process 11623 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:35:14.5254932Z"}
{"log":"2018/03/20 18:35:18 [alert] 23#23: worker process 11657 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:35:18.272065288Z"}
{"log":"2018/03/20 18:35:18 [alert] 23#23: worker process 11692 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:35:18.53148589Z"}
{"log":"2018/03/20 18:35:19 [alert] 23#23: worker process 11726 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:35:19.517368566Z"}

Anything else we need to know:

> tail -f nginx-ingress-controller-3228907275-p158b_kube-system_nginx-ingress-controller-45af43a95ce0ff5989dfe5783d5457214952edc83dc5de1e6745b436c7c5e23b.log | grep 'opentracing' -C3
2018/03/20 17:51:09 [debug] 23719#23719: *680808 http script set $service_name
2018/03/20 17:51:09 [debug] 23719#23719: *680808 post rewrite phase: 4
2018/03/20 17:51:09 [debug] 23719#23719: *680808 generic phase: 5
2018/03/20 17:51:09 [debug] 23719#23719: *680808 extraced opentracing span context from request 00007FE65CCFD050
2018/03/20 17:51:09 [debug] 23719#23719: *680808 starting opentracing request span for 00007FE65CCFD050
2018/03/20 17:51:09 [debug] 23719#23719: *680808 starting opentracing location span for \"/\"(00007FE651FDDD00) in request 00007FE65CCFD050
2018/03/20 17:51:09 [debug] 23719#23719: *680808 injecting opentracing span context from request 00007FE65CCFD050
2018/03/20 17:51:09 [debug] 23719#23719: *680808 adding opentracing header \"x-b3-traceid:09e429b7d83c3090\" in request 00007FE65CCFD050
2018/03/20 17:51:09 [alert] 23#23: worker process 23719 exited on signal 11 (core dumped)
2018/03/20 17:51:09 [debug] 23753#23753: epoll add event: fd:13 op:1 ev:00002001
2018/03/20 17:51:09 [debug] 23753#23753: epoll add event: fd:14 op:1 ev:00002001

Analysis of the core dump shows that the issue is in the OpenTracing insert_header block:

#0  0x0000561139f929ec in ngx_list_push ()
#1  0x00007f7a1adbb6ab in ngx_opentracing::inject_span_context(opentracing::v1::Tracer const&, ngx_http_request_s*, opentracing::v1::SpanContext const&) () from /etc/nginx/modules/ngx_http_opentracing_module.so
#2  0x00007f7a1adbcf81 in ngx_opentracing::OpenTracingRequestInstrumentor::OpenTracingRequestInstrumentor(ngx_http_request_s*, ngx_http_core_loc_conf_s*, ngx_opentracing::opentracing_loc_conf_t*) () from /etc/nginx/modules/ngx_http_opentracing_module.so
#3  0x00007f7a1adbdce6 in ngx_opentracing::on_enter_block(ngx_http_request_s*) () from /etc/nginx/modules/ngx_http_opentracing_module.so
#4  0x000056113a097a33 in ngx_http_core_generic_phase ()
#5  0x000056113a093645 in ngx_http_core_run_phases ()
#6  0x000056113a08f03e in ngx_http_process_request ()
#7  0x000056113a09068c in ngx_http_process_request_line ()
#8  0x000056113a0a6be1 in ngx_epoll_process_events.lto_priv ()
#9  0x000056113a0c45df in ngx_process_events_and_timers ()
#10 0x000056113a0a8405 in ngx_worker_process_cycle ()
#11 0x000056113a0be07f in ngx_spawn_process ()
#12 0x000056113a0a9715 in ngx_master_process_cycle ()
#13 0x0000561139f8a654 in main ()
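
For context on the top frames of that trace, here is a minimal, illustrative sketch of how an nginx module typically appends a request header via ngx_list_push (this is not the module's actual code; names and error handling are simplified). The trace shows the segfault inside exactly this call while inject_span_context is adding the x-b3-* headers:

```c
#include <ngx_config.h>
#include <ngx_core.h>
#include <ngx_http.h>

/* Illustrative only: append one header to the incoming request headers.
 * The backtrace above shows the crash inside ngx_list_push, called from
 * inject_span_context while the span context headers are being added. */
static ngx_int_t
add_request_header(ngx_http_request_t *r, ngx_str_t *key, ngx_str_t *value)
{
    ngx_table_elt_t  *h;

    /* Grow the headers_in list by one element (allocated from the list's pool). */
    h = ngx_list_push(&r->headers_in.headers);
    if (h == NULL) {
        return NGX_ERROR;
    }

    h->hash = 1;
    h->key = *key;
    h->value = *value;
    h->lowcase_key = key->data;  /* assumes the key is already lowercase */

    return NGX_OK;
}
```
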
aledbf commented 6 years ago

@rnburn can you help with this?

rnburn commented 6 years ago

Yep. Will take a look.

aledbf commented 6 years ago

@trnl thank you for the detailed report.

trnl commented 6 years ago

On another cluster I can't reproduce it, and the only difference I noticed is the kernel version:

> uname -a
Linux ... 3.10.0-693.2.2.el7.x86_64 #1 SMP Sat Sep 9 03:55:24 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
trnl commented 6 years ago

@rnburn, is there any additional information I can provide?

rnburn commented 6 years ago

@trnl - The information is good. I can tell it's crashing here. I haven't had a chance to investigate further yet, but will try to fix it this week.

rnburn commented 6 years ago

I looked into this, and I couldn't find any nginx documentation that says you can't modify headers_in the way this line of code does, but I asked on nginx-devel to see if there's anything I'm missing.

A smaller, isolated example that reproduces the problem, or a core dump, would help narrow the problem down.

aledbf commented 6 years ago

@rnburn that was fast http://mailman.nginx.org/pipermail/nginx-devel/2018-March/010988.html

aledbf commented 6 years ago

@rnburn any update on this?

rnburn commented 6 years ago

Hey @aledbf - I plan to put in a fix based on this answer I got on the nginx mailing list: http://mailman.nginx.org/pipermail/nginx-devel/2018-March/011008.html

rnburn commented 6 years ago

@aledbf - I put in https://github.com/opentracing-contrib/nginx-opentracing/pull/37 with a fix for this.

The fix required some reorganization of the module. I removed the vendor-specific modules in favor of loading the tracers dynamically with opentracing_load_tracer from a JSON representation of their configuration.

I can put in a PR to update ingress-nginx to do this.

I think this approach should work better for you, since you can use an embedded YAML configuration for the tracer without needing any code to handle a tracer's specific options.
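
For illustration only, the new loading model described above might look roughly like this in nginx.conf; the library path, config file path, and names below are placeholders, not values taken from this thread:

```nginx
# The vendor-specific nginx modules go away: the generic OpenTracing module
# is loaded once, and the tracer implementation becomes a plugin loaded at
# runtime together with its JSON configuration.
load_module modules/ngx_http_opentracing_module.so;

http {
    opentracing on;

    # opentracing_load_tracer takes the tracer plugin library and a JSON
    # file with that tracer's configuration.
    opentracing_load_tracer /usr/local/lib/libzipkin_opentracing_plugin.so
                            /etc/nginx/zipkin-tracer-config.json;
}
```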

aledbf commented 6 years ago

@rnburn thank you for the update; the new JSON file makes sense.

> I can put in a PR to update ingress-nginx to do this.

Sure. First let me update the nginx image (I am updating the modules): https://github.com/kubernetes/ingress-nginx/pull/2537

aledbf commented 6 years ago

@rnburn what's the format of the new JSON configuration for Zipkin?

rnburn commented 6 years ago

There's a schema file for the configuration: https://github.com/rnburn/zipkin-cpp-opentracing/blob/master/zipkin_opentracing/tracer_configuration.schema.json
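
For reference, a minimal configuration matching that schema might look like the following; the field values are placeholders, and the schema linked above is the authoritative source for the field names:

```json
{
  "service_name": "nginx-ingress",
  "collector_host": "zipkin.default.svc.cluster.local",
  "collector_port": 9411,
  "sample_rate": 1.0
}
```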

aledbf commented 6 years ago

@trnl can you test quay.io/aledbf/nginx-ingress-controller:0.382? It contains the fix for the bug you reported. Thanks in advance!