redboxllc / scuttle

A wrapper for applications to help with running Istio Sidecars
MIT License

Unknown SIGURG (urgent I/O condition) signal causing Scuttle / Istio to die #44

Closed · brettplarson opened this issue 3 years ago

brettplarson commented 3 years ago

We are seeing a strange issue with scuttle in which a mysterious "urgent I/O condition" signal is sent to scuttle, causing the Envoy / Istio sidecar to stop as soon as it has started. The issue is summarized by the lines below: the first shows that istio-proxy is ready, the second shows scuttle acknowledging this, the third shows the signal arriving, and the rest show everything quitting.

2021-03-25 11:06:37.934 istio-proxy 2021-03-25T11:06:37.934856Z info Envoy proxy is ready 
2021-03-25 11:06:38.267 e24a559ffa452bb7e284a4f3690fa5d3    2021-03-25T11:06:38Z scuttle: Blocking finished, Envoy has started 
2021-03-25 11:06:38.295 e24a559ffa452bb7e284a4f3690fa5d3    2021-03-25T11:06:38Z scuttle: Received signal 'urgent I/O condition', exiting 
2021-03-25 11:06:38.295 e24a559ffa452bb7e284a4f3690fa5d3    2021-03-25T11:06:38Z scuttle: Kill received: (Action: Stopping Istio with API, Reason: ISTIO_QUIT_API is set, Exit Code: 1) 
2021-03-25 11:06:38.295 e24a559ffa452bb7e284a4f3690fa5d3    2021-03-25T11:06:38Z scuttle: Stopping Istio using Istio API 'http://127.0.0.1:15000' (intended for Istio >v1.2) 
2021-03-25 11:06:38.316 e24a559ffa452bb7e284a4f3690fa5d3    2021-03-25T11:06:38Z scuttle: Sent quitquitquit to Istio, status code: 200 

Here is a full export of the logs from the node and container, as well as our istiod pods:

@timestamp  Node / container   Log
2021-03-25 11:06:21.123 ec2node-x   ena 0000:00:07.0 eth2: Local page cache is disabled for less than 16 channels
2021-03-25 11:06:26.893 ec2node-x   http: superfluous response.WriteHeader call from github.com/docker/docker/api/server/httputils.WriteJSON (httputils_write_json.go:11) 
2021-03-25 11:06:32.581 ec2node-x   I0325 11:06:32.581736    7921 setters.go:77] Using node IP: "10.234.36.215"
2021-03-25 11:06:33.862 ec2node-x   {"level":"info","ts":"2021-03-25T11:06:33.862Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"CNI Plugin version: v1.7.5 ..."}
2021-03-25 11:06:33.933 ec2node-x   I0325 11:06:33.932823    7921 prober.go:124] Readiness probe for "e24a559ffa452bb7e284a4f3690fa5d3_namespace(4e64d45d-52ea-4c19-a4b2-cda5b1b72c09):istio-proxy" failed (failure): Get http://100.66.110.107:15021/healthz/ready: dial tcp 100.66.110.107:15021: connect: connection refused
2021-03-25 11:06:35.933 ec2node-x   I0325 11:06:35.932833    7921 prober.go:124] Readiness probe for "e24a559ffa452bb7e284a4f3690fa5d3_namespace(4e64d45d-52ea-4c19-a4b2-cda5b1b72c09):istio-proxy" failed (failure): Get http://100.66.110.107:15021/healthz/ready: dial tcp 100.66.110.107:15021: connect: connection refused
2021-03-25 11:06:36.736 istio-proxy 2021-03-25T11:06:36.735981Z warning envoy runtime Unable to use runtime singleton for feature envoy.http.headermap.lazy_map_min_size 
2021-03-25 11:06:36.736 istio-proxy 2021-03-25T11:06:36.736016Z warning envoy runtime Unable to use runtime singleton for feature envoy.http.headermap.lazy_map_min_size 
2021-03-25 11:06:36.736 istio-proxy 2021-03-25T11:06:36.736506Z warning envoy runtime Unable to use runtime singleton for feature envoy.http.headermap.lazy_map_min_size 
2021-03-25 11:06:36.736 istio-proxy 2021-03-25T11:06:36.736543Z warning envoy runtime Unable to use runtime singleton for feature envoy.http.headermap.lazy_map_min_size 
2021-03-25 11:06:36.765 istio-proxy 2021-03-25T11:06:36.765549Z info xdsproxy Envoy ADS stream established 
2021-03-25 11:06:36.765 istio-proxy 2021-03-25T11:06:36.765651Z info xdsproxy connecting to upstream XDS server: istiod.istio-system.svc:15012 
2021-03-25 11:06:36.767 istio-proxy 2021-03-25T11:06:36.767262Z warning envoy main there is no configured limit to the number of allowed active connections. Set a limit via the runtime key overload.global_downstream_max_connections 
2021-03-25 11:06:36.777 discovery   2021-03-25T11:06:36.777047Z info ads ADS: new connection for node:sidecar~100.66.110.107~e24a559ffa452bb7e284a4f3690fa5d3.namespace~namespace.svc.cluster.local-5 
2021-03-25 11:06:36.780 discovery   2021-03-25T11:06:36.780741Z info ads CDS: PUSH for node:e24a559ffa452bb7e284a4f3690fa5d3.namespace resources:196 
2021-03-25 11:06:36.846 discovery   2021-03-25T11:06:36.846468Z info ads EDS: PUSH for node:e24a559ffa452bb7e284a4f3690fa5d3.namespace resources:133 empty:0 cached:133/133 
2021-03-25 11:06:36.854 istio-proxy 2021-03-25T11:06:36.854774Z info sds resource:default new connection 
2021-03-25 11:06:36.854 istio-proxy 2021-03-25T11:06:36.854786Z info sds resource:ROOTCA new connection 
2021-03-25 11:06:36.854 istio-proxy 2021-03-25T11:06:36.854832Z info sds Skipping waiting for gateway secret 
2021-03-25 11:06:36.854 istio-proxy 2021-03-25T11:06:36.854837Z info sds Skipping waiting for gateway secret 
2021-03-25 11:06:36.933 istio-proxy 2021-03-25T11:06:36.933776Z info cache Root cert has changed, start rotating root cert for SDS clients 
2021-03-25 11:06:36.933 istio-proxy 2021-03-25T11:06:36.933801Z info cache GenerateSecret default 
2021-03-25 11:06:36.934 istio-proxy 2021-03-25T11:06:36.934135Z info sds resource:default pushed key/cert pair to proxy 
2021-03-25 11:06:37.055 istio-proxy 2021-03-25T11:06:37.054956Z info cache Loaded root cert from certificate ROOTCA 
2021-03-25 11:06:37.055 istio-proxy 2021-03-25T11:06:37.055140Z info sds resource:ROOTCA pushed root cert to proxy 
2021-03-25 11:06:37.083 discovery   2021-03-25T11:06:37.082941Z info ads LDS: PUSH for node:e24a559ffa452bb7e284a4f3690fa5d3.namespace resources:163 
2021-03-25 11:06:37.336 discovery   2021-03-25T11:06:37.332927Z info ads RDS: PUSH for node:e24a559ffa452bb7e284a4f3690fa5d3.namespace resources:60 
2021-03-25 11:06:37.934 istio-proxy 2021-03-25T11:06:37.934856Z info Envoy proxy is ready 
2021-03-25 11:06:38.267 e24a559ffa452bb7e284a4f3690fa5d3    2021-03-25T11:06:38Z scuttle: Blocking finished, Envoy has started 
2021-03-25 11:06:38.295 e24a559ffa452bb7e284a4f3690fa5d3    2021-03-25T11:06:38Z scuttle: Received signal 'urgent I/O condition', exiting 
2021-03-25 11:06:38.295 e24a559ffa452bb7e284a4f3690fa5d3    2021-03-25T11:06:38Z scuttle: Kill received: (Action: Stopping Istio with API, Reason: ISTIO_QUIT_API is set, Exit Code: 1) 
2021-03-25 11:06:38.295 e24a559ffa452bb7e284a4f3690fa5d3    2021-03-25T11:06:38Z scuttle: Stopping Istio using Istio API 'http://127.0.0.1:15000' (intended for Istio >v1.2) 
2021-03-25 11:06:38.316 e24a559ffa452bb7e284a4f3690fa5d3    2021-03-25T11:06:38Z scuttle: Sent quitquitquit to Istio, status code: 200 
2021-03-25 11:06:38.318 discovery   2021-03-25T11:06:38.318887Z info ads ADS: "100.66.110.107:58984" sidecar~100.66.110.107~e24a559ffa452bb7e284a4f3690fa5d3.namespace~namespace.svc.cluster.local-5 terminated with stream closed 
2021-03-25 11:06:38.318 istio-proxy 2021-03-25T11:06:38.318183Z warning envoy config StreamAggregatedResources gRPC config stream closed: 13,  
2021-03-25 11:06:38.318 istio-proxy 2021-03-25T11:06:38.318470Z info xdsproxy disconnected from XDS server: istiod.istio-system.svc:15012 
2021-03-25 11:06:38.319 istio-proxy 2021-03-25T11:06:38.319013Z warning envoy config StreamSecrets gRPC config stream closed: 13,  
2021-03-25 11:06:38.319 istio-proxy 2021-03-25T11:06:38.319029Z warning envoy config StreamSecrets gRPC config stream closed: 13,  
2021-03-25 11:06:38.319 istio-proxy 2021-03-25T11:06:38.319067Z info sds resource:ROOTCA connection is terminated: rpc error: code = Canceled desc = context canceled 
2021-03-25 11:06:38.319 istio-proxy 2021-03-25T11:06:38.319086Z error sds Remote side closed connection 
2021-03-25 11:06:38.319 istio-proxy 2021-03-25T11:06:38.319067Z info sds resource:default connection is terminated: rpc error: code = Canceled desc = context canceled 
2021-03-25 11:06:38.319 istio-proxy 2021-03-25T11:06:38.319115Z error sds Remote side closed connection 
2021-03-25 11:06:38.402 istio-proxy 2021-03-25T11:06:38.402249Z info Epoch 0 exited normally 
2021-03-25 11:06:38.402 istio-proxy 2021-03-25T11:06:38.402272Z info No more active epochs, terminating 
2021-03-25 11:06:38.460 ec2node-x   time="2021-03-25T11:06:38.460784673Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
2021-03-25 11:06:38.680 e24a559ffa452bb7e284a4f3690fa5d3    2021-03-25T11:06:38Z scuttle: Received signal 'urgent I/O condition', passing to child 
2021-03-25 11:06:38.872 ec2node-x   {"level":"info","ts":"2021-03-25T11:06:38.872Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"CNI Plugin version: v1.7.5 ..."}
2021-03-25 11:06:38.991 ec2node-x   I0325 11:06:38.990897    7921 kubelet.go:1960] SyncLoop (PLEG): "e24a559ffa452bb7e284a4f3690fa5d3_namespace(4e64d45d-52ea-4c19-a4b2-cda5b1b72c09)", event: &pleg.PodLifecycleEvent{ID:"4e64d45d-52ea-4c19-a4b2-cda5b1b72c09", Type:"ContainerDied", Data:"80a75a8a226d881053607260c8837ac9c1627a888b1443aa4186528c22219898"}
2021-03-25 11:06:38.991 ec2node-x   I0325 11:06:38.991714    7921 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: 80a75a8a226d881053607260c8837ac9c1627a888b1443aa4186528c22219898

We are using Scuttle v1.3.1.

My understanding of this signal, from the man pages, is:

SIGURG       P2001      Ign     Urgent condition on socket (4.2BSD)

I am trying to gather as much information as I can to understand this issue, but ultimately I would like to know: is it possible to ignore this signal, or to increase logging to determine where it is coming from?

Any background on why this would occur is appreciated.

Please let me know, Thank you!

linjmeyer commented 3 years ago

I dug through our logs internally and we have several of these too, but they are happening after the child process starts. If the child process has started, Scuttle just passes the signal along; if it hasn't started, Scuttle exits, which is the issue you are seeing.

I don't really know what the point of SIGURG is; from some Googling plus that man page you linked, it seems like the default behaviour is to ignore the signal anyway.

Prior to the child process starting, I think Scuttle really only needs to handle SIGINT/SIGTERM, and anything else can be ignored. K8s sends SIGTERM to start a graceful shutdown before falling back to SIGKILL (which can't be handled and will always kill Scuttle). So to answer your question, I think limiting the handled signals to SIGINT/SIGTERM should solve your issue.
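Roughly what I have in mind (just a sketch of the idea, not Scuttle's actual code): only subscribe to the shutdown signals while waiting for the child, so anything else, including SIGURG, keeps its default disposition and never reaches the handler.

package main

import (
    "fmt"
    "os"
    "os/signal"
    "syscall"
)

func main() {
    // Subscribe only to the signals we actually want to act on while
    // waiting for Envoy / the child process. Signals not listed here
    // (e.g. the Go runtime's SIGURG) keep their default disposition
    // and never show up on this channel.
    stop := make(chan os.Signal, 1)
    signal.Notify(stop, syscall.SIGINT, syscall.SIGTERM)

    sig := <-stop
    fmt.Printf("wrapper: received %v, starting graceful shutdown\n", sig)
    // ... ask Envoy to quit, then exit ...
}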

I'm curious though, is this happening a lot? Is it always the same app?

brettplarson commented 3 years ago

@linjmeyer - Thanks for the PR!!

It looks like Golang uses SIGURG for preemption. I am not a Go expert and not a developer, but I suspect this signal comes up a lot. Here is the relevant comment in the Go runtime source:

// sigPreempt is the signal used for non-cooperative preemption.
// [...]
// We use SIGURG because it meets all of these criteria, is extremely
// unlikely to be used by an application for its "real" meaning (both
// because out-of-band data is basically unused and because SIGURG
// doesn't report which socket has the condition, making it pretty
// useless), and even if it is, the application has to be ready for
// spurious SIGURG. SIGIO wouldn't be a bad choice either, but is more
// likely to be used for real.
const sigPreempt = _SIGURG
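From what I read, any Go program that subscribes to every signal (which I assume is what scuttle does while it waits) will observe these runtime-generated SIGURGs. I'm not a developer, so treat this as an untested toy reproduction rather than anything from Scuttle, but something like this should show them on Go 1.14+:

package main

import (
    "fmt"
    "os"
    "os/signal"
)

func main() {
    sigs := make(chan os.Signal, 64)
    // Calling Notify with no signal list relays *all* incoming signals,
    // including the SIGURG the runtime sends to its own threads for
    // asynchronous preemption.
    signal.Notify(sigs)

    // A few tight, CPU-bound loops to give the scheduler a reason to
    // preempt goroutines asynchronously.
    for i := 0; i < 4; i++ {
        go func() {
            x := 0
            for {
                x++
                if x == 1<<30 {
                    x = 0
                }
            }
        }()
    }

    for sig := range sigs {
        fmt.Println("received signal:", sig)
    }
}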

Here is another GitHub issue thread that helped explain the issue to me. It seems to be a core design choice in Go to use this signal, and it may not even be coming from outside the process (the runtime sends it to itself).

It looks like containerd also removed this signal in a recent PR.

In the last hour I can see it happening around 104 times across our main production environments, though I'm not sure that helps you get a grasp of the issue.

The two instances I looked into showed that the Istio sidecar was started and then killed, and the workload was launched afterwards. The workload got connection refused trying to talk to our services, even though our services were up.

It's possible that something else is at play, as I would expect to see a lot more angry users if this happened every few seconds, but we do run a lot of ephemeral pods and it may simply not be affecting certain workloads.

Happy to look into this more next week or guinea pig the fix :smile_cat:

linjmeyer commented 3 years ago

Nice finds! Pretty interesting. I'm thinking a lot, if not all, of them are from the Go runtime using that signal for its own purposes then. I think that PR is still the best way forward. I can say from looking at our logs internally that the Python/.NET runtimes don't seem to mind the signals being passed in, so I assume they ignore them. We could do an additional PR to filter them out completely if we find it causes issues with the child proc.
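If we do go that route, the follow-up would basically be a filter in the forwarding loop. A rough sketch of the idea (illustrative names, not the real Scuttle code):

package main

import (
    "log"
    "os"
    "os/exec"
    "os/signal"
    "syscall"
)

func main() {
    // Stand-in for the wrapped application; in a real wrapper this
    // would be the command passed on the command line.
    cmd := exec.Command("sleep", "60")
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr
    if err := cmd.Start(); err != nil {
        log.Fatal(err)
    }

    sigs := make(chan os.Signal, 64)
    signal.Notify(sigs) // pass-through wrapper: watch everything

    go func() {
        for sig := range sigs {
            if sig == syscall.SIGURG {
                // Runtime preemption noise: drop it instead of
                // forwarding it to the child.
                continue
            }
            _ = cmd.Process.Signal(sig)
        }
    }()

    _ = cmd.Wait()
}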

JoeJasinski commented 3 years ago

Thank you @linjmeyer and @brettplarson !

linjmeyer commented 3 years ago

@brettplarson v1.3.4 was just released with that bugfix, can you give it a try and let us know if you still have the issue? Thanks!