I dug through our logs internally and we have several of these too, but they are happening after the child process starts. If the child process has started, Scuttle just passes the signal through; if it hasn't started yet, Scuttle exits, which is the issue you are seeing.
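To illustrate that behaviour, here is a rough sketch (assumed structure for illustration only, not Scuttle's actual source): once the child exists, signals are passed through; before that, any signal makes the wrapper exit.

```go
package main

import (
	"os"
	"os/exec"
	"os/signal"
)

func main() {
	if len(os.Args) < 2 {
		os.Exit(2)
	}

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs) // no arguments: relay every incoming signal, including SIGURG

	cmd := exec.Command(os.Args[1], os.Args[2:]...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr

	started := make(chan struct{})
	go func() {
		for sig := range sigs {
			select {
			case <-started:
				// Child is running: just pass the signal along.
				_ = cmd.Process.Signal(sig)
			default:
				// Child has not started yet: exiting here on a stray
				// SIGURG is the failure mode described in this issue.
				os.Exit(1)
			}
		}
	}()

	if err := cmd.Start(); err != nil {
		os.Exit(1)
	}
	close(started)
	_ = cmd.Wait()
}
```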
I don't really know what the point of SIGURG is. From some Googling plus the man page you linked, it seems like the default behaviour is to ignore the signal anyway.
Prior to the child process starting, I think Scuttle really only needs to handle SIGINT, and anything else can be ignored. K8s will send a SIGINT to start a graceful shutdown before SIGKILL (which can't be handled and will always kill Scuttle). So to answer your question, I think limiting the signal handling to SIGINT should solve your issue.
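A rough sketch of what that narrower pre-start handling could look like (again not Scuttle's real code; `childStarted` and the sleep are stand-ins for launching the child): subscribe only to SIGINT until the child is up, so a stray SIGURG is never even delivered to our handler.

```go
package main

import (
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	childStarted := make(chan struct{})
	go func() {
		time.Sleep(2 * time.Second) // stand-in for launching Envoy / the workload
		close(childStarted)
	}()

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT) // only SIGINT reaches this channel before start

	select {
	case <-sigs:
		// A shutdown request arrived before the child ever started.
		os.Exit(1)
	case <-childStarted:
		// Child is up: widen the subscription and forward signals as before.
		signal.Notify(sigs)
	}
	// ... signal-forwarding loop would continue here ...
}
```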
I'm curious tho - is this happening a lot? Is it always the same app?
@linjmeyer - Thanks for the PR!!
It looks like Golang uses SIGURG for preemption. I am not a Go expert and not a developer, but I suspect this signal would come up a lot - here is a reference to the golang code:
// sigPreempt is the signal used for non-cooperative preemption.
// [...]
// We use SIGURG because it meets all of these criteria, is extremely
// unlikely to be used by an application for its "real" meaning (both
// because out-of-band data is basically unused and because SIGURG
// doesn't report which socket has the condition, making it pretty
// useless), and even if it is, the application has to be ready for
// spurious SIGURG. SIGIO wouldn't be a bad choice either, but is more
// likely to be used for real.
const sigPreempt = _SIGURG
Here is another GitHub issue thread that helped explain the issue to me. It seems to be a core design choice to use this signal, and it may not even be coming from outside the process at all.
It looks like containerd also removed this signal in a recent PR.
In the last hour I see it happening around 104 times across our main production environments, though that alone maybe doesn't give a full picture of the issue.
The two instances I looked into showed that the Istio sidecar was started and then killed, and the workload was launched afterwards. The workload got connection refused errors trying to talk to our services, even though our services were up.
It's possible that something else is at play, as I would expect to see a lot more angry users if this happened every few seconds, but we do run a lot of ephemeral pods and it may simply not be affecting certain workloads.
Happy to look into this more next week or guinea pig the fix :smile_cat:
Nice finds! Pretty interesting - I'm thinking a lot, if not all, of them are from the Go runtime using that signal for its own purposes then. I think that PR is still the best way forward. I can say from looking at logs internally that the Python/.NET runtimes don't seem to mind the signals being passed in, so I assume they are ignored. We could do an additional PR to filter them out completely if we find they cause issues with the child proc.
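Something like this sketch is what I have in mind for that follow-up (the `forwardSignals` helper and package name are hypothetical, not Scuttle's actual code): relay everything to the child except SIGURG, which the Go runtime raises for its own preemption.

```go
package scuttle

import (
	"os"
	"os/exec"
	"syscall"
)

// forwardSignals passes every received signal on to the child process,
// silently dropping SIGURG since the Go runtime generates it internally.
func forwardSignals(sigs <-chan os.Signal, cmd *exec.Cmd) {
	for sig := range sigs {
		if sig == syscall.SIGURG {
			continue // drop runtime-preemption noise instead of passing it on
		}
		_ = cmd.Process.Signal(sig)
	}
}
```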
Thank you @linjmeyer and @brettplarson !
@brettplarson v1.3.4 was just released with that bugfix - can you give it a try and let us know if you still have the issue? Thanks!
We are seeing a strange issue with Scuttle in which a mysterious signal, "urgent I/O condition", is being sent to Scuttle and causing the Envoy / Istio sidecar to stop as soon as it's started. The issue is summarized by these log lines - the first showing the istio-proxy is ready, the second showing Scuttle acknowledging this, the third showing the signal being received, and the rest showing that it's quitting.
Here is a full export of the logs from the node and container, as well as our istiod pods:
We are using Scuttle v1.3.1.
My understanding from the man pages is that this signal is SIGURG ("urgent condition on socket"), whose default action is to be ignored.
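As a quick sanity check of that default, a tiny standalone Go program (purely illustrative, unrelated to Scuttle's code) can send SIGURG to itself and keep running:

```go
package main

import (
	"fmt"
	"syscall"
	"time"
)

func main() {
	// The kernel default for SIGURG is to ignore it, and the Go runtime only
	// uses it internally for preemption, so the process keeps running.
	_ = syscall.Kill(syscall.Getpid(), syscall.SIGURG)
	time.Sleep(100 * time.Millisecond)
	fmt.Println("still running after SIGURG")
}
```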
I am trying to get as much information as I can to understand this issue, but ultimately I would like to know: is it possible to ignore this signal, or to increase logging to determine where it's coming from?
Any background on why this would occur is appreciated.
Please let me know, Thank you!