openfaas / of-watchdog

Reverse proxy for STDIO and HTTP microservices
MIT License

panic due to SIGSEGV while killing function process in streaming mode #138

Closed cmacq2 closed 1 year ago

cmacq2 commented 1 year ago

Expected Behaviour

I'd expect the OpenFaaS watchdog to kill the function process reliably, and not to attempt to kill processes that have already terminated.

Current Behaviour

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x4cf2b9]

goroutine 10112 [running]:
os.(*Process).signal(0x896903, {0x91dc68, 0xbbf320})
    /opt/hostedtoolcache/go/1.17.9/x64/src/os/exec_unix.go:64 +0x39
os.(*Process).Signal(...)
    /opt/hostedtoolcache/go/1.17.9/x64/src/os/exec.go:138
os.(*Process).kill(...)
    /opt/hostedtoolcache/go/1.17.9/x64/src/os/exec_posix.go:68
os.(*Process).Kill(...)
    /opt/hostedtoolcache/go/1.17.9/x64/src/os/exec.go:123
github.com/openfaas/of-watchdog/executor.(*ForkFunctionRunner).Run.func1()
    /home/runner/work/of-watchdog/of-watchdog/executor/forking_runner.go:53 +0x9f
created by github.com/openfaas/of-watchdog/executor.(*ForkFunctionRunner).Run
    /home/runner/work/of-watchdog/of-watchdog/executor/forking_runner.go:48 +0x1c8
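
For context, the stack trace points at the ExecTimeout kill goroutine in forking_runner.go racing the main path that waits on the process. A minimal sketch of that racy pattern (illustrative only, not the actual of-watchdog source):

package main

import (
	"log"
	"os/exec"
	"time"
)

func run(execTimeout time.Duration) error {
	cmd := exec.Command("/path/to/my-function-binary")
	if err := cmd.Start(); err != nil {
		return err
	}

	if execTimeout > 0 {
		go func() {
			<-time.After(execTimeout)
			// If the function has already exited and been reaped by
			// cmd.Wait() below, this Kill races the wait: depending on
			// timing it logs "process already finished" or touches
			// process state that is no longer valid.
			if err := cmd.Process.Kill(); err != nil {
				log.Printf("Error killing function due to ExecTimeout %s", err)
			}
		}()
	}

	// Wait releases the process resources while the goroutine above
	// may still hold a reference to them.
	return cmd.Wait()
}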

Steps to Reproduce (for bugs)

  1. Generate an image based on ghcr.io/openfaas/of-watchdog:0.9.6, including an additional static binary
  2. Specify the following environment variables: mode=streaming, content_type=application/json, function_process=/path/to/my-function-binary
  3. Define a Function resource in a Helm chart, additionally specifying max_inflight: "1" and exec_timeout: 0.1s
  4. Deploy this Helm chart to a Kubernetes cluster where logs are being scraped and forwarded to some collector (they end up in Elasticsearch/Kibana in my case).
  5. Subject it to a moderate load using e.g. Gatling (100 concurrent requests per second for 4 minutes); a rough stand-in is sketched after this list.
  6. Observe infrequent crashes/panics in the Watchdog.
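
For step 5, a rough Go stand-in for the Gatling scenario (the gateway URL and request body here are assumptions; substitute your own):

package main

import (
	"bytes"
	"net/http"
	"sync"
	"time"
)

func main() {
	// Hypothetical function route; replace with your deployment's URL.
	const url = "http://gateway.example:8080/function/my-function"
	body := []byte(`{"input": "untrusted"}`)

	var wg sync.WaitGroup
	deadline := time.Now().Add(4 * time.Minute)
	for time.Now().Before(deadline) {
		// Fire roughly 100 concurrent requests, once per second.
		for i := 0; i < 100; i++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				resp, err := http.Post(url, "application/json", bytes.NewReader(body))
				if err == nil {
					resp.Body.Close()
				}
			}()
		}
		time.Sleep(time.Second)
	}
	wg.Wait()
}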

Context

The idea behind the setup is to have a sandboxed process that operates on untrusted input, hence the max_inflight (isolation) and exec_timeout (limiting resource consumption) parameters.

The test is primarily meant to observe whether this setup "survives" a continuous load of around 50-100 requests per second without triggering any pod restarts. The current setup is clearly not there yet, so I've been digging through the log output to see what might cause the restarts, and saw the SIGSEGV messages as a result.

I also see some other log output that may or may not be related. There is no direct correlation that I can see; it is more "general output I would not expect to see and that may be of interest":

Error killing function due to ExecTimeout os: process already finished

And:

SIGTERM: no new connections in 10s

As well as:

No new connections allowed, draining: 0 requests

And:

Exiting. Active connections: 0

Your Environment

alexellis commented 1 year ago

Hi @cmacq2

Thanks for your interest in contributing to and using OpenFaaS

We'd generally ask you to take a brief moment to introduce yourself, see also: First impressions - introducing yourself and your use-case

For steps to reproduce, we're looking for a code sample, a repository, a snippet. Something that means a maintainer doesn't have to read between the lines, and where there's no ambiguity.

Once we have a bit more context, we can move forward to verifying the solution.

It seems sensible to me to use a Context as you've done in your PR.

Alex
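
For readers landing here: a minimal sketch of the Context-based approach Alex refers to, assuming the fix follows the standard exec.CommandContext pattern (the actual change lives in cmacq2's PR):

package main

import (
	"context"
	"os/exec"
	"time"
)

func runWithTimeout(execTimeout time.Duration, name string, args ...string) error {
	ctx := context.Background()
	if execTimeout > 0 {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, execTimeout)
		defer cancel()
	}

	// The exec package owns the kill-on-timeout: there is no second
	// goroutine racing Wait() for access to cmd.Process.
	cmd := exec.CommandContext(ctx, name, args...)
	return cmd.Run()
}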