openfaas / nats-queue-worker

Queue-worker for OpenFaaS with NATS Streaming
https://docs.openfaas.com/reference/async/
MIT License

SIGSEGV in queue-worker when remove/deploy (redeploy) function that is being called #43

Closed pyramation closed 5 years ago

pyramation commented 5 years ago

I have a service that is trying to call asynchronous functions at the same time those functions are being re-deployed.

Expected Behaviour

I would expect the error to be caught and handled rather than a segfault / nil pointer dereference.

Current Behaviour

The queue-worker crashes fatally. The good news is that once the functions are re-deployed, everything goes back to normal.

Post http://ping.openfaas-fn.svc.cluster.local:8080/: dial tcp 10.102.218.40:8080: i/o timeout
Callback to: http://gateway.openfaas:8080/function/task-complete
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x69b878]

goroutine 19 [running]:
main.postResult(0xc42005da40, 0x0, 0x0, 0x0, 0x0, 0xc42040cc80, 0x33, 0xc4204071d0, 0x24, 0x0, ...)
    /go/src/github.com/openfaas/nats-queue-worker/main.go:239 +0xb8
main.main.func1(0xc4204099e0)
    /go/src/github.com/openfaas/nats-queue-worker/main.go:122 +0x8ba
github.com/openfaas/nats-queue-worker/vendor/github.com/nats-io/go-nats-streaming.(*conn).processMsg(0xc4200e6380, 0xc42040e280)
    /go/src/github.com/openfaas/nats-queue-worker/vendor/github.com/nats-io/go-nats-streaming/stan.go:751 +0x26f
github.com/openfaas/nats-queue-worker/vendor/github.com/nats-io/go-nats-streaming.(*conn).(github.com/openfaas/nats-queue-worker/vendor/github.com/nats-io/go-nats-streaming.processMsg)-fm(0xc42040e280)
    /go/src/github.com/openfaas/nats-queue-worker/vendor/github.com/nats-io/go-nats-streaming/sub.go:228 +0x34
github.com/openfaas/nats-queue-worker/vendor/github.com/nats-io/go-nats.(*Conn).waitForMsgs(0xc42009e500, 0xc420152240)
    /go/src/github.com/openfaas/nats-queue-worker/vendor/github.com/nats-io/go-nats/nats.go:1652 +0x24a
created by github.com/openfaas/nats-queue-worker/vendor/github.com/nats-io/go-nats.(*Conn).subscribe
    /go/src/github.com/openfaas/nats-queue-worker/vendor/github.com/nats-io/go-nats/nats.go:2374 +0x4de
rpc error: code = Unknown desc = Error: No such container: e99f5b51bcca8ca95b02291e051c9c2b1320e8c69ca7086d350f273cd63209a8
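
In the trace, the second argument to postResult is 0x0, which suggests the *http.Response from the failed function call is nil when its fields are read. The following is a minimal, hypothetical Go sketch (not the queue-worker's actual code) of the guard this issue is asking for: treat a transport error such as the i/o timeout above as a handled failure instead of dereferencing the nil response. The invoke helper, its signature, and the chosen status code are assumptions for illustration only.

package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"time"
)

// invoke is a hypothetical helper, not the real queue-worker code.
func invoke(client *http.Client, url string) ([]byte, int, error) {
	res, err := client.Post(url, "application/octet-stream", nil)
	if err != nil {
		// res is nil on a transport error (e.g. dial tcp ... i/o timeout);
		// touching it here is what would trigger the SIGSEGV.
		return nil, http.StatusServiceUnavailable, fmt.Errorf("cannot reach %s: %v", url, err)
	}
	defer res.Body.Close()

	body, readErr := ioutil.ReadAll(res.Body)
	if readErr != nil {
		return nil, res.StatusCode, readErr
	}
	return body, res.StatusCode, nil
}

func main() {
	client := &http.Client{Timeout: 3 * time.Second}
	body, status, err := invoke(client, "http://ping.openfaas-fn.svc.cluster.local:8080/")
	if err != nil {
		// Report the failure and carry on; this is the behaviour the issue
		// asks for instead of a crash.
		fmt.Printf("invocation failed (status %d): %v\n", status, err)
		return
	}
	fmt.Printf("status %d, %d bytes\n", status, len(body))
}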

Steps to Reproduce (for bugs)

  1. create a function foo
  2. create a service that calls the foo function asynchronously every few seconds (a rough sketch of such a caller follows these steps)
  3. tail the queue-worker logs
  4. now, faas remove then faas deploy the function foo
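
For step 2, the repository linked later in this thread has the full reproduction; as a rough stand-in, here is a hedged Go sketch of a caller that queues foo through the gateway's async route every few seconds. The gateway address, payload, and interval are assumptions, not taken from the issue.

package main

import (
	"fmt"
	"net/http"
	"strings"
	"time"
)

func main() {
	gateway := "http://127.0.0.1:8080" // assumed gateway address, adjust as needed
	client := &http.Client{Timeout: 5 * time.Second}

	for {
		// POST to the gateway's async route so the request goes through NATS
		// and the queue-worker rather than being invoked synchronously.
		res, err := client.Post(gateway+"/async-function/foo", "text/plain", strings.NewReader("ping"))
		if err != nil {
			fmt.Println("invoke failed:", err)
		} else {
			res.Body.Close()
			fmt.Println("queued, status:", res.StatusCode) // 202 Accepted when queued
		}
		time.Sleep(3 * time.Second)
	}
}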

Context

I'm implementing a jobs system. If any of those jobs needs to be re-deployed, I imagine having to remove and deploy in production.

There may be a more graceful way to remove and deploy, but even so, we wouldn't want something as core as the nats-queue-worker to crash fatally in this situation: other functions that need to be called asynchronously may still be deployed, and it would be hard to notify external services and tell them to stop calling the functions during a (re)deployment.

Your Environment

Docker for Mac (Kubernetes)

alexellis commented 5 years ago

I'll see if someone can take a look at this.

Can you give instructions for reproducing it, i.e. a code repo? It's not clear which function you are removing or which one is running every few seconds.

We should be able to sort out the nil pointer.

pyramation commented 5 years ago

@alexellis sure thing, let me get an example going.

pyramation commented 5 years ago

@alexellis https://github.com/pyramation/nats-queue-issue-43

pyramation commented 5 years ago

Specifically, here is where the crash happens: https://github.com/pyramation/nats-queue-issue-43/blob/master/scripts/crash.sh#L26

alexellis commented 5 years ago

One thing to note is that Kubernetes doesn't do anything like instant deletion, so I recommend using a rolling update instead of a delete/create.

alexellis commented 5 years ago

I believe this is resolved with the latest changes in master. Please can someone confirm?

LucasRoesler commented 5 years ago

For future reference, the related PR is #44

ewilde commented 5 years ago

I verified that the script supplied by @pyramation causes a panic before #44. @alexellis, I also verified that the fix from #44 works and the panic does not occur when running the same script.

ewilde commented 5 years ago

Derek close: fixed in latest release

alexellis commented 5 years ago

Derek close

alexellis commented 5 years ago

Wonder if you're in the file @ewilde?