openfaas / faas-swarm

OpenFaaS provider for Docker Swarm
https://github.com/openfaas/faas
MIT License

Auto-restart functions #69

Open PeriGK opened 4 years ago

PeriGK commented 4 years ago

My actions before raising this issue

Hi,

I have written a few functions in OpenFaaS, some of which are not used very frequently. This morning I tried to send a request to a function that had not been touched (no HTTP request, build, or deploy) for a few weeks.

The function never came up.

Some facts that came from my investigation:

The `docker service ps` command returns a Shutdown state. `docker service inspect` returns a MaxAttempts of 5 in the RestartPolicy, which may or may not be related.
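
For reference, the restart policy can be read straight from the service spec. The following is a minimal sketch, assuming a service named wordcount; `max_attempts` is a hypothetical helper name used only for illustration, not something shipped with Docker or OpenFaaS:

```shell
# Hypothetical helper: print the MaxAttempts value found in RestartPolicy JSON on stdin.
max_attempts() {
  sed -n 's/.*"MaxAttempts":[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Usage against a live swarm (assumes the service is called "wordcount"):
#   docker service inspect --format '{{json .Spec.TaskTemplate.RestartPolicy}}' wordcount | max_attempts
```

When MaxAttempts is set, swarm stops restarting a task after that many failed attempts, which would be consistent with the Shutdown state, although that link is not confirmed here.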

Since this is a local environment, I shut down my machine every night, which I suppose affects the issue one way or another.

Are any of those related? What about the read_timeout/write_timeout settings?

Expected Behaviour

The function should recover in response to the invoke/HTTP request, with some expected delay.

Everything goes back to normal if I build and deploy again from the faas-cli (a new function with the same contents as the old one), but I would like this to happen without any manual intervention.

Current Behaviour

The function does not recover from the down state.

Possible Solution

Steps to Reproduce (for bugs)

  1. Set up a local function. A plain return {"hello": "world"} would suffice.
  2. Leave the function idle for a couple of hours and make sure there is at least one machine restart in the meantime. You may force it by shutting down the docker service which serves the function.
  3. Try to reach the function again.
  4. The function does not wake up.

Context

I understand this is a common concern, so I don't think this is a bug, but rather a gap in my understanding or in the documentation.

So my questions are:

Your Environment

CLI:
 commit:  73004c23e5a4d3fdb7352f953247473477477a64
 version: 0.11.3

Gateway
 uri:     http://127.0.0.1:8080
 version: 0.18.10
 sha:     80b6976c106370a7081b2f8e9099a6ea9638e1f3
 commit:  Update Golang versions to 1.12

Provider
 name:          faas-swarm
 orchestration: swarm
 version:       0.8.2
 sha:           47988f8ba284678f3eb86eb62f75f72bafeec4d9

Your faas-cli version (0.11.3) may be out of date. Version: 0.12.2 is now available on GitHub.


* Docker version `docker version` (e.g. Docker 17.0.05 ):
```docker version
Client: Docker Engine - Community
 Version:           19.03.1
 API version:       1.40
 Go version:        go1.12.5
 Git commit:        74b1e89
 Built:             Thu Jul 25 21:21:05 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.1
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.5
  Git commit:       74b1e89
  Built:            Thu Jul 25 21:19:41 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.6
  GitCommit:        894b81a4b802e4eb2a91d1ce216b8817763c29fb
 runc:
  Version:          1.0.0-rc8
  GitCommit:        425e105d5a03fabd737a126ad93d62a9eeede87f
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683
```

Thanks, P.

alexellis commented 4 years ago

Hi thanks for your interest

Unfortunately unless you fill out the issue template including "Steps to Reproduce", then we're unlikely to be able to help.

Please could you do that?

Thanks

PeriGK commented 4 years ago

Hi @alexellis sorry about that, I forgot. I fixed it now

PeriGK commented 4 years ago

More details:

I spotted that the containers that exhibit the problem show the following error in docker service ps service_name. Check the ERROR column:

ID                  NAME                IMAGE                     NODE                DESIRED STATE       CURRENT STATE         ERROR                              PORTS
3omhdi90dsvg        wordcount.1         functions/alpine:latest   moro-dell           Shutdown            Failed 2 months ago   "No such container: wordcount.…"   
obee7cwi3ffa         \_ wordcount.1     functions/alpine:latest   moro-dell           Shutdown            Failed 2 months ago   "No such container: wordcount.…"   
nf1i6pwoct3v         \_ wordcount.1     functions/alpine:latest   moro-dell           Shutdown            Failed 2 months ago   "No such container: wordcount.…"   
sai0wpxjx0z3         \_ wordcount.1     functions/alpine:latest   moro-dell           Shutdown            Failed 2 months ago   "No such container: wordcount.…"   
j9c0yzil6idv         \_ wordcount.1     functions/alpine:latest   moro-dell           Shutdown            Failed 2 months ago   "No such container: wordcount.…"   
opfuml973pxs         \_ wordcount.1     functions/alpine:latest   moro-dell           Shutdown            Failed 2 months ago   "No such container: wordcount.…"   
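
A listing like the one above can also be summarized from the command line. This is only a sketch under stated assumptions: `count_failed_tasks` is a hypothetical helper name, and the `|` separator in the format string is an arbitrary choice made here so the fields are easy to split.

```shell
# Hypothetical helper: count tasks whose current state starts with "Failed".
# Input on stdin: one task per line, formatted as NAME|CURRENT_STATE|ERROR.
count_failed_tasks() {
  awk -F '|' '$2 ~ /^Failed/ { n++ } END { print n + 0 }'
}

# Usage against a live swarm (assumes the service is called "wordcount"):
#   docker service ps --no-trunc --format '{{.Name}}|{{.CurrentState}}|{{.Error}}' wordcount | count_failed_tasks
```

The `--no-trunc` flag is included so the error text is printed in full instead of being cut off as in the table.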

PeriGK commented 4 years ago

Hi @alexellis

I did some more investigation. I managed to reproduce it with a function which was working in the afternoon but not the next morning. It looks like the swarm manager couldn't recover the container after I shut down my machine.

Not sure if you have any input on that, but I don't see any other explanation.

Thanks, P.