openfaas / faas

OpenFaaS - Serverless Functions Made Simple
https://www.openfaas.com
MIT License

Support: user query on how to set timeouts #1016

Closed teo-chenglim closed 5 years ago

teo-chenglim commented 5 years ago

The OpenFaaS timeout issue on GKE is capped at 30 seconds. It doesn't seem to be an OpenFaaS setting but a Kubernetes setting. More than 10 users have reported the same thing, and the issues are always closed with no response. I am using Helm for deployment.

Please help to take a look this time. I am providing comprehensive troubleshooting steps; please update the troubleshooting guide to include this as needed.

I do understand that the fix is to set the OpenFaaS timeouts. In fact, these are my current settings. Why did I put them in both locations? Because some documents say they go outside of environment. I tried with and without quotes, and with and without the "s" suffix, in every combination.

functions:
  duckling:                  ## The openfaas endpoint name
    lang: dockerfile
    handler: ./function             ## The folder of your Dockerfile, source code and requirement.txt
    image: gcr.io/amaris-kubernetes-prod/duckling:latest ## Caution! this is to upload to public repo
    labels:
      com.openfaas.scale.min: 1
      com.openfaas.scale.max: 15
      com.openfaas.scale.factor: 50
    read_timeout: "300s"  # default is 5s, Maximum time to read HTTP request
    write_timeout: "300s"  # default is 5s, Maximum time to write HTTP response
    upstream_timeout: "300s" # Maximum duration of upstream function call
    exec_timeout: "300s"
    environment:
      read_timeout: "300s"  # default is 5s, Maximum time to read HTTP request
      write_timeout: "300s"  # default is 5s, Maximum time to write HTTP response
      upstream_timeout: "300s" # Maximum duration of upstream function call
      exec_timeout: "300s"
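
For reference, a quick way to check which of these values actually reach the running function is to inspect the pod's environment and the watchdog's startup log. This is only a sketch; it assumes faas-netes labels function pods with faas_function=<name>, so adjust the selector if your deployment differs:

# find the duckling pod and dump its timeout-related environment variables
POD=$(kubectl get pods -n openfaas-fn -l faas_function=duckling -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n openfaas-fn "$POD" -- env | grep -i timeout

# the classic watchdog also prints its effective read/write timeouts on startup
kubectl logs -n openfaas-fn "$POD" | head -n 3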

Expected Behaviour

Program running inside a Kubernetes pod:

ubuntu@duckling-85f846df8c-bn62l:~/function$ time echo "Today is Monday" | python3 index.py 
[{'dim': 'time', 'text': 'Monday', 'start': 9, 'end': 15, 'value': {'value': '2019-01-14T00:00:00.000Z', 'grain': 'day', 'others': [{'grain': 'day', 'value': '2019-01-07T00:00:00.000Z'}, {'grain': 'day', 'value': '2019-01-14T00:00:00.000Z'}, {'grain': 'day', 'value': '2019-01-21T00:00:00.000Z'}]}}, {'dim': 'time', 'text': 'Today', 'start': 0, 'end': 5, 'value': {'value': '2019-01-07T00:00:00.000Z', 'grain': 'day', 'others': [{'grain': 'day', 'value': '2019-01-07T00:00:00.000Z'}]}}]

real    2m50.517s [about 170 seconds]
user    1m0.818s
sys 0m1.127s

$ time curl -s https://[endpoint]/function/duckling -d "today meeting is at 11am"

<html><head>
<meta http-equiv="content-type" content="text/html;charset=utf-8">
<title>502 Server Error</title>
</head>
<body text=#000000 bgcolor=#ffffff>
<h1>Error: Server Error</h1>
<h2>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.</h2>
<h2></h2>
</body></html>

real    0m30.067s
user    0m0.014s
sys 0m0.000s
$ time curl -s https://[endpoint]/function/duckling -d "today meeting is at 11am"

<html><head>
<meta http-equiv="content-type" content="text/html;charset=utf-8">
<title>502 Server Error</title>
</head>
<body text=#000000 bgcolor=#ffffff>
<h1>Error: Server Error</h1>
<h2>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.</h2>
<h2></h2>
</body></html>

real    0m30.064s
user    0m0.007s
sys 0m0.008s

Kubernetes has a timeout of 30 seconds. I have tried many places but still can't find where it is set.

$ kubectl logs -n openfaas-fn deploy/duckling
2019/01/07 07:28:39 Version: 0.9.14 SHA: a65df4795bc66147c41161c48bfd4c72f60c7434
2019/01/07 07:28:39 Read/write timeout: 5m0s, 5m0s. Port: 8080
2019/01/07 07:28:39 Writing lock-file to: /tmp/.lock
2019/01/07 07:28:55 Forking fprocess.
2019/01/07 07:30:11 Wrote 690 Bytes - Duration: 75.473247 seconds
2019/01/07 07:30:24 Forking fprocess.
2019/01/07 07:33:19 Wrote 890 Bytes - Duration: 175.281187 seconds
2019/01/07 07:36:38 Forking fprocess.
2019/01/07 07:37:41 Wrote 890 Bytes - Duration: 62.304900 seconds
2019/01/07 07:39:18 Forking fprocess.
2019/01/07 07:40:24 Wrote 890 Bytes - Duration: 66.362533 seconds
2019/01/07 07:54:00 Forking fprocess.
2019/01/07 07:55:04 Wrote 890 Bytes - Duration: 64.284393 seconds
2019/01/07 07:55:21 Forking fprocess.
2019/01/07 07:56:26 Wrote 890 Bytes - Duration: 64.773979 seconds
2019/01/07 07:59:21 Forking fprocess.
2019/01/07 08:00:22 Wrote 890 Bytes - Duration: 61.696822 seconds
2019/01/07 08:01:15 Forking fprocess.
2019/01/07 08:01:15 Wrote 25 Bytes - Duration: 0.073726 seconds [I changed my function here to just write output]
2019/01/07 08:01:40 Forking fprocess.
2019/01/07 08:02:44 Wrote 25 Bytes - Duration: 63.800074 seconds
2019/01/07 08:16:14 Forking fprocess.
2019/01/07 08:17:20 Wrote 25 Bytes - Duration: 66.350439 seconds
2019/01/07 08:35:14 Forking fprocess.
2019/01/07 08:36:15 Wrote 25 Bytes - Duration: 61.254206 seconds
2019/01/07 09:22:09 Forking fprocess.
2019/01/07 09:23:18 Wrote 25 Bytes - Duration: 69.571718 seconds
2019/01/07 09:25:40 Forking fprocess.

Current Behaviour

Possible Solution

Steps to Reproduce (for bugs)

  1. Create a Kubernetes cluster
  2. Use Helm to deploy OpenFaaS
  3. Create an nginx Ingress with HTTPS

Context

Your Environment

$ faas-cli version
  ___                   _____           ____
 / _ \ _ __   ___ _ __ |  ___|_ _  __ _/ ___|
| | | | '_ \ / _ \ '_ \| |_ / _` |/ _` \___ \
| |_| | |_) |  __/ | | |  _| (_| | (_| |___) |
 \___/| .__/ \___|_| |_|_|  \__,_|\__,_|____/
      |_|

CLI:
 commit:  b24c5763d9b61e0c04018a722f8f2f765498f18a
 version: 0.7.8
$ docker version
Client:
 Version:           18.09.0
 API version:       1.39
 Go version:        go1.10.4
 Git commit:        4d60db4
 Built:             Wed Nov  7 00:49:01 2018
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.0
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.4
  Git commit:       4d60db4
  Built:            Wed Nov  7 00:16:44 2018
  OS/Arch:          linux/amd64
  Experimental:     false

FaaS-Netes (Helm), nginx ingress

Ubuntu 16.04

Any program that runs for more than 30 seconds breaks.

I have followed the troubleshooting guide (https://github.com/openfaas/faas/blob/master/guide/troubleshooting.md) more than 20 times. Here is the diagnostic information I have:

$ kubectl get deploy -n openfaas  
NAME           READY   UP-TO-DATE   AVAILABLE   AGE
alertmanager   1/1     1            1           82d
caddy-tls      1/1     1            1           82d
faas-netesd    1/1     1            1           82d
gateway        1/1     1            1           82d
kube-lego      1/1     1            1           82d
nats           1/1     1            1           82d
prometheus     1/1     1            1           82d
queue-worker   3/3     3            3           82d

Possible gateway timeout

$ kubectl get deploy/gateway -n openfaas -o yaml | grep timeout -A2
        - name: write_timeout
          value: 300s
        - name: read_timeout
          value: 300s
        - name: upstream_timeout
          value: 240s
        image: functions/gateway:0.7.8

Possible Ingress timeout

$ kubectl get ingress -n openfaas -o yaml | grep timeout
      ingress.kubernetes.io/proxy-connect-timeout: "300"
      ingress.kubernetes.io/proxy-read-timeout: "300"
      ingress.kubernetes.io/proxy-send-timeout: "300"
      ingress.kubernetes.io/upstream-fail-timeout: "300"
      nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
      nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
teo-chenglim commented 5 years ago

It seems like only alertmanager and prometheus have a 30-second timeout configured, but I am not sure how to edit it.

chenglim@chenglim-GL503VM:/faas-netes/chart/openfaas$ grep -in 30 */*
templates/alertmanager-cfg.yaml:18:      repeat_interval: 30s
templates/alertmanager-dep.yaml:35:            - --timeout=30
templates/alertmanager-dep.yaml:38:          timeoutSeconds: 30
templates/alertmanager-dep.yaml:45:            - --timeout=30
templates/alertmanager-dep.yaml:48:          timeoutSeconds: 30
templates/prometheus-dep.yaml:34:            - --timeout=30
templates/prometheus-dep.yaml:37:          timeoutSeconds: 30
templates/prometheus-dep.yaml:44:            - --timeout=30
templates/prometheus-dep.yaml:47:          timeoutSeconds: 30
teo-chenglim commented 5 years ago
$ kubectl get configmap/alertmanager-config -n openfaas -o yaml
apiVersion: v1
data:
  alertmanager.yml: |
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 5s
      group_interval: 10s
      repeat_interval: 40s
      receiver: scale-up
      routes:
      - match:
          service: gateway
          receiver: scale-up
          severity: major
    inhibit_rules:
    - source_match:
        severity: 'critical'
      target_match:
        severity: 'warning'
      equal: ['alertname', 'cluster', 'service']
    receivers:
    - name: 'scale-up'
      webhook_configs:
        - url: http://gateway.openfaas:8080/system/alert
          send_resolved: true
kind: ConfigMap
metadata:
  creationTimestamp: "2018-10-17T01:51:38Z"
  labels:
    app: alertmanager
  name: alertmanager-config
  namespace: openfaas
  resourceVersion: "9657074"
  selfLink: /api/v1/namespaces/openfaas/configmaps/alertmanager-config
  uid: 2f87f660-d1af-11e8-a38b-42010a940128
$ kubectl get deploy/alertmanager -n openfaas -o yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "2"
  creationTimestamp: "2018-10-17T01:51:38Z"
  generation: 2
  labels:
    app: alertmanager
  name: alertmanager
  namespace: openfaas
  resourceVersion: "9657435"
  selfLink: /apis/extensions/v1beta1/namespaces/openfaas/deployments/alertmanager
  uid: 2f8a166b-d1af-11e8-a38b-42010a940128
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: alertmanager
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: alertmanager
    spec:
      containers:
      - command:
        - alertmanager
        - --config.file=/alertmanager.yml
        - --storage.path=/alertmanager
        image: prom/alertmanager:v0.15.0-rc.0
        imagePullPolicy: Always
        name: alertmanager
        ports:
        - containerPort: 9093
          protocol: TCP
        resources:
          limits:
            memory: 128Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /alertmanager.yml
          name: alertmanager-config
          subPath: alertmanager.yml
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 40
      volumes:
      - configMap:
          defaultMode: 420
          items:
          - key: alertmanager.yml
            mode: 420
            path: alertmanager.yml
          name: alertmanager-config
        name: alertmanager-config
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2018-12-23T15:01:48Z"
    lastUpdateTime: "2018-12-23T15:01:48Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2018-10-17T01:51:38Z"
    lastUpdateTime: "2019-01-08T01:01:55Z"
    message: ReplicaSet "alertmanager-7c66b6fc5d" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 2
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1
$ 
teo-chenglim commented 5 years ago

Where do I edit the "readinessProbe" and "livenessProbe" from templates/alertmanager-dep.yaml without restarting the server? Do they have anything to do with the 30-second timeout?

alexellis commented 5 years ago

Hi @teo-chenglim,

Thank you for using OpenFaaS.

The OpenFaaS timeout issue on GKE is capped at 30 seconds. It doesn't seem to be an OpenFaaS setting but a Kubernetes setting. More than 10 users have reported the same thing, and the issues are always closed with no response.

I am not sure what you are referring to. Can you please link to these 10 users?

If you would like help with setting timeouts I would suggest you read the troubleshooting guide very carefully and work methodically to make sure your timeouts are set everywhere where they are needed. https://docs.openfaas.com/deployment/troubleshooting/

I know that the timeouts work because we use them for OpenFaaS Cloud for doing container builds which take several minutes.

Example from OpenFaaS Cloud:

helm repo add openfaas https://openfaas.github.io/faas-netes

helm repo update && \
helm upgrade openfaas --install openfaas/openfaas \
    --namespace openfaas  \
    --set basic_auth=true \
    --set functionNamespace=openfaas-fn \
    --set ingress.enabled=true \
    --set gateway.scaleFromZero=true \
    --set gateway.readTimeout=300s \
    --set gateway.writeTimeout=300s \
    --set gateway.upstreamTimeout=295s \
    --set faasnetesd.readTimeout=300s \
    --set faasnetesd.writeTimeout=300s \
    --set gateway.replicas=2 \
    --set queueWorker.replicas=2

You then need to set read_timeout and write_timeout on each function you build, using a Golang duration such as 5s or 5m for 5 seconds or 5 minutes.
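
A minimal sketch of that per-function setting when deploying from the CLI (duckling is the example function from this issue; the environment section of stack.yml works the same way):

# add --gateway https://<your gateway> if not using the default
faas-cli deploy --name duckling \
    --image gcr.io/[project_name]/duckling:latest \
    --env read_timeout=5m \
    --env write_timeout=5m \
    --env exec_timeout=5m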

Alex

alexellis commented 5 years ago

@martindekov you recently helped out another user with exactly the same issue, is there anything you can add?

Alex

alexellis commented 5 years ago

I'm going to close the issue because I'm not aware of any issues with configuring timeouts and we want to keep the issue tracker for urgent defects or features for the project. Please keep commenting @teo-chenglim and let us know how you get on.

martindekov commented 5 years ago

Hello @teo-chenglim !

Can you please remove the quotation marks around the timeouts, i.e. "300s" -> 300s. As for whether you should put an "s" after the values, the answer is yes.

When I put quotation marks around the timeout values I get a 504 error, so I think this might be your issue.

Try kubectl describe svc <your service name here> -n openfaas-fn and check whether the values are present in this section:

...
    Environment:
      fprocess:       python index.py
      write_timeout:  60s
      read_timeout:   60s
...

Let me know the result, thanks!

rgee0 commented 5 years ago

I'd also suggest pulling down a later stack. gateway:0.7.8 is nearly a year old now. There have been an awful lot of changes in that time.
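
A rough sketch of what that upgrade could look like, assuming the chart was installed as a Helm release named openfaas from the openfaas/openfaas repo (adjust the release name to match your install):

helm repo update
# --reuse-values keeps the existing timeout settings while pulling newer chart defaults and images
helm upgrade openfaas openfaas/openfaas \
    --namespace openfaas \
    --reuse-values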

teo-chenglim commented 5 years ago

Hi all,

Thank you for the replies.

It still doesn't work. I think I have tried everything you suggested.

$ kubectl get deploy/faas-netesd -n openfaas -o yaml | grep -i timeout -A1
        - name: write_timeout
          value: 300s
        - name: read_timeout
          value: 300s
$ kubectl get ingress -n openfaas -o yaml | grep timeout
      ingress.kubernetes.io/proxy-connect-timeout: 300s
      ingress.kubernetes.io/proxy-read-timeout: 300s
      ingress.kubernetes.io/proxy-send-timeout: 300s
      ingress.kubernetes.io/upstream-fail-timeout: 300s
      nginx.ingress.kubernetes.io/proxy-read-timeout: 600s
      nginx.ingress.kubernetes.io/proxy-send-timeout: 600s
$ kubectl get deploy/gateway -n openfaas -o yaml | grep -i timeout -A1
        - name: write_timeout
          value: 300s
        - name: read_timeout
          value: 300s
        - name: upstream_timeout
          value: 240s
$ 

Inside of containers

ubuntu@duckling-849cd6445f-gj7cs:~/function$ time echo "Lets meet today 11:00pm" | python index.py 
[{'dim': 'number', 'text': '11', 'start': 16, 'end': 18, 'value': {'value': 11.0}}, {'dim': 'number', 'text': '00', 'start': 19, 'end': 21, 'value': {'value': 0.0}}, {'dim': 'distance', 'text': '11', 'start': 16, 'end': 18, 'value': {'value': 11.0, 'unit': None}}, {'dim': 'distance', 'text': '00', 'start': 19, 'end': 21, 'value': {'value': 0.0, 'unit': None}}, {'dim': 'volume', 'text': '11', 'start': 16, 'end': 18, 'value': {'value': 11.0, 'unit': None, 'latent': True}}, {'dim': 'volume', 'text': '00', 'start': 19, 'end': 21, 'value': {'value': 0.0, 'unit': None, 'latent': True}}, {'dim': 'temperature', 'text': '11', 'start': 16, 'end': 18, 'value': {'value': 11.0, 'unit': None}}, {'dim': 'temperature', 'text': '00', 'start': 19, 'end': 21, 'value': {'value': 0.0, 'unit': None}}, {'dim': 'time', 'text': 'today 11:00pm', 'start': 10, 'end': 23, 'value': {'value': '2019-01-09T23:00:00.000Z', 'grain': 'minute', 'others': [{'grain': 'minute', 'value': '2019-01-09T23:00:00.000Z'}]}}]

real    1m50.534s
user    0m56.060s
sys 0m0.942s
amaris@duckling-849cd6445f-gj7cs:~/function$ env | grep -i timeout
exec_timeout=300s
read_timeout=300s
upstream_timeout=300s
write_timeout=300s

Using curl, it still times out:

$ time curl --connect-timeout 300 -s https://faas.amaris.ai/function/duckling -d "Lets meet today at 11:00pm"

<html><head>
<meta http-equiv="content-type" content="text/html;charset=utf-8">
<title>502 Server Error</title>
</head>
<body text=#000000 bgcolor=#ffffff>
<h1>Error: Server Error</h1>
<h2>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.</h2>
<h2></h2>
</body></html>

real    0m30.067s
user    0m0.007s
sys 0m0.007s

Watchdog version

RUN curl -sSL https://github.com/openfaas/faas/releases/download/0.9.14/fwatchdog > /usr/bin/fwatchdog \
    && chmod +x /usr/bin/fwatchdog
teo-chenglim commented 5 years ago

@martindekov my error is 502, not 504.

rgee0 commented 5 years ago

If you are unable to share the repo, could you at least provide the stack.yml definition of the duckling function, please?

teo-chenglim commented 5 years ago

I don't think it will be helpful, but here it is.

provider:
  name: faas
  gateway: https://endpoint

functions:
  duckling:     
    lang: dockerfile
    handler: ./function             
    image: gcr.io/[project_name]/duckling:latest 
    labels:
      com.openfaas.scale.min: 1
      com.openfaas.scale.max: 15
      com.openfaas.scale.factor: 50
    read_timeout: 300s  
    write_timeout: 300s  
    upstream_timeout: 300s 
    exec_timeout: 300s
    environment:
      read_timeout: 300s 
      write_timeout: 300s  
      upstream_timeout: 300s 
      exec_timeout: 300s
rgee0 commented 5 years ago

Thanks. It's looking increasingly as though this may be outside of OpenFaaS.

From the initial detail we can see that the function does return after 60-70 seconds. We can also see that the timeouts within OpenFaaS have been configured to lengths much greater than the 60-70 seconds.

I think there may be value in looking at how nginx is configured or anything in front of that, such as a load balancer, for example.
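
One way to narrow down which layer is answering after 30 seconds is to look at the response headers rather than just the HTML body; a quick sketch, using the [endpoint] placeholder from earlier in the thread:

# -i prints the response headers; the Server/Via headers often reveal whether
# nginx, Caddy, or the Google frontend generated the 502 error page
time curl -si https://[endpoint]/function/duckling -d "today meeting is at 11am" | head -n 10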

teo-chenglim commented 5 years ago

Hi @rgee0, yes. I tried the ingress and nginx deployment settings, but they don't help either:

$ kubectl get ingress -n openfaas -o yaml | grep timeout
ingress.kubernetes.io/proxy-connect-timeout: 300s
ingress.kubernetes.io/proxy-read-timeout: 300s
ingress.kubernetes.io/proxy-send-timeout: 300s
ingress.kubernetes.io/upstream-fail-timeout: 300s
nginx.ingress.kubernetes.io/proxy-read-timeout: 600s
nginx.ingress.kubernetes.io/proxy-send-timeout: 600s

$ kubectl get deploy/caddy-tls -n openfaas -o yaml | grep -i timeout
    ingress.kubernetes.io/proxy-connect-timeout: 300s
    ingress.kubernetes.io/proxy-read-timeout: 300s
    ingress.kubernetes.io/proxy-send-timeout: 300s
    ingress.kubernetes.io/upstream-fail-timeout: 300s
    nginx.ingress.kubernetes.io/proxy-read-timeout: 600s
    nginx.ingress.kubernetes.io/proxy-send-timeout: 600s
teo-chenglim commented 5 years ago

I also tried the Caddy timeouts:

$ kubectl exec -n openfaas -it caddy-tls-95f8456b6-fgsw5 -- ash
/www # ps auxf
PID   USER     TIME   COMMAND
    1 root       0:02 caddy -agree --conf /Caddyfile
   16 root       0:00 ash
   20 root       0:00 ps auxf
/www #
/www # cat /Caddyfile 
:80 {
    status 200 /healthz
    basicauth /system {$ADMIN_USER} {$ADMIN_PASSWORD}
    basicauth /ui {$ADMIN_USER} {$ADMIN_PASSWORD}
    proxy / gateway:8080 {
            transparent
        }
    timeouts 300s ### added this!
    errors stderr
    tls off
}
/www # 
teo-chenglim commented 5 years ago

The Caddy LB is from @stefanprodan's repo; I have also opened an issue there: https://github.com/stefanprodan/openfaas-gke/issues/6

teo-chenglim commented 5 years ago

Found it: the GKE load balancer's default timeout is 30 seconds. It is neither Kubernetes nor OpenFaaS!

Thank you @alexellis @rgee0 @martindekov

https://cloud.google.com/load-balancing/docs/backend-service#create-backend-service

I edited the load balancer timeout on GCP.
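
For anyone else hitting this, a rough sketch of how that timeout can be raised with gcloud, assuming the traffic goes through a GCP HTTP(S) load balancer backend service (the backend service name differs per cluster, so list them first):

# list the backend services, then raise the response timeout on the relevant one
gcloud compute backend-services list
gcloud compute backend-services update <backend-service-name> --global --timeout=300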

rgee0 commented 5 years ago

Derek lock

alexellis commented 5 years ago

Hi,

It looks like GCP has a 30 second limit on their LoadBalancer as you identified - https://cloud.google.com/load-balancing/docs/https/#timeouts_and_retries

Can you check that you've configured this correctly and reach out to Google support if in doubt?

One thing you can try to test your settings is to use kubectl port-forward and bring the gateway to your local machine to see if the timeout issue is with your GKE LB or the configuration in OpenFaaS.
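
A sketch of that port-forward test, which bypasses the load balancer and ingress entirely and talks to the gateway service directly:

# forward the gateway to localhost, then invoke the function without the LB in the path
kubectl port-forward -n openfaas svc/gateway 8080:8080 &
time curl -s http://127.0.0.1:8080/function/duckling -d "today meeting is at 11am"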

You could also test locally with minikube. Hope these suggestions may be useful to you.

Alex