Closed aupadh12 closed 2 years ago
Hello,
I also noticed that whenever I try to access the shinyproxy application, the skipper ingress pod restarts with this error:
Readiness probe failed: Get "http://100.64.144.197:9999/kube-system/healthz": read tcp 100.64.237.71:37248->100.64.144.197:9999: read: connection reset by peer
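For reference, the failing check above corresponds to a readiness probe on the skipper container roughly like the following sketch. Only the path and port are taken from the error message; the timing values are assumptions and should match whatever the actual daemonset uses:

```yaml
# Sketch of the readiness probe implied by the error above.
# path and port come from the probe URL in the error; the
# timing values below are placeholders, not the real config.
readinessProbe:
  httpGet:
    path: /kube-system/healthz
    port: 9999
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
```

Note that "connection reset by peer" on a probe usually means the skipper process died mid-request (e.g. it was OOMKilled), not that the probe itself is misconfigured.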
Hi,
I also noticed that, in the output below, the pod mounts the kube-api-access projected volume instead of the skipper ingress secret.
Environment:  <none>
Mounts:
  /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-k9rd4 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-k9rd4:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
I am not sure whether this is actually an issue, since the setup where there is no problem does have the skipper ingress secret mounted. The two setups are almost the same, though, so I am not sure what is missing.
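For what it's worth, a kube-api-access-* projected volume is just the default service-account token mount that newer Kubernetes versions (1.21+) inject into every pod; it does not replace an explicitly declared secret mount. If the working setup mounts a dedicated secret, that would be declared in the daemonset roughly like the sketch below. All names here are hypothetical, not taken from the actual manifests:

```yaml
# Hypothetical explicit secret mount for the skipper container;
# volume name, mount path, and secret name are illustrative only.
containers:
- name: skipper-ingress
  volumeMounts:
  - name: skipper-ingress-secret     # hypothetical volume name
    mountPath: /etc/skipper/secret   # hypothetical mount path
    readOnly: true
volumes:
- name: skipper-ingress-secret
  secret:
    secretName: skipper-ingress      # hypothetical secret name
```

Comparing the Volumes section of the working and broken pods against the daemonset spec should show whether such a volume is missing from the broken one.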
Hi
This is a surprising issue. I just did some tests with k8s 1.21 and ingress 1.1.2 and I don't experience the issue you are describing. You mention the skipper pod gets killed because it goes out of memory (OOMKilled). Typically this can have two reasons:
1. Skipper simply needs more memory than the limit configured on the pod.
2. Skipper contains a bug that makes it keep allocating memory (a memory leak).
It would be good to know the reason for the current issue (i.e. 1 or 2). However, since it seems you can easily reproduce the issue, I guess it will be reason 1. In that case you should be able to see in your monitoring (e.g. Prometheus) that the process in the pod tries to allocate more memory. Can you verify this? If this is the case, I would try launching the daemonset with a bigger memory limit and see whether the issue comes back.
Finally, you could try running the latest version of Skipper. Maybe they already solved a similar issue. (https://github.com/zalando/skipper/releases)
Hello,
I upgraded my node type to m5.8xlarge and I still see this issue. Strangely, on K8s 1.20 with ingress 1.0.0 this works fine even on an m5.4xlarge node type, so I am not sure this is related to memory. I will try the latest version then.
Hi
So indeed this is not expected behavior, and I think we should try to find the cause. The node type should not matter much here; what matters is the memory limit on the pod itself. Did you try increasing that as well? Basically, I want to find out whether Skipper just needs a bit more memory and will then work fine, or whether Skipper contains a bug where it keeps allocating memory. E.g. if you set the limit to 1G, 2G, 3G and it still crashes, it's definitely a bug in Skipper (which may or may not already be fixed).
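The experiment described above would be done in the resources section of the skipper container in the daemonset, roughly like this (the values are placeholders to step through, not a recommendation):

```yaml
# Placeholder values for the experiment above: raise the limit
# stepwise (e.g. 1Gi, 2Gi, 3Gi) and watch whether the skipper
# pod still gets OOMKilled at each step.
resources:
  requests:
    memory: 512Mi
  limits:
    memory: 2Gi
```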
Hi @LEDfan ,
I was able to find and fix the problem. Skipper ingress does not yet support the new ingressClassName syntax. I have raised a feature request (https://github.com/zalando/skipper/issues/1976) for including it. For now, I am using the ingress class annotation on K8s 1.21.5 and it's working fine.
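For anyone hitting the same problem, the workaround is to fall back to the legacy class annotation instead of the spec.ingressClassName field. A rough sketch, in which the ingress name and backend are illustrative (the host is taken from the logs above) and which assumes skipper honors the kubernetes.io/ingress.class annotation as described in zalando/skipper#1976:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: shinyproxy-ingress            # illustrative name
  annotations:
    # legacy class annotation used instead of spec.ingressClassName,
    # which skipper did not yet support at the time of this thread
    kubernetes.io/ingress.class: skipper
spec:
  rules:
  - host: shinyproxy-itx-age.xyz.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: shinyproxy          # illustrative backend service
            port:
              number: 8080
```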
Hello,
so there is a workaround for using the latest ingress syntax. The same is mentioned here: https://github.com/zalando/skipper/issues/1976
Hi Team,
I have recently upgraded my nginx ingress controller to 1.1.2, which fixes the validation webhook issue. But now, when I try to open the application, skipper ingress gets terminated with reason OOMKilled, its endpoint is deleted, and nginx returns 503/504 errors. After some time skipper starts running again, but when I try to access my application the same events repeat. This was not happening with the old nginx ingress controller. Can you please let me know what is going on here?
Below are the logs of skipper ingress:
[APP]time="2022-03-05T11:03:09Z" level=info msg="Expose metrics in codahale format"
[APP]time="2022-03-05T11:03:09Z" level=info msg="register sigterm handler"
[APP]time="2022-03-05T11:03:09Z" level=info msg="Updated secret file: /var/run/secrets/kubernetes.io/serviceaccount/token"
[APP]time="2022-03-05T11:03:09Z" level=info msg="Run secrets path refresher every 1m0s, but update once first"
[APP]time="2022-03-05T11:03:09Z" level=info msg="debug listener on :9922"
[APP]time="2022-03-05T11:03:09Z" level=info msg="Updated secret file: /var/run/secrets/kubernetes.io/serviceaccount/token"
[APP]time="2022-03-05T11:03:09Z" level=info msg="support listener on :9911"
[APP]time="2022-03-05T11:03:09Z" level=warning msg="RouteGroups CRD is not installed in the cluster.\nSee: https://opensource.zalando.com/skipper/kubernetes/routegroups/#installation"
[APP]time="2022-03-05T11:03:09Z" level=info msg="route settings, reset, route: kube_argocdargocd_app_itx_age_serverargocd_itx_age_xyz_com_argocd_app_itx_age_server: Host(/^argocd-itx-age[.]xyz[.]com$/) && PathSubtree(\"/\") -> \"http://100.64.179.131:8080\""
[APP]time="2022-03-05T11:03:09Z" level=info msg="route settings, reset, route: kube_kube_systemngingx_to_skipper_ingressshinyproxy_itx_age_xyz_com_skipper_ingress: Host(/^shinyproxy-itx-age[.]xyz[.]com$/) && PathSubtree(\"/\") -> status(502) -> inlineContent(\"no endpoints\") ->"
[APP]time="2022-03-05T11:03:09Z" level=info msg="route settings, reset, route: kube_shinyproxy sp_shinyproxy_hpc_ing_b3cf234ea2fdeff89bf8a673d5766b91d355ebb6shinyproxy_itx_age_xyz_com__sp_shinyproxy_hpc_svc_b3cf234ea2fdeff89bf8a673d5766b91d355ebb6: Host(/^shinyproxy-itx-age[.]xyz[.]com$/) && True() -> setRequestHeader(\"X-ShinyProxy-Instance\", \"b3cf234ea2fdeff89bf8a673d5766b91d355ebb6\") -> setRequestHeader(\"X-ShinyProxy-Latest-Instance\", \"b3cf234ea2fdeff89bf8a673d5766b91d355ebb6\") -> appendResponseHeader(\"Set-Cookie\", \"sp-instance=b3cf234ea2fdeff89bf8a673d5766b91d355ebb6; Secure; Path=/\") -> appendResponseHeader(\"Set-Cookie\", \"sp-latest-instance=b3cf234ea2fdeff89bf8a673d5766b91d355ebb6; Secure; Path=/\") -> \"http://100.64.242.101:8080\""
[APP]time="2022-03-05T11:03:09Z" level=info msg="route settings, reset, route: kubehealthz: Path(\"/kube-system/healthz\") && SourceFromLast(\"10.0.0.0/8\", \"192.168.0.0/16\", \"172.16.0.0/12\", \"127.0.0.1/8\", \"fd00::/8\", \"::1/128\", \"100.0.0.0/8\") -> status(200) -> disableAccessLog(200) -> "
[APP]time="2022-03-05T11:03:09Z" level=info msg="route settings received"
[APP]time="2022-03-05T11:03:09Z" level=info msg="route settings applied"
[APP]time="2022-03-05T11:03:09Z" level=info msg="proxy listener on :9999"
[APP]time="2022-03-05T11:03:09Z" level=info msg="TLS settings not found, defaulting to HTTP"
[APP]time="2022-03-05T11:04:09Z" level=info msg="Updated secret file: /var/run/secrets/kubernetes.io/serviceaccount/token"
[APP]time="2022-03-05T11:04:13Z" level=info msg="diff taken, inserts/updates: 1, deletes: 0"
[APP]time="2022-03-05T11:04:13Z" level=info msg="route settings, update, route: kube_kube_system ngingx_to_skipper_ingressshinyproxy_itx_age_xyz_com_____skipper_ingress: Host(/^shinyproxy-itx-age[.]xyz[.]com$/) && PathSubtree(\"/\") -> \"http://100.64.146.32:9999\""
[APP]time="2022-03-05T11:04:13Z" level=info msg="route settings received"
[APP]time="2022-03-05T11:04:13Z" level=info msg="route settings applied"
[APP]time="2022-03-05T11:05:09Z" level=info msg="Updated secret file: /var/run/secrets/kubernetes.io/serviceaccount/token"
[APP]time="2022-03-05T11:06:09Z" level=info msg="Updated secret file: /var/run/secrets/kubernetes.io/serviceaccount/token"
[APP]time="2022-03-05T11:07:09Z" level=info msg="Updated secret file: /var/run/secrets/kubernetes.io/serviceaccount/token"
[APP]time="2022-03-05T11:08:09Z" level=info msg="Updated secret file: /var/run/secrets/kubernetes.io/serviceaccount/token"
The logs from the nginx ingress controller:
2022/03/05 11:00:34 [error] 4912#4912: 470546 upstream prematurely closed connection while reading response header from upstream, client: 100.64.171.163, server: shinyproxy-itx-age.xyz.com, request: "GET / HTTP/2.0", upstream: "http://100.64.146.32:9999/", host: "shinyproxy-itx-age.xyz.com"
2022/03/05 11:00:34 [warn] 4912#4912: 470546 [lua] sticky.lua:189: balance(): failed to get new upstream; using upstream nil while connecting to upstream, client: 100.64.171.163, server: shinyproxy-itx-age.xyz.com, request: "GET / HTTP/2.0", upstream: "http://100.64.146.32:9999/", host: "shinyproxy-itx-age.xyz.com"
2022/03/05 11:00:34 [warn] 4912#4912: 470546 [lua] balancer.lua:335: balance(): no peer was returned, balancer: sticky_balanced while connecting to upstream, client: 100.64.171.163, server: shinyproxy-itx-age.xyz.com, request: "GET / HTTP/2.0", upstream: "http://100.64.146.32:9999/", host: "shinyproxy-itx-age.xyz.com"
W0305 11:00:35.201227 6 controller.go:1094] Service "kube-system/skipper-ingress" does not have any active Endpoint.
2022/03/05 11:00:39 [warn] 4912#4912: 470546 upstream server temporarily disabled while connecting to upstream, client: 100.64.171.163, server: shinyproxy-itx-age.xyz.com, request: "GET / HTTP/2.0", upstream: "http://0.0.0.1:80/", host: "shinyproxy-itx-age.xyz.com"
2022/03/05 11:00:39 [error] 4912#4912: 470546 upstream timed out (110: Operation timed out) while connecting to upstream, client: 100.64.171.163, server: shinyproxy-itx-age.xyz.com, request: "GET / HTTP/2.0", upstream: "http://0.0.0.1:80/", host: "shinyproxy-itx-age.xyz.com"
100.64.171.163 - - [05/Mar/2022:11:00:39 +0000] "GET / HTTP/2.0" 504 562 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36" 1076 29.094 [kube-system-skipper-ingress-http] [] 100.64.146.32:9999, 0.0.0.1:80 0, 0 24.096, 5.000 502, 504 dfcedb8379d93581c0e841fa579c6a2b
100.64.171.163 - - [05/Mar/2022:11:01:32 +0000] "GET / HTTP/2.0" 503 592 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36" 90 0.000 [kube-system-skipper-ingress-http] [] - - - - 6204288c167241f54933c2928bbf487c
2022/03/05 11:02:54 [error] 4912#4912: 470546 upstream prematurely closed connection while reading response header from upstream, client: 100.64.171.163, server: shinyproxy-itx-age.xyz.com, request: "GET / HTTP/2.0", upstream: "http://100.64.146.32:9999/", host: "shinyproxy-itx-age.xyz.com"
2022/03/05 11:02:54 [error] 4912#4912: 470546 connect() failed (111: Connection refused) while connecting to upstream, client: 100.64.171.163, server: shinyproxy-itx-age.xyz.com, request: "GET / HTTP/2.0", upstream: "http://100.64.146.32:9999/", host: "shinyproxy-itx-age.xyz.com"
2022/03/05 11:02:54 [error] 4912#4912: 470546 connect() failed (111: Connection refused) while connecting to upstream, client: 100.64.171.163, server: shinyproxy-itx-age.xyz.com, request: "GET / HTTP/2.0", upstream: "http://100.64.146.32:9999/", host: "shinyproxy-itx-age.xyz.com"
100.64.171.163 - - [05/Mar/2022:11:02:54 +0000] "GET / HTTP/2.0" 502 552 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36" 27 23.528 [kube-system-skipper-ingress-http] [] 100.64.146.32:9999, 100.64.146.32:9999, 100.64.146.32:9999 0, 0, 0 23.528, 0.000, 0.000 502, 502, 502 76d0d4fdbea222a7d553bfbbfb9cb855
W0305 11:02:55.288677 6 controller.go:1094] Service "kube-system/skipper-ingress" does not have any active Endpoint.
Also, my skipper ingress manifests are below:
ingress.yaml
skipper-service.yaml
skipper daemonset.yaml