kubernetes / ingress-nginx

Ingress NGINX Controller for Kubernetes
https://kubernetes.github.io/ingress-nginx/
Apache License 2.0

GRPC GOAWAY #11118

Open rcng6514 opened 8 months ago

rcng6514 commented 8 months ago

What happened:

We have an application with gRPC streams running on GKE, exposed via an ingress-nginx Ingress. Our use case is a long-lived gRPC stream between the gRPC server (running on GKE, Java implementation) and the client, with data sent every second for an indefinite period. To achieve this, the server never calls the OnCompleted method, so the stream is meant to stay open indefinitely. When the client calls the gRPC method, data transfer starts successfully and the stream runs fine for a while. However, after a few minutes (at irregular intervals) the connection is terminated with the error below:

UNAVAILABLE: Connection closed after GOAWAY. HTTP/2 error code: NO_ERROR

The time until this error is not fixed, but it usually occurs after around 5 minutes of successful data transfer between the client and the server (GKE). We have tried various properties and timeouts to increase the longevity of the streams (the annotations we attempted are attached below), but we haven't found anything conclusive.

Below is the Ingress configuration (including annotations) we are using:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/backend-protocol: GRPC
    nginx.ingress.kubernetes.io/client-body-timeout: "true"
    nginx.ingress.kubernetes.io/grpc-backend: "true"
    nginx.ingress.kubernetes.io/limit-connections: "1000"
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
    nginx.ingress.kubernetes.io/server-snippet: |
      keepalive_requests 10000;
      http2_max_requests 500000;
      keepalive_timeout 3600s;
      grpc_read_timeout 3600s;
      grpc_send_timeout 3600s;
      client_body_timeout 3600s;
    nginx.ingress.kubernetes.io/service-upstream: "true"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/upstream-vhost: example.example-app.svc.cluster.local
    example.io/contact: example@example.com
  creationTimestamp: "2024-02-01T18:13:19Z"
  generation: 3
  labels:
    application: example
  name: example
  namespace: example-app
  resourceVersion: "862855064"
  uid: fcef4054-e9db-4276-9ff0-e6e8b2d84301
spec:
  rules:
  - host: example.com
    http:
      paths:
      - backend:
          service:
            name: example-grpc
            port:
              number: 8980
        path: /
        pathType: Prefix
  tls:
  - hosts:
    - example.com
status:
  loadBalancer:
    ingress:
    - ip: 1.2.3.4

What you expected to happen:

Either for the annotations to be respected, or to learn that there is a misunderstanding on our side about how to make the above requirement possible.

We're not sure; we've exhausted all the avenues we know of.
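
One sanity check (a sketch; it assumes the controller runs in the ingress-nginx namespace and uses the pod name from the version output below) is to confirm the snippet values actually land in the rendered nginx.conf:

kubectl -n ingress-nginx exec ingress-nginx-controller-7f9bf47c9f-kl4sp -- \
  grep -E 'grpc_read_timeout|grpc_send_timeout|keepalive_requests' /etc/nginx/nginx.conf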

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):

ingress-nginx-controller-7f9bf47c9f-kl4sp:/etc/nginx$ nginx -version
nginx version: nginx/1.21.6

Kubernetes version (use kubectl version): v1.27.9-gke.1092000

Environment:

Then reference manifests.yaml as a resource in a Kustomization, with env-specific patches for naming, annotations and labels.

How to reproduce this issue:

Vanilla ingress-nginx install in Kubernetes, plus a simple gRPC app with a long-lived connection.

This happens across multiple applications on the cluster and across multiple environments, so it is not specific to a single instance. We have read and tried the following: https://kubernetes.github.io/ingress-nginx/examples/grpc/#notes-on-using-responserequest-streams. However, GOAWAYs continue to occur. Removing ingress-nginx and routing via NodePort instead, the application held the connection open for 24+ hours.
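
A minimal sketch of that NodePort bypass (the Service name and selector labels here are illustrative, not taken from our real manifests):

apiVersion: v1
kind: Service
metadata:
  name: example-grpc-nodeport    # hypothetical name
  namespace: example-app
spec:
  type: NodePort
  selector:
    app: example-grpc            # assumed to match the gRPC server pods
  ports:
  - protocol: TCP
    port: 8980
    targetPort: 8980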

Anything else we need to know:

k8s-ci-robot commented 8 months ago

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
longwuyuan commented 8 months ago

/remove-kind bug

/triage needs-information
/kind support

strongjz commented 8 months ago

What type of GCP load balancer are you using? Can you try port forwarding to the ingress service and see if the issue persists?

rcng6514 commented 8 months ago

> What type of GCP load balancer are you using? Can you try port forwarding to the ingress service and see if the issue persists?

It is a TCP (L4) load balancer. We have compared:

Client -> TCP L4 -> NGINX Ingress Controller -> App

with:

Client -> TCP L4 -> App

and in the second case the connection holds open for 24 hrs.

longwuyuan commented 8 months ago

@rcng6514 the comment from @strongjz is seeking info on

client --> [port-forward-to-svc-created-by-controller] --> app
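
With the standard cloud deploy manifests, that test would be roughly (service and namespace names assume the default install):

kubectl -n ingress-nginx port-forward svc/ingress-nginx-controller 8443:443

and then pointing the gRPC client at localhost:8443.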

rcng6514 commented 7 months ago

Morning, we tested with the TCP L4 LB removed and port-forwarded to the ingress controller's Kubernetes Service. We still observe the same behaviour: connections are issued a GOAWAY after 5-10 minutes.

  • Write real, practical, step-by-step instructions, including an example app image URL, that readers can copy/paste from and reproduce on a minikube or kind cluster

@longwuyuan this isn't straightforward to achieve, as the app contains IP that we'd need to strip from the image, which will take considerable time. We'll start the process, but since it will take a while we were hoping to at least start the conversation on this.

longwuyuan commented 7 months ago

@rcng6514 , thanks. Would you know if there is a chart on artifacthub.io or an image on hub.docker.com that can be used to reproduce?

github-actions[bot] commented 6 months ago

This is stale, but we won't close it automatically; just bear in mind that the maintainers may be busy with other tasks and will get to your issue as soon as possible. If you have any questions or want to prioritize this, please reach out on #ingress-nginx-dev on Kubernetes Slack.

rcng6514 commented 6 months ago

kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.9.5/deploy/static/provider/cloud/deploy.yaml

---
apiVersion: v1
kind: Namespace
metadata:
  name: grpc
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grpc
  namespace: grpc
  labels:
    app: grpc
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grpc
  template:
    metadata:
      labels:
        app: grpc
    spec:
      containers:
      - name: grpc
        image: docker.io/rcng1514/server
        ports:
        - containerPort: 8443
---
apiVersion: v1
kind: Service
metadata:
  name: grpc
  namespace: grpc
spec:
  selector:
    app: grpc
  ports:
    - protocol: TCP
      port: 8443
      targetPort: 8443
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grpc
  namespace: grpc
  annotations:
    # use the shared ingress-nginx
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
spec:
  ingressClassName: nginx
  rules:
  - host: example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: grpc
            port:
              number: 8443
  tls:
  - hosts:
    - example.com
    secretName: grpc
---
apiVersion: v1
kind: Secret
metadata:
  name: grpc
  namespace: grpc
type: kubernetes.io/tls
data:
  tls.crt: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURzVENDQXBtZ0F3SUJBZ0lVZFg2RlVLem5vUTZXT0dndmNCOW9jVm1xSHVvd0RRWUpLb1pJaHZjTkFRRUwKQlFBd2ZERUxNQWtHQTFVRUJoTUNWVk14RVRBUEJnTlZCQWdNQ0U1bGR5QlpiM0pyTVJFd0R3WURWUVFIREFoTwpaWGNnV1c5eWF6RVlNQllHQTFVRUNnd1BSWGhoYlhCc1pTQkRiMjF3WVc1NU1Rc3dDUVlEVlFRTERBSkpWREVnCk1CNEdDU3FHU0liM0RRRUpBUllSWVdSdGFXNUFaWGhoYlhCc1pTNWpiMjB3SGhjTk1qUXdOREF6TVRjeE5UVXkKV2hjTk1qVXdOREF6TVRjeE5UVXlXakI4TVFzd0NRWURWUVFHRXdKVlV6RVJNQThHQTFVRUNBd0lUbVYzSUZsdgpjbXN4RVRBUEJnTlZCQWNNQ0U1bGR5QlpiM0pyTVJnd0ZnWURWUVFLREE5RmVHRnRjR3hsSUVOdmJYQmhibmt4CkN6QUpCZ05WQkFzTUFrbFVNU0F3SGdZSktvWklodmNOQVFrQkZoRmhaRzFwYmtCbGVHRnRjR3hsTG1OdmJUQ0MKQVNJd0RRWUpLb1pJaHZjTkFRRUJCUUFEZ2dFUEFEQ0NBUW9DZ2dFQkFMbWhELzFqYzlTTG1oVjhtdjE2RDN6aApaTmxYdzlZdndIOWJIaitpV3Y5NDFsbTZqK0NVM3dNT3FpSloyZjJ6ZGtobk5uK3RmUTJhSXFNRlgrdm9zdkhZClNRV2lPMWRTc2EzQTJSZGJQd0V5QzV3bHh1ZUVtZE5vWWZtdHlzSkZSWk9nSkpYU01nelhOMGV3R2FJc1FEazgKZ29vZzV2cFN1OFdJbVNUMDJsUlZRV3FtZklzMSs1WFRhVjB0TXlmWFRFZVoxQXJ0cFZIdk5iekhLS3R4ZFZNQQp5dGx6K3U5Y1JZVzNGeTVoQS91VjFUTXdERzYrWWFVR0FJRzJidTk4TVg3Sk9iVWFtRUJDZnZTM2VWOWQzcXlpCnVualRlcnVlTi9KRmxhc2dpeVc2WmZyOWtobXBsUFl1d0Q5NkUyT2NHK2lzL0FUU1FEV2xJL2ZNYnFsWWJlY0MKQXdFQUFhTXJNQ2t3SndZRFZSMFJCQ0F3SG9JTFpYaGhiWEJzWlM1amIyMkNEM2QzZHk1bGVHRnRjR3hsTG1OdgpiVEFOQmdrcWhraUc5dzBCQVFzRkFBT0NBUUVBaVAwMTI0RFhvWnZnNmxsYW9GSEErWHdwZW1lbGFhaHpyTlVTCk1EbU10MWJ1U2ZKNkNmNkVTTlV1K1pISEI0OFFWd0pKWGxLaTJqVS9acHVvWDlqK0h6TmhnWHhNbEFJc2gyeUMKaTJubUFDOHcrU0hWOTgrRFJESW9YVHNDamRxSWRnSCtWelhzZjFWSkRmeUlhc1JsZGtGNmJDVUdsM0RjUGpkKwpId0VIN0NCZUo5d2lkQmxPRUdveWFDQW12WTJtd1huK216TnRSUXpCYTlWSEo2S1dvWmhCYjN3SXFnSFRTZ21FCnVVRHM2Sm4rcmVlU2FGajVYQTVuMVBKUjgvMXFKMEordk5rZ3IwZ3ZKa1gxT295OVY4YzJDNStLTkU3T3Q2NGQKekJJTTZrY29oMGZXQmIzdWZ1cUwrMU1qNU5HWXVPRFdxdVhSek15TkQwaE9HQWpwRVE9PQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
  tls.key: LS0tLS1CRUdJTiBQUklWQVRFIEtFWS0tLS0tCk1JSUV2UUlCQURBTkJna3Foa2lHOXcwQkFRRUZBQVNDQktjd2dnU2pBZ0VBQW9JQkFRQzVvUS85WTNQVWk1b1YKZkpyOWVnOTg0V1RaVjhQV0w4Qi9XeDQvb2xyL2VOWlp1by9nbE44RERxb2lXZG45czNaSVp6Wi9yWDBObWlLagpCVi9yNkxMeDJFa0ZvanRYVXJHdHdOa1hXejhCTWd1Y0pjYm5oSm5UYUdINXJjckNSVVdUb0NTVjBqSU0xemRICnNCbWlMRUE1UElLS0lPYjZVcnZGaUprazlOcFVWVUZxcG55TE5mdVYwMmxkTFRNbjEweEhtZFFLN2FWUjd6VzgKeHlpcmNYVlRBTXJaYy9ydlhFV0Z0eGN1WVFQN2xkVXpNQXh1dm1HbEJnQ0J0bTd2ZkRGK3lUbTFHcGhBUW43MAp0M2xmWGQ2c29ycDQwM3E3bmpmeVJaV3JJSXNsdW1YNi9aSVpxWlQyTHNBL2VoTmpuQnZvclB3RTBrQTFwU1AzCnpHNnBXRzNuQWdNQkFBRUNnZ0VBTWkweU9Fa1F2MHc1QzBQU1ZXQVFIYTZEWnlpTkhERnVORDY2RDNOZ2E1d0wKUE5mc0drWERmbjBSU2hYRmtnbFhtTHlsZzUrdXBPV2NKVHJIc2VvRnJNL005VVBrREhlaTVaZXlWdGpvVC9kcQpJZndvSnQ2MkFlbytTWkpMczNXc0YvcDZ5VEMzTExka0R2R3dEQ0V2L3dpM05JVXVTazNneWNWaHVCYWppWlhICnplSHZGM0dVRFlFcGNuMzVXcG9FV3hyUkVUSjFXUVN4NFVveVlZeUptSHBDUlNYSklna05jTHU1Y1dmKzY4c0YKME94K05JajJqQ3N3SjNScS9PaGlEMXRMcTdRT1pweDAxM1NLSUIrT0YrNjZTL3F4eDIzeTh2Vm5nRkZQWEVMNwo3YkJzcXA1VXZwVy9XK2RPWVhrNWp6QXl1Ty9uMGZNU3dqNi9CeC9KMlFLQmdRRHNIQU5NTmhkZyt5N1kwa285CkdmSW5MeWdXVFNKNjUzQVNGU3pNTm10eVYwQlh5RGNaK3pnQXpPOWw2eUpnNmJRQ1dQRWc4amZ4R2dFSnBxUncKS3JVTTdhTFREUUFWcHBmRDRaQklmY1hzdmJwc3EwUmptMW9mN0hVKzF1MzJHY3J1YVhMYWxpekhuMUg0UzYyZAowUXZjQVIyYUtEQW9DaWR4SUZuNnhQMTdGUUtCZ1FESlJHR0p3N1FmSG1zdUpvU2dobGJOaVBGdGtSNHQwYzV5CnNBYmRQNVd5TjJTc3RPVVpYNEtaRDZTUE50TXNwWTdaK2tkOHlvZUZzb3Y1d0VqRnpFbDkyV1puZXZvWVVWZHgKWStvVlpuWC9GMUNxZTAzR2NiT1QvQ2ZLU0QzNWFrdXcxN20wMnFDVDNtZnVTOFJWYkJKV1d2K1loelE5dnFJSQpYMlVqclJ5VUN3S0JnQUx3bGxuc2tuM3lvckt3YTV3M0pueTJhWmxkZklCclFVbjRXWVp4WndVVmNRZW14b2pjClIrWTZwd0J0M1ErMzJUWHVSWkpUY2I3ZXhBU0t2cUZtNXJveWUwU0ZkT3JRR0RPb0sxTzd2U3NsY1p6SXhTRTQKWGZibnlzM3RmeWtCU1RXT3VvOWVMMUNNKzBoTUtPMCtIUmV3Szk0dmdlbjl0bUFDTnh5WU4wL0JBb0dCQUxKRApESmoyYTB5OHBuV2p6QWhacy93cmRKcDAwK1FGVmZNaWtaSFl4WCtwckZPRGpQN2lKMHZtSFB4enRLcHdvSXZVCkx3a0tZT283NzlwdlFvVmVvU0VFTXIwb29PWjA5UndMUU1OZmt0Y3pFVkZPRU43WXloTWlYU08reEpWcVhrdnQKWmlBWEcrNmNLRFZaaWpXV21NOC9uZTY4b2JxbVkrRkNqTlFDZWJOdEFvR0FWdFM1SkY3VkhvSHB5RWJjSWY1UgpiR25Ud0RxTnFrMjJwRjR3YkdrYlVXQXFVTWNjc011WFcxZ3pKOFUvSm1lbFJSKzBJb0x5TWdaOXBBaFdLd3hjCmQySXdJSXhXTDI4RlNNREJwZ0VWQmNyZk1vMnBrZHAwZEpzaTBEbm11Q25ocE9LVktycVptcE1IWjRmVVRjRHgKUHpEajB0K0hvRnI4VVR0VEVyZEpSTmc9Ci0tLS0tRU5EIFBSSVZBVEUgS0VZLS0tLS0K
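
To drive a test client at this, the controller's external IP can be looked up (assuming the default install from the deploy.yaml above):

kubectl -n ingress-nginx get svc ingress-nginx-controller

and example.com pointed at that IP (for example via an /etc/hosts entry); the client then connects over TLS on port 443.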

I'm observing a GOAWAY after ~5-10 minutes, with shorter periods depending on NGINX load.

rcng6514 commented 6 months ago

...
Hello Client recieved at time 2024-04-05 08:10:04+00:00
Hello Client recieved at time 2024-04-05 08:10:09+00:00
Exception in thread "main" io.grpc.StatusRuntimeException: UNAVAILABLE: Connection closed after GOAWAY. HTTP/2 error code: NO_ERROR
        at io.grpc.Status.asRuntimeException(Status.java:535)
        at io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:648)
        at io.grpc.examples.helloworld.TestClient.main(TestClient.java:21)

longwuyuan commented 6 months ago

@rcng6514 thanks

github-actions[bot] commented 5 months ago

This is stale, but we won't close it automatically; just bear in mind that the maintainers may be busy with other tasks and will get to your issue as soon as possible. If you have any questions or want to prioritize this, please reach out on #ingress-nginx-dev on Kubernetes Slack.

rcng6514 commented 5 months ago

Apologies for the delay; we've since upgraded the controller, so we're now running v1.10.1. We still seem to be hitting this even with the timeouts set high:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: GRPC
    nginx.ingress.kubernetes.io/configuration-snippet: |
      grpc_read_timeout 3600s;
      grpc_send_timeout 3600s;
    nginx.ingress.kubernetes.io/limit-connections: "1000"
    nginx.ingress.kubernetes.io/service-upstream: "true"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/upstream-vhost: hellogrpc.example.svc.cluster.local
  labels:
    app.kubernetes.io/name: hellogrpc
    application: example
  name: hellogrpc
  namespace: example
spec:
  ingressClassName: nginx
  rules:
  - host: example.com
    http:
      paths:
      - backend:
          service:
            name: hellogrpc
            port:
              number: 8443
        path: /
        pathType: Prefix
  tls:
  - hosts:
    - example.com
    secretName: example-tls

This only seems to happen on some of our more heavily utilised GKE clusters. The client receives responses consistently for ~5 minutes, followed by:


{
  "message": "Hello ",
  "dateTime": "2024-05-23 12:41:12+00:00"
}
ERROR:
  Code: Unavailable
  Message: closing transport due to: connection error: desc = "error reading from server: EOF", received prior goaway: code: NO_ERROR
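
To correlate that with what the controller was doing at the time, one option (deployment name assumes the default install) is to pull the controller logs around the same window, e.g.:

kubectl -n ingress-nginx logs deploy/ingress-nginx-controller --since=15m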

longwuyuan commented 5 months ago

If you provide that information, then anyone can try to reproduce this on a minikube or a kind cluster.

In the Ingress YAML posted above, the use of annotations like limit-connections and upstream-vhost just adds unnecessary complications for testing a long-lived gRPC stream; I would not use them.
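
A stripped-down test Ingress along those lines, keeping only the gRPC essentials plus the stream timeouts from the docs example, might look like this (a sketch; host, namespace and secret names reused from the manifests posted earlier):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grpc
  namespace: grpc
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
    nginx.ingress.kubernetes.io/server-snippet: |
      client_body_timeout 3600s;
      grpc_read_timeout 3600s;
      grpc_send_timeout 3600s;
spec:
  ingressClassName: nginx
  rules:
  - host: example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: grpc
            port:
              number: 8443
  tls:
  - hosts:
    - example.com
    secretName: grpc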