knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0

Impossible to make WebSockets work on Knative with Google Cloud Run on GKE #7933

Closed sneko closed 3 years ago

sneko commented 4 years ago

What version of Knative?

GKE cluster: v0.11.0-gke.9. According to https://cloud.google.com/run/docs/gke/release-notes, Knative should be at least v0.5.0.

Expected Behavior

Inbound WebSocket connections should work.

Actual Behavior

I always get this error when connecting:

WebSocket connection to 'ws://demo.XXXXXXX.net/ws' failed: Error during WebSocket handshake: Unexpected response code: 503

It simply does not work on my "Google Cloud Run on GKE" cluster.

I know that it cannot work on fully managed Cloud Run (which runs on Google's own infrastructure); that is certain and public information. But on "Cloud Run on GKE" it should work.

It has been mentioned by a Google person in https://github.com/ahmetb/cloud-run-faq/issues/33#issuecomment-508969801, and he pointed to an example (https://github.com/mchmarny/knative-ws-example), but I never managed to make it work 😒 --> I wonder whether it was ever actually tested with Cloud Run in the end?

Steps to Reproduce the Problem

To keep this issue clean I won't copy/paste all the content here; you can find all my attempts in https://github.com/mchmarny/knative-ws-example/issues/2. As you can see, I tried brand-new clusters, different regions, different versions...

Since there are no additional settings on my side, I wonder: do WebSockets work fine for you out of the box?

Thank you,

vagababov commented 4 years ago

There are e2e tests run constantly for Cloud Run, so in general WebSockets work.

/cc @tcnghia

@sneko A 503 might mean many things: problems with the ingress setup, or internal Knative/Cloud Run issues.

When you deploy your service, does it become ready? Can you paste the output of kubectl get ksvc <svc name> -oyaml here? And what is the exact error text that comes with the 503?

ZhiminXiang commented 4 years ago
  1. Did you use Cloud Run on GKE, or did you install Knative yourself on a GKE cluster?
  2. If you use Cloud Run on GKE, could you check the log of the istio-proxy container of pod istio-ingress-XXXX in the gke-system namespace and see if there is any related error log?
  3. If you installed Knative yourself, could you check the log of the istio-proxy container of pod istio-ingressgateway-XXX in the istio-system namespace and see if there is any related error log?
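
A minimal sketch of those two log checks (the label selectors are assumptions on my part; confirm the actual pod names and labels with kubectl get pods first):

```shell
# Cloud Run on GKE: ingress proxy logs (label assumed; verify with
# `kubectl get pods -n gke-system --show-labels`)
kubectl logs -n gke-system -c istio-proxy -l app=istio-ingress --tail=200

# Self-installed Knative: ingress gateway proxy logs (label assumed; verify with
# `kubectl get pods -n istio-system --show-labels`)
kubectl logs -n istio-system -c istio-proxy -l app=istio-ingressgateway --tail=200
```

Grepping that output for the failing request path (here /ws) usually surfaces the Envoy response flag for the 503.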
sneko commented 4 years ago

@vagababov I have no more detail about the error; maybe I'm not looking in the right place? The error I mentioned shows up in the browser console and also in the Chrome network tab (screenshot omitted).

Yes, the service is ready and I'm able to access the example UI; it's just the WebSocket connection that cannot be established. Here are the service details:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"serving.knative.dev/v1alpha1","kind":"Service","metadata":{"annotations":{},"name":"kws","namespace":"demo"},"spec":{"runLatest":{"configuration":{"revisionTemplate":{"spec":{"container":{"env":[{"name":"RELEASE","value":"v0.1.6"},{"name":"KNOWN_PUBLISHER_TOKEN","valueFrom":{"secretKeyRef":{"key":"KNOWN_PUBLISHER_TOKEN","name":"kws"}}}],"image":"gcr.io/knative-samples/kws:latest","imagePullPolicy":"Always"}}}}}}}
    serving.knative.dev/creator: XXXX@YYY.com
    serving.knative.dev/lastModifier: XXXX@YYY.com
  creationTimestamp: "2020-05-11T18:08:29Z"
  generation: 1
  name: kws
  namespace: demo
  resourceVersion: "12486911"
  selfLink: /apis/serving.knative.dev/v1/namespaces/demo/services/kws
  uid: ad7eb598-a29f-4b10-9945-52ebb46d4d5f
spec:
  template:
    metadata:
      creationTimestamp: null
    spec:
      containerConcurrency: 0
      containers:
      - env:
        - name: RELEASE
          value: v0.1.6
        - name: KNOWN_PUBLISHER_TOKEN
          valueFrom:
            secretKeyRef:
              key: KNOWN_PUBLISHER_TOKEN
              name: kws
        image: gcr.io/knative-samples/kws:latest
        imagePullPolicy: Always
        name: user-container
        readinessProbe:
          successThreshold: 1
          tcpSocket:
            port: 0
        resources: {}
      timeoutSeconds: 300
  traffic:
  - latestRevision: true
    percent: 100
status:
  address:
    url: http://kws.demo.svc.cluster.local
  conditions:
  - lastTransitionTime: "2020-05-11T18:08:36Z"
    status: "True"
    type: ConfigurationsReady
  - lastTransitionTime: "2020-05-13T17:20:54Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2020-05-13T17:20:54Z"
    status: "True"
    type: RoutesReady
  latestCreatedRevisionName: kws-gn7sc
  latestReadyRevisionName: kws-gn7sc
  observedGeneration: 1
  traffic:
  - latestRevision: true
    percent: 100
    revisionName: kws-gn7sc
  url: http://kws.demo.example.com

@ZhiminXiang

  1. I use Cloud Run on GKE
  2. Here are the logs from when I load the UI and it then tries to connect to the WebSocket endpoint:
    {
      "httpRequest": {
        "requestMethod": "GET",
        "requestUrl": "http://kws.demo.example.com/",
        "requestSize": 0,
        "status": 200,
        "responseSize": 1927,
        "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36",
        "remoteIp": "10.52.3.1",
        "serverIp": "10.52.3.3:80",
        "protocol": "HTTP/1.1"
      },
      "upstream_cluster": "outbound|80||cluster-local-gateway.gke-system.svc.cluster.local",
      "response_flag": "-",
      "logging.googleapis.com/trace": "a55416139a3006a678eada577bed395a",
      "latencyMs": 4460
    }
    {
      "httpRequest": {
        "requestMethod": "GET",
        "requestUrl": "http://kws.demo.example.com/static/img/favicon.ico",
        "requestSize": 0,
        "status": 200,
        "responseSize": 1150,
        "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36",
        "remoteIp": "10.52.3.1",
        "serverIp": "10.52.3.3:80",
        "protocol": "HTTP/1.1"
      },
      "upstream_cluster": "outbound|80||cluster-local-gateway.gke-system.svc.cluster.local",
      "response_flag": "-",
      "logging.googleapis.com/trace": "74111fd7a12dd61f03c020eb0ee08557",
      "latencyMs": 5
    }
    {
      "httpRequest": {
        "requestMethod": "GET",
        "requestUrl": "http://kws.demo.example.com/ws",
        "requestSize": 0,
        "status": 503,
        "responseSize": 95,
        "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36",
        "remoteIp": "10.52.3.1",
        "serverIp": "10.52.3.3:80",
        "protocol": "HTTP/1.1"
      },
      "upstream_cluster": "outbound|80||cluster-local-gateway.gke-system.svc.cluster.local",
      "response_flag": "UC",
      "logging.googleapis.com/trace": "3df621b14aeb0314e57b995a1021f898",
      "latencyMs": 1002
    }

When reaching the /ws endpoint we can see the 503 status code, with response_flag "UC". Does that help? :s

I'm a bit surprised because, as I mentioned in https://github.com/mchmarny/knative-ws-example/issues/2, both my own WebSocket server and the one from the repo example don't work. Have you already tried it in the past?

Thank you for taking the time to help 👍,

vagababov commented 4 years ago

Something tells me that the 503 comes from the app. Can you add logging to your /ws handler, logging something before the WebSocket upgrade, to see whether the request reaches the handler at all? If it does, then the problem is within the app.
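
Independently of app-side logging, a raw handshake probe from the command line can help localize the 503. This is only a sketch: the hostname placeholder is reused from the report above, and the interpretation of the response is a rule of thumb, not a guarantee:

```shell
# Hand-rolled WebSocket handshake against the failing endpoint.
# A 503 with `server: istio-envoy` and an empty/Envoy-style body points at
# the mesh/ingress; an app-specific error body points at the service itself.
curl -sv \
  -H "Connection: Upgrade" \
  -H "Upgrade: websocket" \
  -H "Sec-WebSocket-Version: 13" \
  -H "Sec-WebSocket-Key: $(openssl rand -base64 16)" \
  http://demo.XXXXXXX.net/ws
```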

ZhiminXiang commented 4 years ago

/area networking

ZhiminXiang commented 4 years ago

/assign @ZhiminXiang

sneko commented 4 years ago

@vagababov, the https://github.com/mchmarny/knative-ws-example sample has some broken dependencies. I tried to update the code/deps to make it work with the additional logging you asked for, but I haven't succeeded yet 😒

Sorry for not being able to provide more information for now. But since you think the 503 comes from the app:

I will do my best to investigate the applications... but I really doubt the problem comes from there.

Could you please give the known sample a try? Since a Docker image is already built, you can deploy it within 30 seconds.

Otherwise, could you point me to some WebSocket samples that work on Google Cloud Run on GKE?

I'm totally lost: the same WebSocket servers work outside Cloud Run on GKE and stop working once on Cloud Run on GKE, even though these are brand-new clusters 😞

Thank you,

sneko commented 4 years ago

@vagababov, just to add another piece of information:

in https://github.com/mchmarny/knative-ws-example/issues/2 I described that I was able to use my own GraphQL API (with WebSockets) by port-forwarding directly to the pod within Google Cloud Run on GKE.

Following up on your thinking that the problem comes from the app, I tried the same with the known sample by running: kubectl port-forward -n demo kws-gn7sc-deployment-6c895f86fb-k9k7g 8050:8080

Then, when I open the UI at http://localhost:8050 and run:

curl -H "Content-Type: text/plain" \
     -H "ce-specversion: 0.2" \
     -H "ce-type: github.com.mchmarny.knative-ws-example.message" \
     -H "ce-source: https://github.com/mchmarny/knative-ws-example" \
     -H "ce-id: $(uuidgen)" \
     -H "ce-time: $(date +%Y-%m-%dT%H:%M:%S:%Z)" \
     -H "ce-token: $KNOWN_PUBLISHER_TOKEN" \
     -X POST --data "My sample message" \
     http://localhost:8050/v1/event

I can see the message appear on the UI in real time. So I can confirm that the WS sample application, even though it's a year old, still works great.

So, the problem "should" come from either Istio, Knative, or the cluster settings (but for the latter, I tried brand new clusters).

Thank you,

vagababov commented 4 years ago

We have integration tests: https://github.com/knative/serving/blob/master/test/e2e/websocket_test.go. Would you care to run them against your cluster?

Otherwise I will need to find some time to try Mark's example.

sneko commented 4 years ago

I'm definitely willing to try it, but I ran into some trouble while testing:

  1. I upgraded Go to v1.14 and ran go mod tidy.
  2. I read https://github.com/knative/serving/blob/master/test/README.md#running-end-to-end-tests and https://github.com/knative/serving/blob/master/test/README.md#environment-requirements to check whether anything important was needed.
  3. I ran go get -u k8s.io/test-infra/kubetest as described (even though I'm not sure it applies to me, since I just want to run a specific test, not the whole e2e-tests.sh script).
  4. I ran go test websocket_test.go -v but I get the following:
    # command-line-arguments [command-line-arguments.test]
    ./websocket_test.go:50:15: undefined: connect
    ./websocket_test.go:86:17: undefined: connect
    ./websocket_test.go:129:13: undefined: Setup
    ./websocket_test.go:156:13: undefined: Setup
    ./websocket_test.go:177:12: undefined: waitForActivatorEndpoints
    FAIL    command-line-arguments [build failed]
    FAIL

Do you know what is missing?

On the other side, I configured my kubectl CLI to target the desired running cluster; is there anything special to do in addition?

Thank you,

EDIT: oh, those functions are just in the same package but in a different file; here is the command that makes it work: go test e2e.go websocket.go websocket_test.go -v

vagababov commented 4 years ago

or you can just do go test ./test/e2e/. -run=TestWebsocket -v -tags="e2e" from the root.

vagababov commented 4 years ago

You also need to upload the test images using ./test/upload-test-images.sh (though you might want to delete the images you don't need afterwards).

sneko commented 4 years ago

Thank you for pointing about uploading test images.

When running the test command go test -v -tags=e2e -count=1 ./test/e2e -run ^TestWebSocket$ I get:

2020/05/22 21:29:59 Using '1590175799322706000' to seed the random number generator
=== RUN   TestWebSocket
=== PAUSE TestWebSocket
=== CONT  TestWebSocket
    TestWebSocket: service.go:98: Creating a new Service service web-socket-qlhqzmgw
    TestWebSocket: crd.go:35:  resource {<nil> <nil> <*>{&TypeMeta{Kind:,APIVersion:,} &ObjectMeta{Name:web-socket-qlhqzmgw,GenerateName:,Namespace:,SelfLink:,UID:,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[]OwnerReference{},Finalizers:[],ClusterName:,ManagedFields:[]ManagedFieldsEntry{},} {{&ObjectMeta{Name:,GenerateName:,Namespace:,SelfLink:,UID:,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[]OwnerReference{},Finalizers:[],ClusterName:,ManagedFields:[]ManagedFieldsEntry{},}} {<nil>}} {{0 <nil>} { } {<nil> <nil> <nil>}}} <nil>}
    TestWebSocket: service.go:113: Waiting for Service to transition to Ready. service web-socket-qlhqzmgw
    TestWebSocket: service.go:118: Checking to ensure Service Status is populated for Ready service
    TestWebSocket: service.go:124: Getting latest objects Created by Service
    TestWebSocket: service.go:127: Successfully created Service web-socket-qlhqzmgw
    TestWebSocket: websocket.go:64: Connecting using websocket: url=ws://35.187.190.242/, host=web-socket-qlhqzmgw.serving-tests.example.com
    TestWebSocket: websocket.go:67: WebSocket connection established.
    TestWebSocket: websocket_test.go:142: Sending message "Hello, websocket" to server.
    TestWebSocket: websocket_test.go:142: Message sent.
    TestWebSocket: websocket_test.go:142: Received message "Hello, websocket" from echo server.
--- PASS: TestWebSocket (17.66s)
PASS
ok      knative.dev/serving/test/e2e    18.646s

Conclusion: yes, the e2e test works.

Since both Mark's example and my own WebSocket API get the 503 error without sharing any common server library, I'm wondering:

I'm trying to determine what the difference could be, hmmm...

On the other side, Mark answered in https://github.com/mchmarny/knative-ws-example/issues/2#issuecomment-632797229 that I should probably also test another example. I will take a look later tonight at what I can do with it 🤞

vagababov commented 4 years ago

Sounds good, please keep us posted. A 503 usually indicates some Istio programming problem, but maybe we have a bug that we haven't hit before :)

sneko commented 4 years ago

I just tested his example, and I'm also getting the 503 issue:

Error during WebSocket handshake: Unexpected response code: 503

So... I'm totally lost, haha! At least it seems the problem comes neither from Knative nor from the 3 applications tested.

If you could try the service.yaml from https://github.com/knative-sample/websocket-chat on a "Cloud Run on GKE" cluster, I would really appreciate it, just to know whether something is wrong on my side.

Note that to make my incoming traffic reach the cluster, I set everything up at https://console.cloud.google.com/run/domains using the option "Add service domain mapping" (not "Add cluster default domain"). Could this be the issue, hmmm?

Thank you,

ZhiminXiang commented 4 years ago

@sneko Could you try deleting your DomainMapping and testing the WebSocket again? I will try those two WebSocket examples over the weekend.

sneko commented 4 years ago

@ZhiminXiang, if I don't use the DomainMapping, how can I reach my services to test the WebSockets? Since I'm on Cloud Run on GKE (not the fully managed one), I don't get a generated URL by default to route to my services (whereas on fully managed Cloud Run, every service gets a unique Google URL like http://xxxxxxx.run.app even without any DomainMapping).

ZhiminXiang commented 4 years ago

@sneko you can use a domain like xip.io as the default custom domain for all your Knative Services. Those domains are publicly accessible. See the doc: https://cloud.google.com/run/docs/gke/default-domain

vagababov commented 4 years ago

It works for me:

>curl -v http://websocket-chat.default.<>.xip.io/ws
*   Trying <>:80...
* TCP_NODELAY set
* Connected to websocket-chat.default.<>.xip.io (<>) port 80 (#0)
> GET /ws HTTP/1.1
> Host: websocket-chat.default.<>.xip.io
> User-Agent: curl/7.68.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 400 Bad Request
< content-length: 12
< content-type: text/plain; charset=utf-8
< date: Fri, 22 May 2020 21:52:51 GMT
< sec-websocket-version: 13
< x-content-type-options: nosniff
< x-envoy-upstream-service-time: 2
< server: istio-envoy
<
Bad Request
* Connection #0 to host websocket-chat.default.<>xip.io left intact

Obviously it's a 400 since I just sent a regular HTTP request, but nonetheless the routing works.

sneko commented 4 years ago

Good news for me too @ZhiminXiang @vagababov, I just tried it! It works with the "Websocket Chat" example on a custom websocket-chat.demo.XXXXXXX.xip.io.

So the issue seems to come from "adding a service domain mapping".

Strange... hmmmm

EDIT: I can also confirm it works with Mark's sample (https://github.com/mchmarny/knative-ws-example). I guess it will also work with my own API then. I can serve my API from those custom URLs until the cause of the DomainMapping issue is found.

Thank you for investigating with me!

ZhiminXiang commented 4 years ago

Thanks @sneko @vagababov for the confirmation. I will investigate DomainMapping from Google side.

tcnghia commented 4 years ago

cc @julz

We may want to take a look at this for our vanity-domain work as well.

sneko commented 4 years ago

@ZhiminXiang any news on that?

On my side I have a new issue; I'm not sure whether it's related to the workaround we discussed above.

In short, before all the messages above I was using auto-TLS to generate certificates. But since I enabled reaching the cluster through custom URLs (${service}.${namespace}.domain.com) to work around the main issue, my services that use DomainMapping are no longer accessible through HTTPS.

I get this error in my browser:

Failed to load resource: net::ERR_CONNECTION_REFUSED
Uncaught (in promise) Error: Network error: Failed to fetch

When running kubectl get kcert --all-namespaces I saw all certificates as READY, so it didn't make any sense to me 😒. I tried restarting the Istio ingress but it didn't change anything.

I went to https://console.cloud.google.com/run/domains, tried deleting just one domain mapping and re-adding it, and all services became accessible through HTTPS again.

Except that... after 2-3 days, for no apparent reason, the services were once again unreachable through HTTPS. I had to repeat the described steps (delete/re-add) to make it work again on all services.

Since this started happening when I began using the custom URLs, I'm wondering whether you have already encountered it?

Thank you,

sneko commented 4 years ago

Note that this "potential conflict" put almost all my services in an unstable state with an infinite loader (screenshot omitted).

This means my CI/CD can no longer deploy new versions since the old ones are stuck. That was not happening before, and it makes my cluster totally unusable...

I don't understand what I'm doing wrong; I'm just using basic Cloud Run on GKE features 🤔

EDIT: The Kservices in the infinite loading state are READY=Unknown and Reason=IngressNotConfigured

EDIT2: after running kubectl logs controller-59b4bd959f-n5slf -n knative-serving,

it seems the relevant error is error: error roundtripping https://XXXX.YYYYY.domain.com:443: dial tcp 10.52.1.11:443: connect: connection refused. In my case I enabled the cluster domain mapping to bypass this thread's issue and use WebSockets for one specific deployment, but I didn't create DNS records for every possible service, just the one that needed the workaround. Is there any way to stop Knative probing every service under the cluster domain mapping?

ZhiminXiang commented 4 years ago

@sneko I was eventually able to figure out how to make WebSockets work with DomainMapping. To make it work, you need to apply the following YAML:

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: allowconnect-cluser-local-gateway
  namespace: gke-system
spec:
  workloadSelector:
    labels:
      app: cluster-local-gateway
  configPatches:
  - applyTo: NETWORK_FILTER
    match:
      listener:
        portNumber: 80
        filterChain:
          filter:
            name: "envoy.http_connection_manager"
    patch:
      operation: MERGE
      value:
        typed_config:
          "@type": "type.googleapis.com/envoy.config.filter.network.http_connection_manager.v2.HttpConnectionManager"
          http2_protocol_options:
            allow_connect: true

Here is the detailed explanation: https://stackoverflow.com/questions/63251231/google-cloud-run-custom-domains-do-not-work-with-web-sockets/63406952#63406952

ZhiminXiang commented 4 years ago

As for the issue in https://github.com/knative/serving/issues/7933#issuecomment-636706728 and https://github.com/knative/serving/issues/7933#issuecomment-636904467, I suspect it is not a problem with DomainMapping itself. Instead, it is a problem with the Knative Service the DomainMapping points to, as that Service was not in the ready state.

Could you try deleting and recreating your Knative Service? If the issue persists, you can try restarting your Istio pilot by following https://cloud.google.com/run/docs/gke/troubleshooting#services_report_status_of_ingressnotconfigured

brianschardt commented 4 years ago

@ZhiminXiang I get this error when adding that to the Cloud Run YAML file:

Cannot find field: workloadSelector in message google.cloud.run.v1.ServiceSpec

Where do I add the code you mentioned?

ZhiminXiang commented 4 years ago

@brianschardt you just need to apply the YAML separately, i.e. run the following command:

cat <<EOF | kubectl apply -f -
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: allowconnect-cluser-local-gateway
  namespace: gke-system
spec:
  workloadSelector:
    labels:
      app: cluster-local-gateway
  configPatches:
  - applyTo: NETWORK_FILTER
    match:
      listener:
        portNumber: 80
        filterChain:
          filter:
            name: "envoy.http_connection_manager"
    patch:
      operation: MERGE
      value:
        typed_config:
          "@type": "type.googleapis.com/envoy.config.filter.network.http_connection_manager.v2.HttpConnectionManager"
          http2_protocol_options:
            allow_connect: true
EOF

Let me know if it does not work.

ZhiminXiang commented 4 years ago

Note that the YAML above makes WebSockets work with custom domain mapping. If you don't use custom domain mapping, WebSockets work with the Knative Service directly.

brianschardt commented 4 years ago

@ZhiminXiang thanks, I just saw that in the documentation. I was hoping there was a way to do this from the GCP web portal. Do you know if there is? It sounds like there is not.

ZhiminXiang commented 4 years ago

@brianschardt currently there is no way from the web UI. In the future, we should make WebSockets work out of the box, without this manual step.

sneko commented 3 years ago

Hi @ZhiminXiang,

Sorry for the delay. Using your EnvoyFilter, I succeeded in reaching my APIs with WebSockets through the Google domain mappings! What a success, héhé ❤️

After migrating all my frontends, I will clean up the messy workaround (which used the default cluster domain just for WebSocket endpoints, causing SSL conflicts and so on...).

Thanks for taking the time on this issue; please keep us updated when Google officially ships this patch in their managed clusters 👍

ZhiminXiang commented 3 years ago

@sneko sure, I will post an update in this thread when the fix ships in Cloud Run. I am going to close this issue at this point as there is no action item left for it.

sneko commented 3 years ago

Hi @ZhiminXiang @vagababov ,

Since the GCP Slack #cloud-run channel is almost dead, I'm posting here, as it's closely related.

I just noticed that my WebSocket connections are killed automatically after 5 minutes, even though I'm on the GKE platform. After investigating, it comes from the request timeout that Cloud Run leaves at 300 seconds by default.

Since my WebSocket connections should remain active indefinitely (except for network failures, or microservice updates implying a forced restart), I would like to set a very high duration (like a year), but I noticed the following.

It's mentioned at the bottom of https://cloud.google.com/run/docs/configuring/request-timeout:

Note that timeouts greater than 15 minutes are a Beta feature and require to use the gcloud beta run command.

When trying 24h + 1 second, the CLI rejects the service deployment. Could you please explain why there is such a limitation on the GKE platform? I should be "master" of my routing settings, no?

Thank you,

vagababov commented 3 years ago

Newer Knative versions won't close the socket for 48 hours, I think, and the newest one won't close it at all. It'll take time to get to GKE, but I think 0.17 is there if you upgrade to the most recent clusters.

Now as far as the UI goes: I can file a bug internally to relax this.

sneko commented 3 years ago

Great news! When you say "won't close at all", what will be the default value? From Google's perspective, if "0" is specified, it falls back to their own default of 300.

Now as far as the UI goes: I can file a bug internally to relax this.

That would help, yeah! Otherwise it's impossible for team members to quickly manage revisions and redeploy, because of the form error (>900 seconds).

Thank you,

vagababov commented 3 years ago

So the default timeout is now a time-to-first-byte (TTFB) timeout; it'll still be enforced. In previous Knative versions the networking layer enforced the total request duration as well, but not anymore. So as long as you respond with headers within those 300s (or whatever you set the value to), the request should be able to last for the lifetime of the pods in the datapath (ingress, possibly activator, and user pod).

sneko commented 3 years ago

Now as far as the UI goes: I can file a bug internally to relax this.

@vagababov did you get any progress on this internally?

sneko commented 3 years ago

Newer Knative versions won't close the socket for 48 hours, I think, and the newest one won't close it at all. It'll take time to get to GKE, but I think 0.17 is there if you upgrade to the most recent clusters.

@vagababov I upgraded my GKE cluster to v1.18.15-gke.1500, but still, if I leave the default timeout at "300", my WebSocket is broken by the server (network layer) even when the connection was successfully established.

I also tried to go above the 24h timeout through gcloud beta run deploy as well as gcloud run deploy with --timeout +P365dT, but I get an error as before:

ERROR: (gcloud.beta.run.deploy) HTTPError 400: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"admission webhook \"validation.webhook.serving.knative.dev\" denied the request: validation failed: expected 0 \u003c= 31536000 \u003c= 86400: spec.template.spec.timeoutSeconds","reason":"BadRequest","code":400}

I would really like to get past this limitation because it causes headaches 😄

Thank you,

EDIT: running kubectl get namespace knative-serving -o 'go-template={{index .metadata.labels "serving.knative.dev/release"}}' says v0.18.0-gke.9. Maybe the latest Anthos version, v0.19, would help, but I don't know how to force the upgrade even though my GKE cluster seems compatible.
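
For anyone debugging the same limit, a quick way to compare the per-service timeout with the cluster-wide cap is to read both values directly (a sketch: the service name/namespace are the ones from this thread, and the config-defaults key is the one the patch below this comment targets):

```shell
# Per-service request timeout (service/namespace taken from this thread)
kubectl get ksvc kws -n demo \
  -o jsonpath='{.spec.template.spec.timeoutSeconds}'

# Cluster-wide maximum enforced by the Knative validation webhook
kubectl get cm config-defaults -n knative-serving \
  -o jsonpath='{.data.max-revision-timeout-seconds}'
```

If the first value exceeds the second, the deployment is rejected by validation.webhook.serving.knative.dev, which matches the HTTPError 400 above.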

vagababov commented 3 years ago

That's more of a question to @ZhiminXiang @JRBANCEL

ZhiminXiang commented 3 years ago

@sneko I think you can run the following command to extend the max timeout first:

kubectl patch cm config-defaults -n knative-serving -p '{"data":{"max-revision-timeout-seconds":"172800"}}'

And then restart the webhook to pick up the new max timeout.

Then you should be able to run the gcloud command to set a bigger timeout. Let me know if you still have the issue.

sneko commented 3 years ago

Thank you @ZhiminXiang, I will give it a try tomorrow. In the meantime, could you expand on "restarting the webhook" to reload the settings? Is there a pod to kill, or a specific command to run?

ZhiminXiang commented 3 years ago

Thank you @ZhiminXiang, I will give it a try tomorrow. In the meantime, could you expand on "restarting the webhook" to reload the settings? Is there a pod to kill, or a specific command to run?

yes, just kill the webhook pod.
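
Concretely, killing the webhook pod can look like this (the app=webhook label is an assumption; verify it with kubectl get pods -n knative-serving --show-labels first). The webhook Deployment recreates the pod, and the new pod reads the updated config-defaults:

```shell
# Delete the Knative webhook pod so its replacement picks up the new
# max-revision-timeout-seconds (label assumed; verify before running)
kubectl delete pod -n knative-serving -l app=webhook

# Watch until the replacement pod is Running
kubectl get pods -n knative-serving -w
```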

sneko commented 3 years ago

I just tried the command and it works: gcloud run deploy no longer complains, thank you!

Note that when I deleted the webhook pod, it got stuck in the "Terminating" state while the new pod was coming up. I didn't want to force deletion with no grace period... so I had to wait around 5 minutes. Is that normal? It's as if the pod doesn't shut down properly when it receives SIGTERM.

ZhiminXiang commented 3 years ago

For Cloud Run 0.19, the EnvoyFilter needs to be changed to the following version:

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: allowconnect-cluster-local-gateway-new
  namespace: gke-system
spec:
  workloadSelector:
    labels:
      istio: ingress-gke-system
  configPatches:
  - applyTo: NETWORK_FILTER
    match:
      listener:
        portNumber: 8081
        filterChain:
          filter:
            name: "envoy.http_connection_manager"
    patch:
      operation: MERGE
      value:
        typed_config:
          "@type": "type.googleapis.com/envoy.config.filter.network.http_connection_manager.v2.HttpConnectionManager"
          http2_protocol_options:
            allow_connect: true