nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org

HTTP2 12: Http2Session client: finishSessionClose socket end undefined Error [ERR_HTTP2_SESSION_ERROR]: Session closed with error code 2 #51585

Open mcaruso85 opened 8 months ago

mcaruso85 commented 8 months ago

Version

21.6.0

Platform

Docker Image: node:21.6.0-bullseye

Subsystem

Istio 1.19.4

What steps will reproduce the bug?

I have a basic gRPC JS client and a gRPC Python server. The gRPC client runs 60 iterations; in each iteration it fires 300 gRPC requests in parallel to the gRPC server. The server processing is pretty simple: it sleeps 5 seconds and responds with success. Both the gRPC client and the gRPC server are in the same namespace in the same EKS cluster (version 1.25). Both currently have an Istio sidecar (version 1.19.5).
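For reference, a minimal sketch of the kind of client loop described above. This is not the exact script from this report: the proto file name, request shape, and loading code are assumptions based on the /rpc.bookshop.v1.InventoryService/GetBookList path that appears in the Envoy logs further down.

```js
// Hypothetical repro sketch using @grpc/grpc-js and @grpc/proto-loader.
// Running with NODE_DEBUG=http2 GRPC_VERBOSITY=DEBUG GRPC_TRACE=all produces
// the "HTTP2 <pid>: ..." and "D ... | transport | ..." debug lines shown below.
const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');

// Assumed proto file defining package rpc.bookshop.v1 with InventoryService/GetBookList.
const packageDef = protoLoader.loadSync('bookshop.proto');
const proto = grpc.loadPackageDefinition(packageDef).rpc.bookshop.v1;

const client = new proto.InventoryService(
  'grpc-server.mcaruso-poc.svc.cluster.local:8081',
  grpc.credentials.createInsecure()
);

const getBookList = () =>
  new Promise((resolve, reject) => {
    client.GetBookList({}, (err, res) => (err ? reject(err) : resolve(res)));
  });

async function main() {
  for (let i = 1; i <= 60; i++) {
    // 300 parallel calls per iteration; the server sleeps ~5 s before replying.
    await Promise.all(Array.from({ length: 300 }, getBookList));
    console.log(`iteration ${i} done`);
  }
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```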

How often does it reproduce? Is there a required condition?

This always happens when the application runs inside a pod with an Envoy sidecar (Istio 1.19.5).

What is the expected behavior? Why is that the expected behavior?

The application should finish all 60 iterations, fire all the requests, and always get an OK response.

What do you see instead?

The script does not reach the 60 iterations. It starts completing iterations with all responses successful, but at approximately iteration 9 it fails with the following error:

2024-01-27T07:18:14.951034047Z stderr F HTTP2 18: Http2Session client: goaway 2 received [last stream id: 3999]
2024-01-27T07:18:14.951037006Z stderr F D 2024-01-27T07:18:14.866Z | v1.9.14 18 | transport | (3) 10.96.13.98:8081 connection closed by GOAWAY with code 2 and data undefined
2024-01-27T07:18:14.951041047Z stderr F D 2024-01-27T07:18:14.866Z | v1.9.14 18 | subchannel | (2) 10.96.13.98:8081 READY -> IDLE
2024-01-27T07:18:14.951043381Z stderr F HTTP2 18: Http2Session client: marking session closed
2024-01-27T07:18:14.951045672Z stderr F HTTP2 18: Http2Session client: submitting goaway
2024-01-27T07:18:14.951047214Z stderr F D 2024-01-27T07:18:14.866Z | v1.9.14 18 | subchannel_refcount | (2) 10.96.13.98:8081 refcount 2 -> 1
2024-01-27T07:18:14.951049006Z stderr F D 2024-01-27T07:18:14.866Z | v1.9.14 18 | pick_first | READY -> IDLE
2024-01-27T07:18:14.951050881Z stderr F D 2024-01-27T07:18:14.866Z | v1.9.14 18 | resolving_load_balancer | dns:grpc-server.mcaruso-poc.svc.cluster.local:8081 READY -> IDLE
2024-01-27T07:18:14.951052756Z stderr F D 2024-01-27T07:18:14.866Z | v1.9.14 18 | connectivity_state | (1) dns:grpc-server.mcaruso-poc.svc.cluster.local:8081 READY -> IDLE
2024-01-27T07:18:14.951054797Z stderr F D 2024-01-27T07:18:14.866Z | v1.9.14 18 | dns_resolver | resolution update delayed by "min time between resolutions" rate limit
2024-01-27T07:18:14.951056631Z stderr F D 2024-01-27T07:18:14.866Z | v1.9.14 18 | resolving_load_balancer | dns:grpc-server.mcaruso-poc.svc.cluster.local:8081 IDLE -> CONNECTING
2024-01-27T07:18:14.951058131Z stderr F D 2024-01-27T07:18:14.866Z | v1.9.14 18 | connectivity_state | (1) dns:grpc-server.mcaruso-poc.svc.cluster.local:8081 IDLE -> CONNECTING
2024-01-27T07:18:14.951059756Z stderr F HTTP2 18: Http2Session client: destroying
2024-01-27T07:18:14.951061297Z stderr F HTTP2 18: Http2Session client: start closing/destroying Error [ERR_HTTP2_SESSION_ERROR]: Session closed with error code 2
2024-01-27T07:18:14.951063089Z stderr F     at Http2Session.onGoawayData (node:internal/http2/core:690:21)
2024-01-27T07:18:14.951064881Z stderr F     at Http2Session.callbackTrampoline (node:internal/async_hooks:130:17) {
2024-01-27T07:18:14.951066714Z stderr F   code: 'ERR_HTTP2_SESSION_ERROR'
2024-01-27T07:18:14.951068631Z stderr F }
2024-01-27T07:18:14.951070172Z stderr F HTTP2 18: Http2Stream 2025 [Http2Session client]: destroying stream

2024-01-27T07:18:14.952236297Z stderr F HTTP2 18: Http2Session client: finishSessionClose
2024-01-27T07:18:14.952238089Z stderr F D 2024-01-27T07:18:14.875Z | v1.9.14 18 | subchannel_call | [7012] Node error event: message=Session closed with error code 2 code=ERR_HTTP2_SESSION_ERROR errno=Unknown system error undefined syscall=undefined

2024-01-27T07:18:14.964811172Z stderr F D 2024-01-27T07:18:14.901Z | v1.9.14 18 | resolving_call | [4999] ended with status: code=13 details="Received RST_STREAM with code 2 triggered by internal client error: Session closed with error code 2"
2024-01-27T07:18:14.964813089Z stderr F HTTP2 18: Http2Session client: finishSessionClose socket end null Error [ERR_HTTP2_SESSION_ERROR]: Session closed with error code 2
2024-01-27T07:18:14.964815214Z stderr F     at Http2Session.onGoawayData (node:internal/http2/core:690:21)
2024-01-27T07:18:14.964816714Z stderr F     at Http2Session.callbackTrampoline (node:internal/async_hooks:130:17) {
2024-01-27T07:18:14.964818381Z stderr F   code: 'ERR_HTTP2_SESSION_ERROR'
2024-01-27T07:18:14.964820172Z stderr F }
2024-01-27T07:18:14.964821631Z stderr F D 2024-01-27T07:18:14.934Z | v1.9.14 18 | transport | (3) 10.96.13.98:8081 connection closed with error Session closed with error code 2
2024-01-27T07:18:14.964823172Z stderr F D 2024-01-27T07:18:14.934Z | v1.9.14 18 | transport | (3) 10.96.13.98:8081 session closed
2024-01-27T07:18:14.980607506Z stdout F 2024-01-27T07:18:14Z scuttle: Received signal 'child exited', passing to child
2024-01-27T07:18:14.980790964Z stdout F 2024-01-27T07:18:14Z scuttle: Received signal 'urgent I/O condition', ignoring

Additional information

The timeline is the following. First, I see a lot of requests like this on the Envoy client side, where they are OK:

**2024-01-27T07:18:14.817417547Z** stdout F {"bytes_received":5,"connection_termination_details":null,"downstream_peer_uri_san":null,"upstream_local_address":"10.244.1.36:56188","method":"POST","x_envoy_external_address":null,"dd.trace_id":null,"downstream_local_address":"10.96.13.98:8081","upstream_host":"10.244.1.35:8081","requested_server_name":null,"start_time":"2024-01-27T07:18:09.537Z","path":"/rpc.bookshop.v1.InventoryService/GetBookList","upstream_cluster":"outbound|8081||grpc-server.mcaruso-poc.svc.cluster.local","request_id":"4d2ba9ed-e3dd-4cd2-8ef4-87369be87a99","protocol":"HTTP/2","downstream_remote_address":"10.244.1.36:34518","x_forwarded_client_cert":null,"route_name":"default","user_agent":"grpc-node-js/1.9.14","x_forwarded_proto":"http",**"response_code":200**,"upstream_service_time":"5215","bytes_sent":106,"duration":5269,"authority":"grpc-server.mcaruso-poc.svc.cluster.local:8081","response_flags":"-","x_forwarded_for":null,"upstream_transport_failure_reason":null,**"response_code_details":"via_upstream"**}

Then, at some point after a lot of successful requests like the one above, I see a GOAWAY sent by the Envoy client side:

**2024-01-27T07:18:14.817867Z** debug   envoy http2 external/envoy/source/common/http/http2/codec_impl.cc:1208  [Tags: "ConnectionId":"7"] sent goaway code=2   thread=25

And after that, I see a lot of requests on the Envoy client side with a codec error like these:

**2024-01-27T07:18:14.825341714Z** stdout F {"bytes_sent":0,**"response_flags":"DPE"**,"user_agent":"grpc-node-js/1.9.14","upstream_service_time":null,"protocol":"HTTP/2","authority":"grpc-server.mcaruso-poc.svc.cluster.local:8081","duration":5240,"upstream_host":"10.244.1.35:8081","x_envoy_external_address":null,"method":"POST","downstream_peer_uri_san":null,"bytes_received":5,"x_forwarded_for":null,"dd.trace_id":null,"downstream_local_address":"10.96.13.98:8081","start_time":"2024-01-27T07:18:09.577Z","upstream_local_address":"10.244.1.36:56188","requested_server_name":null,"connection_termination_details":null,**"response_code_details":"codec_error:The_user_callback_function_failed"**,"upstream_cluster":"outbound|8081||grpc-server.mcaruso-poc.svc.cluster.local","downstream_remote_address":"10.244.1.36:34518","upstream_transport_failure_reason":null,"path":"/rpc.bookshop.v1.InventoryService/GetBookList","response_code":0,"request_id":"23cf4d24-9caa-4d24-b3ca-a136b81ae46d","route_name":"default","x_forwarded_proto":"http","x_forwarded_client_cert":null}

And after that, the client app receives the GOAWAY and closes the connection (here are the app container logs):

**2024-01-27T07:18:14.951034047Z** stderr F HTTP2 18: Http2Session client: goaway 2 received [last stream id: 3999]
**2024-01-27T07:18:14.951037006Z** stderr F D 2024-01-27T07:18:14.866Z | v1.9.14 18 | transport | (3) 10.96.13.98:8081 connection closed by GOAWAY with code 2 and data undefined
**2024-01-27T07:18:14.951043381Z** stderr F HTTP2 18: Http2Session client: marking session closed
**2024-01-27T07:18:14.951045672Z** stderr F HTTP2 18: Http2Session client: submitting goaway
**2024-01-27T07:18:14.951059756Z** stderr F HTTP2 18: Http2Session client: destroying
**2024-01-27T07:18:14.951061297Z** stderr F HTTP2 18: Http2Session client: start closing/destroying Error [ERR_HTTP2_SESSION_ERROR]: Session closed with error code 2
**2024-01-27T07:18:14.951063089Z** stderr F     at Http2Session.onGoawayData (node:internal/http2/core:690:21)
**2024-01-27T07:18:14.951064881Z** stderr F     at Http2Session.callbackTrampoline (node:internal/async_hooks:130:17) {
**2024-01-27T07:18:14.951066714Z** stderr F   code: 'ERR_HTTP2_SESSION_ERROR'
**2024-01-27T07:18:14.951068631Z** stderr F }
**2024-01-27T07:18:14.951070172Z** stderr F HTTP2 18: Http2Stream 2025 [Http2Session client]: destroying stream
**2024-01-27T07:18:14.952236297Z** stderr F HTTP2 18: Http2Session client: finishSessionClose
**2024-01-27T07:18:14.952238089Z** stderr F D 2024-01-27T07:18:14.875Z | v1.9.14 18 | subchannel_call | [7012] Node error event: message=Session closed with error code 2 code=ERR_HTTP2_SESSION_ERROR errno=Unknown system error undefined syscall=undefined
**2024-01-27T07:18:14.964813089Z** stderr F HTTP2 18: Http2Session client: finishSessionClose socket end null Error [ERR_HTTP2_SESSION_ERROR]: Session closed with error code 2
**2024-01-27T07:18:14.964815214Z** stderr F     at Http2Session.onGoawayData (node:internal/http2/core:690:21)
**2024-01-27T07:18:14.964816714Z** stderr F     at Http2Session.callbackTrampoline (node:internal/async_hooks:130:17) {
**2024-01-27T07:18:14.964818381Z** stderr F   code: 'ERR_HTTP2_SESSION_ERROR'
**2024-01-27T07:18:14.964820172Z** stderr F }

The server logs don't show anything about a GOAWAY; this all happens inside the client pod, between Envoy and the app container.

When I search for this codec_error: The_user_callback_function_failed, I find this:

If you are encountering the codec_error: The_user_callback_function_failed message in the context of a Node.js application using gRPC and Envoy, it suggests that there is an issue with the callback function related to gRPC processing.

And I see the client Node app erroring in the callback function:

2024-01-27T07:18:14.951058131Z stderr F D 2024-01-27T07:18:14.866Z | v1.9.14 18 | connectivity_state | (1) dns:grpc-server.mcaruso-poc.svc.cluster.local:8081 IDLE -> CONNECTING
2024-01-27T07:18:14.951059756Z stderr F HTTP2 18: Http2Session client: destroying
2024-01-27T07:18:14.951061297Z stderr F HTTP2 18: Http2Session client: start closing/destroying Error [ERR_HTTP2_SESSION_ERROR]: Session closed with error code 2
2024-01-27T07:18:14.951063089Z stderr F     at Http2Session.onGoawayData (node:internal/http2/core:690:21)
2024-01-27T07:18:14.951064881Z stderr F     at **Http2Session.callbackTrampoline** (node:internal/async_hooks:130:17) {
2024-01-27T07:18:14.951066714Z stderr F   code: 'ERR_HTTP2_SESSION_ERROR'
2024-01-27T07:18:14.951068631Z stderr F }
bbit8 commented 7 months ago

The reason is this rate limiter, 50121. This is silly!

#define NGHTTP2_DEFAULT_STREAM_RESET_BURST 1000
#define NGHTTP2_DEFAULT_STREAM_RESET_RATE 33
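For what it's worth, per that change (nodejs/node#50121), newer Node releases appear to expose this nghttp2 reset-rate limit as `streamResetBurst` / `streamResetRate` options on the HTTP/2 server. A minimal sketch, assuming those options are available in your Node version; whether raising them actually helps with this client-side GOAWAY is an open question:

```js
// Hedged sketch: tune the nghttp2 stream-reset rate limit on a Node http2 server.
// The option names come from nodejs/node#50121; the values shown match the
// NGHTTP2_DEFAULT_STREAM_RESET_BURST / _RATE defaults quoted above.
const http2 = require('node:http2');

const server = http2.createServer({
  streamResetBurst: 1000, // burst allowance for incoming stream resets (RST_STREAM)
  streamResetRate: 33,    // sustained stream resets allowed per second
});

server.on('stream', (stream) => {
  stream.respond({ ':status': 200 });
  stream.end('ok');
});

server.listen(8081);
```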

mcaruso85 commented 7 months ago

@bbit8 do you know if there is any way to prevent this problem? I'm trying to understand what this limit is and why the client suddenly starts sending a lot of reset streams. I mean, if that limit is increased, it won't solve the problem; it will just allow more reset streams to be sent. Is that correct?

I'll add more info that might be helpful: I start to see all the RST_STREAMs after I see the error "codec_error:The_user_callback_function_failed" in the Envoy sidecar that runs alongside the client container.

2024-01-27T07:18:14.825341714Z stdout F {"bytes_sent":0,"response_flags":"DPE","user_agent":"grpc-node-js/1.9.14","upstream_service_time":null,"protocol":"HTTP/2","authority":"grpc-server.mcaruso-poc.svc.cluster.local:8081","duration":5240,"upstream_host":"10.244.1.35:8081","x_envoy_external_address":null,"method":"POST","downstream_peer_uri_san":null,"bytes_received":5,"x_forwarded_for":null,"dd.trace_id":null,"downstream_local_address":"10.96.13.98:8081","start_time":"2024-01-27T07:18:09.577Z","upstream_local_address":"10.244.1.36:56188","requested_server_name":null,"connection_termination_details":null,"response_code_details":"codec_error:The_user_callback_function_failed","upstream_cluster":"outbound|8081||grpc-server.mcaruso-poc.svc.cluster.local","downstream_remote_address":"10.244.1.36:34518","upstream_transport_failure_reason":null,"path":"/rpc.bookshop.v1.InventoryService/GetBookList","response_code":0,"request_id":"23cf4d24-9caa-4d24-b3ca-a136b81ae46d","route_name":"default","x_forwarded_proto":"http","x_forwarded_client_cert":null}

bbit8 commented 7 months ago

@mcaruso85 I just downgraded the Node.js version from 18.18.2 to 18.18.1. Problem solved!

mcaruso85 commented 7 months ago

@bbit8 that didn't work for me; I'm still facing the same problem.