grpc / grpc-node

gRPC for Node.js
https://grpc.io
Apache License 2.0

Receiving RST_STREAM with error code 2 while using grpc secure channel #1747

Open riyashah17 opened 3 years ago

riyashah17 commented 3 years ago

Problem description

We are using gRPC to communicate between services. For our production deployment, we switched to server authentication. We have a bidirectional streaming gRPC service: it expects audio chunks in a streaming fashion and sends transcripts back as streaming responses. After receiving the final response, we get this error: 13 INTERNAL: Received RST_STREAM with error code 2. We did not get this error with an insecure channel (no authentication), and we can't find any corresponding error on the server side.

Reproduction steps

Will try to include a repo soon.

Environment

Additional context

We are using an AWS gRPC load balancer in front of the server. The server is implemented in Python. We wrote a small piece of code to check whether this error occurs with a Python client, and we observed that it does not.

Node.js client code for initializing the gRPC client:

const grpc = require('@grpc/grpc-js');

// TLS credentials for server authentication, using the default root certificates
const ssl_creds = grpc.credentials.createSsl();
const client = new zapr_speech_proto.Speech(this.server_address, ssl_creds);
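
For illustration, a minimal sketch of the bidirectional streaming pattern described above; the recognize method name and the audio_content/transcript field names are placeholders, not the actual API:

const call = client.recognize(); // hypothetical bidi streaming method on the Speech service

call.on('data', (response) => console.log('transcript:', response.transcript));
call.on('error', (err) => console.error(err.code, err.details)); // the 13 INTERNAL error surfaces here
call.on('end', () => console.log('server finished streaming transcripts'));

for (const chunk of audioChunks) {
  call.write({ audio_content: chunk }); // stream audio chunks to the server
}
call.end(); // half-close: no more audio to send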
murgatroid99 commented 3 years ago

That error code indicates an "internal error", which can have a variety of causes. I suggest updating to a newer version of @grpc/grpc-js, which has more detailed information for that error, and then testing again.

riyashah17 commented 3 years ago

@murgatroid99 Hey, thanks for the prompt reply. I upgraded the @grpc/grpc-js package to version 1.2.12 and tested again, but I'm still getting the same error: Error: 13 INTERNAL: Received RST_STREAM with code 2 (Internal server error)

murgatroid99 commented 3 years ago

Unfortunately, that message is all of the information the client has about the error. It looks like the server, or possibly an intermediate proxy, is sending an HTTP/2-level error code 2, which indicates an "internal error".

Kupstas commented 3 years ago

@riyashah17 Our load balancer cuts the stream when no data has been sent for more than 1 minute. Maybe you have the same problem.

murgatroid99 commented 3 years ago

If that is the problem, it may help to change the keepalive time so that the client sends pings more frequently than once per minute, to keep the connection alive. That is controlled by the channel option "grpc.keepalive_time_ms".
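
For reference, a minimal sketch of how that channel option can be passed when constructing a @grpc/grpc-js client; the 30-second value is an illustrative assumption, not a recommendation from this thread:

const grpc = require('@grpc/grpc-js');

// Assumed value: ping every 30 seconds so a 60-second proxy idle timeout is never reached.
const channelOptions = { 'grpc.keepalive_time_ms': 30000 };

// Channel options are passed as the third argument of the generated client constructor;
// server_address is assumed to be defined elsewhere.
const client = new zapr_speech_proto.Speech(server_address, grpc.credentials.createSsl(), channelOptions);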

Kupstas commented 3 years ago

@murgatroid99 thanks for the response. I've tried a few keepalive time intervals, but it had no effect.

hamxabaig commented 2 years ago

We're getting the same issue with AWS ELB over HTTP/2. The keepalive settings in the attached image didn't help.

image

Update: I think we'll have to set GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA to 0 in order to keep sending pings to keep the connection alive. I'll post here if it fixes our problem.

murgatroid99 commented 2 years ago

The options in that image will not enable keepalive. To enable keepalive, you need to set the option grpc.keepalive_time_ms, which determines when pings will be sent. The grpc.keepalive_timeout_ms option specifies how long the client will wait for a ping response before deciding that the connection has been lost. The other two options are not understood by @grpc/grpc-js; it will always send pings without data.

hamxabaig commented 2 years ago

@murgatroid99 Thank you so much for replying, I'll try grpc.keepalive_time_ms and let you know if it works.

hkharmfulbear commented 2 years ago

This might solve your problem. Please check your runtimeconfig and set "use": { "emulator": "false" }. I'm sorry if you have already tried this.

alexsanderp commented 2 years ago

Guys, did you manage to solve this problem?

hamxabaig commented 2 years ago

Our problem was the AWS load balancer: it was dropping the connection after 60 seconds. Changing its config fixed it.

DominikSchwager commented 2 years ago

Our problem was the AWS load balancer: it was dropping the connection after 60 seconds. Changing its config fixed it.

Would you mind sharing how you changed the configuration? We are encountering the same problem.

krypticJinkle commented 2 years ago

@DominikSchwager you can go to the load balancer section in the AWS console, select the LB, and after a bit of scrolling you'll see the "Attributes" section in the bottom half of the screen. There you can change the idle timeout. But even that didn't work for me. Did anyone solve this by some other method?

jovanmilenkoski commented 2 years ago

If you use the old grpc package with an AWS ALB and grpc.keepalive_time_ms set lower than the ALB idle timeout setting, the issue is not present.

So that tells me the issue lies somewhere in the @grpc/grpc-js package.

krypticJinkle commented 2 years ago

Add keepalive parameters with appropriate values.

dennisameling commented 1 year ago

We're running into a similar situation: an AWS Lambda makes some gRPC calls to one of our services, which sits behind an AWS ALB (Application Load Balancer) with its idle timeout configured at 60 seconds.

Most of our calls work without issues, but every few hours, we're seeing the Received RST_STREAM with code 2 (Internal server error) error pop up. So it's a bit hard to reproduce.

Things we've observed so far:

Here's a debug log leading up to the issue:

D 2023-10-11T02:53:42.692Z | channel | (2) dns:REDACTED:443 createResolvingCall [152] method="/REDACTED", deadline=2023-10-11T02:53:47.692Z
D 2023-10-11T02:53:42.730Z | resolving_call | [152] Created
D 2023-10-11T02:53:42.730Z | resolving_call | [152] Deadline will be reached in 4962ms
D 2023-10-11T02:53:42.730Z | resolving_call | [152] Deadline: 2023-10-11T02:53:47.692Z
D 2023-10-11T02:53:42.730Z | resolving_call | [152] start called
D 2023-10-11T02:53:42.731Z | resolving_call | [152] startRead called
D 2023-10-11T02:53:42.750Z | resolving_call | [152] halfClose called
D 2023-10-11T02:53:42.750Z | resolving_call | [152] write() called with message of length 38
D 2023-10-11T02:53:42.770Z | channel | (2) dns:REDACTED:443 createLoadBalancingCall [154] method="/REDACTED"
D 2023-10-11T02:53:42.770Z | channel | (2) dns:REDACTED:443 createRetryingCall [153] method="/REDACTED"
D 2023-10-11T02:53:42.770Z | resolving_call | [152] Created child [153]
D 2023-10-11T02:53:42.770Z | retrying_call | [153] start called
D 2023-10-11T02:53:42.771Z | load_balancing_call | [154] Pick called
D 2023-10-11T02:53:42.771Z | load_balancing_call | [154] Pick result: COMPLETE subchannel: (35) 10.130.2.141:443 status: undefined undefined
D 2023-10-11T02:53:42.771Z | load_balancing_call | [154] start called
D 2023-10-11T02:53:42.771Z | load_balancing_call | [154] startRead called
D 2023-10-11T02:53:42.771Z | retrying_call | [153] Created child call [154] for attempt 1
D 2023-10-11T02:53:42.771Z | retrying_call | [153] startRead called
D 2023-10-11T02:53:42.791Z | transport_flowctrl | (44) 10.130.2.141:443 local window size: 65066 remote window size: 2147483127
D 2023-10-11T02:53:42.791Z | transport_internals | (44) 10.130.2.141:443 session.closed=false session.destroyed=false session.socket.destroyed=false
D 2023-10-11T02:53:42.810Z | keepalive | (44) 10.130.2.141:443 Sending ping with timeout 20000ms
D 2023-10-11T02:53:42.810Z | load_balancing_call | [154] Created child call [155]
D 2023-10-11T02:53:42.810Z | retrying_call | [153] write() called with message of length 43
D 2023-10-11T02:53:42.870Z | load_balancing_call | [154] write() called with message of length 43
D 2023-10-11T02:53:42.870Z | retrying_call | [153] halfClose called
D 2023-10-11T02:53:42.870Z | subchannel_call | [155] sending data chunk of length 43
D 2023-10-11T02:53:42.870Z | subchannel_call | [155] write() called with message of length 43
D 2023-10-11T02:53:42.992Z | subchannel_call | [155] ended with status: code=13 details="Received RST_STREAM with code 2 (Internal server error)"
D 2023-10-11T02:53:42.992Z | subchannel_call | [155] HTTP/2 stream closed with code 2
D 2023-10-11T02:53:43.030Z | load_balancing_call | [154] ended with status: code=13 details="Received RST_STREAM with code 2 (Internal server error)"
D 2023-10-11T02:53:43.030Z | load_balancing_call | [154] Received status
D 2023-10-11T02:53:43.030Z | retrying_call | [153] Received status from child [154]
D 2023-10-11T02:53:43.030Z | retrying_call | [153] state=TRANSPARENT_ONLY handling status with progress PROCESSED from child [154] in state ACTIVE
D 2023-10-11T02:53:43.050Z | resolving_call | [152] ended with status: code=13 details="Received RST_STREAM with code 2 (Internal server error)"
D 2023-10-11T02:53:43.050Z | resolving_call | [152] Received status
D 2023-10-11T02:53:43.050Z | retrying_call | [153] ended with status: code=13 details="Received RST_STREAM with code 2 (Internal server error)"
D 2023-10-11T02:53:43.132Z | keepalive | (44) 10.130.2.141:443 Ping failed with error HTTP2 ping cancelled
D 2023-10-11T02:53:43.170Z | keepalive | (44) 10.130.2.141:443 Received ping response

As you can see, the keepalive ping is sent as expected (Sending ping with timeout 20000ms), and this happens consistently, but it doesn't seem to help with the RST_STREAM error.

I'll go back to 1.8.0 for now and slowly work towards the problematic 1.8.21 so we can identify which version introduced this problem. This might take a couple of days, though, because the error only pops up every few hours.

murgatroid99 commented 1 year ago

To clarify, does the ALB drop the connection after 60 seconds of inactivity, or just 60 seconds after it is created? If it is inactivity, is the client inactive for the 60 seconds preceding that log? If so, the keepalive setting is likely not actually helping, because keepalive pings are only sent if there is a pending request. That behavior can be changed with the channel option grpc.keepalive_permit_without_calls. Note that this will just cause it to keep pinging while the client is not actively using the connection, which may eventually result in the ALB dropping the connection anyway. However, if that connection drop is detected by the keepalives, it still shouldn't result in the kind of request failure you are seeing.
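
For reference, a sketch of the options being discussed, with illustrative values (the 50-second and 20-second figures match the ones visible in the logs later in this thread; treat them as an example, not a recommendation):

const channelOptions = {
  'grpc.keepalive_time_ms': 50000,      // send a keepalive ping every 50s (example value)
  'grpc.keepalive_timeout_ms': 20000,   // consider the connection dead if no ping ack within 20s
  // Also send keepalive pings while no call is in flight, so an idle
  // connection drop is detected by the keepalive rather than by a failing RPC.
  'grpc.keepalive_permit_without_calls': 1,
};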

dennisameling commented 1 year ago

Sorry for the late follow-up here.

To clarify, does the ALB drop the connection after 60 seconds of inactivity, or just 60 seconds after it is created?

60 seconds of inactivity, that is. I wonder if keepalive_permit_without_calls would be of any help here, also considering my notes below. What do you think?

I wanted to capture the logs of the full lifetime of a problematic Lambda. That was more difficult than expected, because we couldn't filter the logs by something like a "Lambda unique ID" while multiple Lambdas were running in parallel. I added such an ID to the logs, which allowed me to capture the full lifetime:

lambda-grpc-logs.csv

Some things that stand out to me:

I wonder if we could prevent the GOAWAY somehow or if our code (or the library?) would need to be tuned to better handle this scenario. What do you think?

Subset of logs in the seconds leading up to the RST_STREAM error
22/10/2023 22:34;D 2023-10-22T20:34:52.466Z | keepalive | (23) 10.132.1.153:443 Starting keepalive timer for 50000ms
22/10/2023 22:34;D 2023-10-22T20:34:52.485Z | subchannel | (6) 10.132.1.153:443 CONNECTING -> READY
22/10/2023 22:34;D 2023-10-22T20:34:52.528Z | transport | (23) 10.132.1.153:443 local settings acknowledged by remote: {"headerTableSize":4096,"enablePush":true,"initialWindowSize":65535,"maxFrameSize":16384,"maxConcurrentStreams":4294967295,"maxHeaderListSize":4294967295,"maxHeaderSize":4294967295,"enableConnectProtocol":false}
22/10/2023 22:37;D 2023-10-22T20:37:14.627Z | subchannel_refcount | (6) 10.132.1.153:443 refcount 1 -> 0
22/10/2023 22:37;D 2023-10-22T20:37:14.646Z | subchannel | (6) 10.132.1.153:443 READY -> IDLE
22/10/2023 22:37;D 2023-10-22T20:37:14.666Z | keepalive | (22) 10.132.2.131:443 Sending ping with timeout 20000ms
22/10/2023 22:37;D 2023-10-22T20:37:14.746Z | keepalive | (23) 10.132.1.153:443 Sending ping with timeout 20000ms
22/10/2023 22:37;Handling authorization request (event version 2.0) for routeArn arn:aws:execute-api:us-west-2:REDACTED:REDACTED/production-default/POST/REDACTED/
22/10/2023 22:37;D 2023-10-22T20:37:14.766Z | channel | (1) dns:REDACTED_INTERNAL_HOSTNAME:443 createResolvingCall [536] method="/REDACTED/Authenticate", deadline=2023-10-22T20:37:19.766Z
22/10/2023 22:37;D 2023-10-22T20:37:14.786Z | resolving_call | [536] Created
22/10/2023 22:37;D 2023-10-22T20:37:14.806Z | resolving_call | [536] halfClose called
22/10/2023 22:37;D 2023-10-22T20:37:14.806Z | resolving_call | [536] write() called with message of length 212
22/10/2023 22:37;D 2023-10-22T20:37:14.806Z | resolving_call | [536] startRead called
22/10/2023 22:37;D 2023-10-22T20:37:14.806Z | resolving_call | [536] start called
22/10/2023 22:37;D 2023-10-22T20:37:14.806Z | resolving_call | [536] Deadline will be reached in 4960ms
22/10/2023 22:37;D 2023-10-22T20:37:14.806Z | resolving_call | [536] Deadline: 2023-10-22T20:37:19.766Z
22/10/2023 22:37;D 2023-10-22T20:37:14.807Z | channel | (1) dns:REDACTED_INTERNAL_HOSTNAME:443 createRetryingCall [537] method="/REDACTED/Authenticate"
22/10/2023 22:37;D 2023-10-22T20:37:14.826Z | load_balancing_call | [538] startRead called
22/10/2023 22:37;D 2023-10-22T20:37:14.826Z | retrying_call | [537] startRead called
22/10/2023 22:37;D 2023-10-22T20:37:14.826Z | load_balancing_call | [538] Pick result: COMPLETE subchannel: (21) 10.132.2.131:443 status: undefined undefined
22/10/2023 22:37;D 2023-10-22T20:37:14.826Z | load_balancing_call | [538] Pick called
22/10/2023 22:37;D 2023-10-22T20:37:14.826Z | load_balancing_call | [538] start called
22/10/2023 22:37;D 2023-10-22T20:37:14.826Z | retrying_call | [537] Created child call [538] for attempt 1
22/10/2023 22:37;D 2023-10-22T20:37:14.826Z | channel | (1) dns:REDACTED_INTERNAL_HOSTNAME:443 createLoadBalancingCall [538] method="/REDACTED/Authenticate"
22/10/2023 22:37;D 2023-10-22T20:37:14.826Z | retrying_call | [537] start called
22/10/2023 22:37;D 2023-10-22T20:37:14.826Z | resolving_call | [536] Created child [537]
22/10/2023 22:37;D 2023-10-22T20:37:14.827Z | transport_flowctrl | (22) 10.132.2.131:443 local window size: 65399 remote window size: 2147483604
22/10/2023 22:37;D 2023-10-22T20:37:14.846Z | load_balancing_call | [538] Created child call [539]
22/10/2023 22:37;D 2023-10-22T20:37:14.827Z | transport_internals | (22) 10.132.2.131:443 session.closed=false session.destroyed=false session.socket.destroyed=false
22/10/2023 22:37;D 2023-10-22T20:37:14.866Z | retrying_call | [537] halfClose called
22/10/2023 22:37;D 2023-10-22T20:37:14.866Z | subchannel_call | [539] sending data chunk of length 217
22/10/2023 22:37;D 2023-10-22T20:37:14.866Z | subchannel_call | [539] write() called with message of length 217
22/10/2023 22:37;D 2023-10-22T20:37:14.866Z | load_balancing_call | [538] write() called with message of length 217
22/10/2023 22:37;D 2023-10-22T20:37:14.866Z | retrying_call | [537] write() called with message of length 217
22/10/2023 22:37;D 2023-10-22T20:37:15.246Z | pick_first | READY -> IDLE
22/10/2023 22:37;D 2023-10-22T20:37:15.246Z | subchannel_refcount | (21) 10.132.2.131:443 refcount 3 -> 2
22/10/2023 22:37;D 2023-10-22T20:37:15.246Z | subchannel | (21) 10.132.2.131:443 READY -> IDLE
22/10/2023 22:37;D 2023-10-22T20:37:15.028Z | transport | (22) 10.132.2.131:443 connection closed by GOAWAY with code 0
22/10/2023 22:37;D 2023-10-22T20:37:15.266Z | connectivity_state | (1) dns:REDACTED_INTERNAL_HOSTNAME:443 READY -> IDLE
22/10/2023 22:37;D 2023-10-22T20:37:15.266Z | resolving_load_balancer | dns:REDACTED_INTERNAL_HOSTNAME:443 READY -> IDLE
22/10/2023 22:37;D 2023-10-22T20:37:15.267Z | resolving_load_balancer | dns:REDACTED_INTERNAL_HOSTNAME:443 IDLE -> CONNECTING
22/10/2023 22:37;D 2023-10-22T20:37:15.267Z | dns_resolver | Looking up DNS hostname REDACTED_INTERNAL_HOSTNAME
22/10/2023 22:37;D 2023-10-22T20:37:15.286Z | connectivity_state | (2) dns:REDACTED_INTERNAL_HOSTNAME:443 IDLE -> CONNECTING
22/10/2023 22:37;D 2023-10-22T20:37:15.286Z | resolving_load_balancer | dns:REDACTED_INTERNAL_HOSTNAME:443 IDLE -> CONNECTING
22/10/2023 22:37;D 2023-10-22T20:37:15.286Z | dns_resolver | Looking up DNS hostname REDACTED_INTERNAL_HOSTNAME
22/10/2023 22:37;D 2023-10-22T20:37:15.286Z | connectivity_state | (2) dns:REDACTED_INTERNAL_HOSTNAME:443 READY -> IDLE
22/10/2023 22:37;D 2023-10-22T20:37:15.286Z | resolving_load_balancer | dns:REDACTED_INTERNAL_HOSTNAME:443 READY -> IDLE
22/10/2023 22:37;D 2023-10-22T20:37:15.286Z | pick_first | READY -> IDLE
22/10/2023 22:37;D 2023-10-22T20:37:15.286Z | subchannel_refcount | (21) 10.132.2.131:443 refcount 2 -> 1
22/10/2023 22:37;D 2023-10-22T20:37:15.286Z | connectivity_state | (1) dns:REDACTED_INTERNAL_HOSTNAME:443 IDLE -> CONNECTING
22/10/2023 22:37;D 2023-10-22T20:37:15.289Z | retrying_call | [537] ended with status: code=13 details="Received RST_STREAM with code 2 (Internal server error)"
22/10/2023 22:37;D 2023-10-22T20:37:15.289Z | retrying_call | [537] state=TRANSPARENT_ONLY handling status with progress PROCESSED from child [538] in state ACTIVE
22/10/2023 22:37;D 2023-10-22T20:37:15.289Z | retrying_call | [537] Received status from child [538]
22/10/2023 22:37;D 2023-10-22T20:37:15.289Z | load_balancing_call | [538] ended with status: code=13 details="Received RST_STREAM with code 2 (Internal server error)"
22/10/2023 22:37;D 2023-10-22T20:37:15.289Z | load_balancing_call | [538] Received status
22/10/2023 22:37;D 2023-10-22T20:37:15.289Z | subchannel_call | [539] ended with status: code=13 details="Received RST_STREAM with code 2 (Internal server error)"
22/10/2023 22:37;D 2023-10-22T20:37:15.289Z | subchannel_call | [539] HTTP/2 stream closed with code 2
22/10/2023 22:37;D 2023-10-22T20:37:15.290Z | resolving_call | [536] ended with status: code=13 details="Received RST_STREAM with code 2 (Internal server error)"
22/10/2023 22:37;D 2023-10-22T20:37:15.289Z | resolving_call | [536] Received status
22/10/2023 22:37;{"hostname":"arn:aws:lambda:us-west-2:REDACTED:function:production-lambda-authorizer","lambda":{"arn":"arn:aws:lambda:us-west-2:REDACTED:function:production-lambda-authorizer","request_id":"268d203b-b278-4376-8f06-1b3f467fe629"},"code":13,"level":"error","service":"lambda-authorizer","details":"Received RST_STREAM with code 2 (Internal server error)","lambda_unique_id":"543e4bb7-5222-4292-9a2e-7997451b9244","status":"info","timestamp":1698007035347,"lambda_boot_time":"2023-10-22T19:58:58.662Z"}
22/10/2023 22:37;D 2023-10-22T20:37:15.369Z | keepalive | (22) 10.132.2.131:443 Ping failed with error HTTP2 ping cancelled
22/10/2023 22:37;D 2023-10-22T20:37:15.426Z | keepalive | (22) 10.132.2.131:443 Starting keepalive timer for 50000ms
22/10/2023 22:37;D 2023-10-22T20:37:15.426Z | keepalive | (22) 10.132.2.131:443 Received ping response
22/10/2023 22:37;D 2023-10-22T20:37:15.426Z | transport | (22) 10.132.2.131:443 session closed
22/10/2023 22:37;D 2023-10-22T20:37:15.448Z | dns_resolver | Resolved addresses for target dns:REDACTED_INTERNAL_HOSTNAME:443: [10.132.2.131:443,10.132.1.153:443]
22/10/2023 22:37;D 2023-10-22T20:37:15.486Z | subchannel | (24) 10.132.1.153:443 Subchannel constructed with options { "grpc.keepalive_time_ms": 50000, "grpc.keepalive_permit_without_calls": 1}
22/10/2023 22:37;D 2023-10-22T20:37:15.506Z | connectivity_state | (1) dns:REDACTED_INTERNAL_HOSTNAME:443 CONNECTING -> CONNECTING
22/10/2023 22:37;D 2023-10-22T20:37:15.506Z | resolving_load_balancer | dns:REDACTED_INTERNAL_HOSTNAME:443 CONNECTING -> CONNECTING
22/10/2023 22:37;D 2023-10-22T20:37:15.506Z | pick_first | IDLE -> CONNECTING
22/10/2023 22:37;D 2023-10-22T20:37:15.506Z | pick_first | Start connecting to subchannel with address 10.132.2.131:443
22/10/2023 22:37;D 2023-10-22T20:37:15.506Z | subchannel_refcount | (24) 10.132.1.153:443 refcount 1 -> 2
22/10/2023 22:37;D 2023-10-22T20:37:15.506Z | subchannel_refcount | (21) 10.132.2.131:443 refcount 1 -> 2
22/10/2023 22:37;D 2023-10-22T20:37:15.506Z | subchannel_refcount | (24) 10.132.1.153:443 refcount 0 -> 1
22/10/2023 22:37;D 2023-10-22T20:37:15.506Z | transport | dns:REDACTED_INTERNAL_HOSTNAME:443 creating HTTP/2 session to 10.132.2.131:443
22/10/2023 22:37;D 2023-10-22T20:37:15.506Z | subchannel | (21) 10.132.2.131:443 IDLE -> CONNECTING
22/10/2023 22:37;D 2023-10-22T20:37:15.508Z | connectivity_state | (2) dns:REDACTED_INTERNAL_HOSTNAME:443 CONNECTING -> CONNECTING
22/10/2023 22:37;D 2023-10-22T20:37:15.508Z | resolving_load_balancer | dns:REDACTED_INTERNAL_HOSTNAME:443 CONNECTING -> CONNECTING
22/10/2023 22:37;D 2023-10-22T20:37:15.508Z | pick_first | IDLE -> CONNECTING
22/10/2023 22:37;D 2023-10-22T20:37:15.508Z | subchannel_refcount | (24) 10.132.1.153:443 refcount 2 -> 3
22/10/2023 22:37;D 2023-10-22T20:37:15.508Z | subchannel_refcount | (21) 10.132.2.131:443 refcount 2 -> 3
22/10/2023 22:37;D 2023-10-22T20:37:15.508Z | dns_resolver | Resolved addresses for target dns:REDACTED_INTERNAL_HOSTNAME:443: [10.132.2.131:443,10.132.1.153:443]
22/10/2023 22:37;D 2023-10-22T20:37:15.509Z | subchannel_refcount | (24) 10.132.1.153:443 refcount 4 -> 3
22/10/2023 22:37;D 2023-10-22T20:37:15.509Z | subchannel_refcount | (21) 10.132.2.131:443 refcount 4 -> 3
22/10/2023 22:37;D 2023-10-22T20:37:15.509Z | subchannel_refcount | (24) 10.132.1.153:443 refcount 3 -> 4
22/10/2023 22:37;D 2023-10-22T20:37:15.509Z | subchannel_refcount | (21) 10.132.2.131:443 refcount 3 -> 4
22/10/2023 22:37;D 2023-10-22T20:37:15.509Z | connectivity_state | (1) dns:REDACTED_INTERNAL_HOSTNAME:443 CONNECTING -> CONNECTING
22/10/2023 22:37;D 2023-10-22T20:37:15.509Z | resolving_load_balancer | dns:REDACTED_INTERNAL_HOSTNAME:443 CONNECTING -> CONNECTING
22/10/2023 22:37;D 2023-10-22T20:37:15.509Z | pick_first | CONNECTING -> CONNECTING
22/10/2023 22:37;D 2023-10-22T20:37:15.509Z | subchannel_refcount | (24) 10.132.1.153:443 refcount 4 -> 3
22/10/2023 22:37;D 2023-10-22T20:37:15.509Z | subchannel_refcount | (21) 10.132.2.131:443 refcount 4 -> 3
22/10/2023 22:37;D 2023-10-22T20:37:15.509Z | subchannel_refcount | (24) 10.132.1.153:443 refcount 3 -> 4
22/10/2023 22:37;D 2023-10-22T20:37:15.509Z | subchannel_refcount | (21) 10.132.2.131:443 refcount 3 -> 4
22/10/2023 22:37;D 2023-10-22T20:37:15.510Z | resolving_load_balancer | dns:REDACTED_INTERNAL_HOSTNAME:443 CONNECTING -> CONNECTING
22/10/2023 22:37;D 2023-10-22T20:37:15.509Z | pick_first | CONNECTING -> CONNECTING
22/10/2023 22:37;D 2023-10-22T20:37:15.566Z | transport | (23) 10.132.1.153:443 session closed
22/10/2023 22:37;D 2023-10-22T20:37:15.510Z | connectivity_state | (2) dns:REDACTED_INTERNAL_HOSTNAME:443 CONNECTING -> CONNECTING
22/10/2023 22:37;D 2023-10-22T20:37:15.766Z | keepalive | (25) 10.132.2.131:443 Starting keepalive timer for 50000ms
22/10/2023 22:37;D 2023-10-22T20:37:15.786Z | subchannel_refcount | (21) 10.132.2.131:443 refcount 4 -> 3
22/10/2023 22:37;D 2023-10-22T20:37:15.786Z | subchannel_refcount | (21) 10.132.2.131:443 refcount 3 -> 4
22/10/2023 22:37;D 2023-10-22T20:37:15.786Z | pick_first | Pick subchannel with address 10.132.2.131:443
22/10/2023 22:37;D 2023-10-22T20:37:15.786Z | subchannel | (21) 10.132.2.131:443 CONNECTING -> READY
22/10/2023 22:37;D 2023-10-22T20:37:15.787Z | connectivity_state | (2) dns:REDACTED_INTERNAL_HOSTNAME:443 CONNECTING -> READY
22/10/2023 22:37;D 2023-10-22T20:37:15.787Z | resolving_load_balancer | dns:REDACTED_INTERNAL_HOSTNAME:443 CONNECTING -> READY
22/10/2023 22:37;D 2023-10-22T20:37:15.787Z | pick_first | CONNECTING -> READY
22/10/2023 22:37;D 2023-10-22T20:37:15.787Z | subchannel_refcount | (24) 10.132.1.153:443 refcount 2 -> 1
22/10/2023 22:37;D 2023-10-22T20:37:15.787Z | subchannel_refcount | (21) 10.132.2.131:443 refcount 4 -> 3
22/10/2023 22:37;D 2023-10-22T20:37:15.787Z | subchannel_refcount | (21) 10.132.2.131:443 refcount 3 -> 4
22/10/2023 22:37;D 2023-10-22T20:37:15.787Z | pick_first | Pick subchannel with address 10.132.2.131:443
22/10/2023 22:37;D 2023-10-22T20:37:15.787Z | connectivity_state | (1) dns:REDACTED_INTERNAL_HOSTNAME:443 CONNECTING -> READY
22/10/2023 22:37;D 2023-10-22T20:37:15.787Z | resolving_load_balancer | dns:REDACTED_INTERNAL_HOSTNAME:443 CONNECTING -> READY
22/10/2023 22:37;D 2023-10-22T20:37:15.787Z | pick_first | CONNECTING -> READY
22/10/2023 22:37;D 2023-10-22T20:37:15.786Z | subchannel_refcount | (24) 10.132.1.153:443 refcount 3 -> 2
22/10/2023 22:37;2023-10-22T20:37:15.789Z  268d203b-b278-4376-8f06-1b3f467fe629    ERROR   [dd.trace_id=3303256522662915096 dd.span_id=6980096509857599035] Invoke Error   {"errorType":"Error","errorMessage":"13 INTERNAL: Received RST_STREAM with code 2 (Internal server error)","code":13,"details":"Received RST_STREAM with code 2 (Internal server error)","metadata":{},"level":"error","lambda_unique_id":"543e4bb7-5222-4292-9a2e-7997451b9244","lambda_boot_time":"2023-10-22T19:58:58.662Z","stack":["Error: 13 INTERNAL: Received RST_STREAM with code 2 (Internal server error)"," at callErrorFromStatus (/var/task/node_modules/@grpc/grpc-js/build/src/call.js:31:19)"," at Object.onReceiveStatus (/var/task/node_modules/@grpc/grpc-js/build/src/client.js:192:76)"," at Object.onReceiveStatus (/var/task/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:360:141)"," at Object.onReceiveStatus (/var/task/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:323:181)"," at /var/task/node_modules/@grpc/grpc-js/build/src/resolving-call.js:99:78"," at process.processTicksAndRejections (node:internal/process/task_queues:77:11)","for call at"," at ServiceClientImpl.makeUnaryRequest (/var/task/node_modules/@grpc/grpc-js/build/src/client.js:160:32)"," at ServiceClientImpl. (/var/task/node_modules/@grpc/grpc-js/build/src/make-client.js:105:19)"," at /opt/nodejs/node_modules/dd-trace/packages/datadog-instrumentations/src/grpc/client.js:159:27"," at run (/opt/nodejs/node_modules/dd-trace/packages/diagnostics_channel/src/index.js:77:22)"," at AsyncResourceStorage.run (/opt/nodejs/node_modules/dd-trace/packages/datadog-core/src/storage/async_resource.js:69:14)"," at /opt/nodejs/node_modules/dd-trace/packages/diagnostics_channel/src/index.js:106:18"," at Channel.runStores.ActiveChannelPrototype.runStores (/opt/nodejs/node_modules/dd-trace/packages/diagnostics_channel/src/index.js:86:12)"," at callMethod (/opt/nodejs/node_modules/dd-trace/packages/datadog-instrumentations/src/grpc/client.js:149:23)"," at ServiceClientImpl.wrapped [as authenticate] (/opt/nodejs/node_modules/dd-trace/packages/datadog-instrumentations/src/grpc/client.js:96:12)"," at /var/task/api_key_authorizer_lambda.js:209:43"]}
22/10/2023 22:37;D 2023-10-22T20:37:15.907Z | transport | (25) 10.132.2.131:443 local settings acknowledged by remote: {"headerTableSize":4096,"enablePush":true,"initialWindowSize":65535,"maxFrameSize":16384,"maxConcurrentStreams":4294967295,"maxHeaderListSize":4294967295,"maxHeaderSize":4294967295,"enableConnectProtocol":false}

murgatroid99 commented 1 year ago

The GOAWAY is not the problem here. A GOAWAY with code 0 indicates "no error", i.e. that the client should stop creating new streams on the connection, but existing streams can still finish. The RST_STREAM with code 2 is "Internal server error", so in my opinion that indicates that there is a bug in the ALB. If the server is intentionally not processing that stream because it raced with the GOAWAY, it should respond with an RST_STREAM with code 7, which means "refused stream". In that case the gRPC library would transparently retry the request on a new connection, and you would never see the error. The ping failure just means that the ALB closed the connection, which is unsurprising if it told the client to go away and it closed the only open stream.

To summarize, I believe that the problem here is the RST_STREAM with error code 2, and that the ALB is behaving incorrectly by sending it.

dennisameling commented 1 year ago

Thanks once again for your quick reply and the context it provides. I'll reach out to our Account Manager at AWS to see if we can engage the ALB team somehow. I'd really like to figure out if this is indeed an ALB bug and if AWS can fix it. To be continued! 👀

As a workaround, I'm now just creating and closing gRPC clients on every Lambda invocation. In the example I provided above, that means creating and closing 177 clients in a 70-minute timespan instead of maintaining one long-lived client for the entire 70 minutes. Not ideal because of the overhead, but we haven't seen the RST_STREAM issue since.
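
For completeness, a rough sketch of that workaround, assuming a hypothetical makeClient() factory that wraps the client construction shown earlier in the thread, a unary authenticate method (as in the stack trace above), and an assumed request payload shape:

// Hypothetical Lambda handler: create a fresh client per invocation and close it
// when done, so no connection ever outlives the ALB idle timeout.
exports.handler = async (event) => {
  const client = makeClient(); // assumed factory returning a @grpc/grpc-js client
  try {
    return await new Promise((resolve, reject) => {
      client.authenticate(event.payload, (err, response) => { // request shape is assumed
        if (err) reject(err);
        else resolve(response);
      });
    });
  } finally {
    client.close(); // releases the underlying channel immediately
  }
};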

mehulparmariitr commented 7 months ago

This issue has been open since 2021 and is still not solved?

We are also facing this issue while querying Google BigQuery projects from the Power BI desktop app on Windows, using the Google BigQuery connector.

I am getting this error for big projects with a large number of rows: Failed to save modifications to the server. Error returned: 'OLE DB or ODBC error: [DataSource.Error] ERROR [HY000] [Microsoft][BigQuery] (310) Storage API Error: Error: { code: UNKNOWN, message: Stream removed }. '. I am not using the ODBC driver but the Google BigQuery connector to connect to GCP BigQuery, though I do have the Simba ODBC driver installed in my VM.

PS: I think the Google BigQuery connector internally uses the Simba ODBC driver.

murgatroid99 commented 7 months ago

@mehulparmariitr This issue cannot be "solved" entirely because not every instance of this error is the fault of this library.

And it looks like you are not even getting this error at all. This issue is about errors with the code "INTERNAL" and the text "Received RST_STREAM with error code 2". Your error has the code "UNKNOWN" and the text "Stream removed". That error is in fact mostly solved in more recent versions of this library, so I recommend upgrading the dependency on @grpc/grpc-js if possible.

mehulparmariitr commented 7 months ago

@murgatroid99 This issue in Power BI is also related to an RPC error. I have seen a few people reporting the same error on this public ticket and GitHub issue: https://issuetracker.google.com/issues/240305081

Unfortunately, the fix needs to be implemented by the Microsoft team in the Google BigQuery connector in Power BI, to add retries.

dennisameling commented 4 months ago

Just ensured that both our client (grpc-node) and server (grpc-java) are on the latest versions: 1.11.1 and 1.65.1, respectively. Unfortunately the issue remains.

image

To summarize, I believe that the problem here is the RST_STREAM with error code 2, and that the ALB is behaving incorrectly by sending it.

I would still like to explore this a bit further. In the meantime, we found out that our company doesn't have an AWS Technical Support contract, as the cheapest plan already costs 3% of our monthly AWS bill. It might be worth getting one at this point, at least for 1-2 months while we debug this. I'll discuss it with our team.

I was just going through some more internal logs of our Java-based gRPC clients and noticed a similar error there:

Error on connection (Receiver): io.grpc.StatusRuntimeException: INTERNAL: RST_STREAM closed stream. HTTP/2 error code: PROTOCOL_ERROR

... however, this was just a short one-day spike while we were having an outage and the server was crashing/restarting a couple of times. I did see someone else having similar issues with Go-based clients, and for them, removing the metadata that the AWS ALB added fixed it. Something to consider, I guess.

At this point I do believe that the AWS ALB is indeed causing the issues here, and not necessarily grpc-node. I do wonder, however, why this is not a problem with our Java-based gRPC clients and only with our Node-based Lambda client. I'll keep digging and will report back here later. If anyone has thoughts or ideas in the meantime, they would be greatly appreciated. 🙏🏼

rohitdigi commented 2 months ago

This problem was fixed for us by reducing the keepalive ping interval to 10 seconds, as the connection breaks after 50 seconds if no data is transferred between client and server on a secure channel.

luca992 commented 1 month ago

This problem was fixed for us by reducing the keepalive ping interval to 10 seconds, as the connection breaks after 50 seconds if no data is transferred between client and server on a secure channel.

Can you post your full config?