`@grpc/grpc-js` throw 'Received RST_STREAM with code 0' with retry enabled

nooop3 commented 1 year ago

Problem description

@grpc/grpc-js throw 'Received RST_STREAM with code 0' with retry enabled

Reproduction steps

Start an example HelloWorld Golang grpc server in Kubernetes and enable the istio sidecar
To reduce the connection reset(reset reason: connection termination), we set the max_concurrent_streams: 256
Making multi-calls same time by nodeJs with large responses (like response size 6k/per request, and 1000 requests)
Got Error: 13 INTERNAL: Received RST_STREAM with code 0

Environment

Kubernetes with istio envoy sidecar enabled(Thanks orbstack Kubernetes environment, I can debug this error in local VSCode environment)
- NodeJs -> client sidecar(envoy) -> server
- NodeJs -> server sidecar(envoy) -> server
- NodeJs -> client sidecar(envoy) -> server sidecar(envoy) -> server
Node version: v16.18.1
Node installation method: Docker image
If applicable, compiler version: N/A
Package name and version: @grpc/grpc-js version > 1.7.3

Additional context

Logs with GRPC_TRACE=all GRPC_VERBOSITY=DEBUG

D 2023-09-09T23:58:35.160Z | load_balancing_call | [1513] Received metadata
D 2023-09-09T23:58:35.160Z | retrying_call | [1512] Received metadata from child [1513]
D 2023-09-09T23:58:35.160Z | retrying_call | [1512] Committing call [1513] at index 0
D 2023-09-09T23:58:35.160Z | resolving_call | [256] Received metadata
D 2023-09-09T23:58:35.160Z | subchannel_call | [3256] HTTP/2 stream closed with code 0
D 2023-09-09T23:58:35.160Z | subchannel_call | [3256] ended with status: code=13 details="Received RST_STREAM with code 0"
D 2023-09-09T23:58:35.160Z | load_balancing_call | [1513] Received status
D 2023-09-09T23:58:35.160Z | load_balancing_call | [1513] ended with status: code=13 details="Received RST_STREAM with code 0"
D 2023-09-09T23:58:35.160Z | retrying_call | [1512] Received status from child [1513]
D 2023-09-09T23:58:35.160Z | retrying_call | [1512] state=COMMITTED handling status with progress PROCESSED from child [1513] in state ACTIVE
D 2023-09-09T23:58:35.160Z | retrying_call | [1512] ended with status: code=13 details="Received RST_STREAM with code 0"
D 2023-09-09T23:58:35.160Z | resolving_call | [256] Received status
D 2023-09-09T23:58:35.160Z | resolving_call | [256] ended with status: code=13 details="Received RST_STREAM with code 0"

Error: 13 INTERNAL: Received RST_STREAM with code 0
    at callErrorFromStatus (/[redacted]/grpc-node/examples/node_modules/@grpc/grpc-js/build/src/call.js:31:19)
    at Object.onReceiveStatus (/[redacted]/grpc-node/examples/node_modules/@grpc/grpc-js/build/src/client.js:192:76)
    at Object.onReceiveStatus (/[redacted]/grpc-node/examples/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:360:141)
    at Object.onReceiveStatus (/[redacted]/grpc-node/examples/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:323:181)
    at /[redacted]/grpc-node/examples/node_modules/@grpc/grpc-js/build/src/resolving-call.js:94:78
    at processTicksAndRejections (node:internal/process/task_queues:78:11)
for call at
    at ServiceClientImpl.makeUnaryRequest (/[redacted]/grpc-node/examples/node_modules/@grpc/grpc-js/build/src/client.js:160:32)
    at ServiceClientImpl.<anonymous> (/[redacted]/grpc-node/examples/node_modules/@grpc/grpc-js/build/src/make-client.js:105:19)
    at /[redacted]/grpc-node/examples/helloworld/dynamic_codegen/client.js:73:14
    at new Promise (<anonymous>)
    at main (/[redacted]/grpc-node/examples/helloworld/dynamic_codegen/client.js:72:21)
    at Object.<anonymous> (/[redacted]/grpc-node/examples/helloworld/dynamic_codegen/client.js:102:1)
    at Module._compile (node:internal/modules/cjs/loader:1155:14)
    at Object.Module._extensions..js (node:internal/modules/cjs/loader:1209:10)
    at Module.load (node:internal/modules/cjs/loader:1033:32)
    at Function.Module._load (node:internal/modules/cjs/loader:868:12) {
  code: 13,
  details: 'Received RST_STREAM with code 0',
  metadata: Metadata { internalRepr: Map(0) {}, options: {} }
}

Methods I can apply

Disable grpc.enable_retries by set grpc.enable_retries = 0
Downgrade to the version 1.7.3

It seems we can just ignore this error, and doesn't affect the results. RST_STREAM with code NO_ERROR in HTTP2 RFC Error Codes

There are some similar cases in other projects:

So can we just ignore this NO_ERROR in the Node js SDK? Thanks in advance.

nooop3 commented 1 year ago

In addition, the Golang client doesn't have this issue in the same environment.

murgatroid99 commented 1 year ago

We have had similar reports before, and people who have experienced it have frequently reported that a configuration change fixed the problem. For example, https://github.com/grpc/grpc-node/issues/1532#issuecomment-1192260364:

Please check your Istio Sidecar. In my case, the traffic to GRCP client was blocked by the Istio Sidecar.

After I disable the Istio Sidecar with annotation, it's working fine now.

https://github.com/grpc/grpc-node/issues/1532#issuecomment-1524803546:

had the same issue (Error: 13 INTERNAL: Received RST_STREAM with code 0) but it turns out that the service I was calling was not accepting calls from outside the VPC, updated it to accept all traffic and issue was sorted.

https://github.com/grpc/grpc-node/issues/2004#issuecomment-1694188848

Enabling HTTP/2 end-to-end in my deployment settings on Cloud Run fixed the issue for me.

If you extend your log excerpt back a little further to a line that includes subchannel_call | [3256] Received server headers: I think we can get a better idea of what is happening here.

nooop3 commented 1 year ago

Thanks for your reply. I had viewed those comments before:

We have had similar reports before, and people who have experienced it have frequently reported that a configuration change fixed the problem. For example, #1532 (comment):

Please check your Istio Sidecar. In my case, the traffic to GRCP client was blocked by the Istio Sidecar. After I disable the Istio Sidecar with annotation, it's working fine now.

The traffic hasn't been blocked by Istio, the script works well if I disable the NodeJs grpc retry. If I disable the Istio sidecar with the annotations, it's working fine either disable the grpc retry or not. But we need to use the Istio to load balance the grpc requests, so we need to enable the sidecar through the cluster.

#1532 (comment):

had the same issue (Error: 13 INTERNAL: Received RST_STREAM with code 0) but it turns out that the service I was calling was not accepting calls from outside the VPC, updated it to accept all traffic and issue was sorted.

I can reproduce this issue and debug with vscode in the local kubernetes by the orbstack. So the service can accept all traffic from out of the cluster.

#2004 (comment)

Enabling HTTP/2 end-to-end in my deployment settings on Cloud Run fixed the issue for me.

We are running the service in the EKS, and also can reproduce in the local kubernetes environment, we can not find any equal settings to enable the HTTP/2 end-to-end for Istio.

If you extend your log excerpt back a little further to a line that includes subchannel_call | [3256] Received server headers: I think we can get a better idea of what is happening here.

I'll upload more log tomorrow, and share the Wireshark screenshot.

murgatroid99 commented 1 year ago

the script works well if I disable the NodeJs grpc retry

This part really confuses me. I can't think of anything in the retry code that would impact anything else that's happening in this issue. It might help to have more of the logs, back to the one that contains createRetryingCall [1512]. It might also help to have the logs from a call that is dropped for exceeding max_concurrent_streams with retries disabled, for comparison.

The traffic hasn't been blocked by Istio

If Istio is enforcing max_concurrent_streams, then this stream has been blocked by Istio, in the sense that it isn't forwarded to the server. If Istio is responding at the HTTP/2 protocol level instead of the gRPC protocol level, it wouldn't be surprising for it to respond in this way. Maybe there's a configuration to make its responses gRPC protocol-aware.

nooop3 commented 1 year ago

There are logs:

Enable the reries:

grpc-rst-stream.log

Disable the retries

grpc-diable-retries.log

Can we just ignore the NO_ERROR when retry is enabled?

murgatroid99 commented 1 year ago

Your two tests are doing different things. In the "Enable the retries" case, it makes 1000 requests, and in the "Disable the retries" case, it makes 512 requests. And there's nothing different in the response handling between the two logs, rather the second test just never receives the kind of response that triggers the error in the first test. I think the "Disable the retries" test just isn't running into the max concurrent streams limit.

The fact is that there is no reasonable way to handle this kind of response other than with an error. The client receives headers, which indicates that the server at least started to process the request, and the client treats that as a signal that the request can't be retried. And an RST_STREAM without a grpc-status trailer is considered a gRPC protocol error, which is why it's reported as an error. And even if we treated that as a success instead, the request would still fail at a higher layer because it never receives the response message it expects for that request.

nooop3 commented 1 year ago

I reduced the request number to 512 as the log size was too large yesterday. And if I make 512 requests with retry enabled, this issue still exists:

grpc-rst-stream-512.log

To mitigate this issue, we have to pin the version of @grpc/grpc-js to 1.7.3.

murgatroid99 commented 1 year ago

OK, I see what happened: having retries enabled changed the sequencing of some low-level operations, which caused a different outcome with your specific test setup. With retries disabled, each HTTP/2 stream starts and ends in a single synchronous operation, while with retries enabled, starting and ending the stream are separated into two asynchronous operations. The result is that with retries disabled, each stream starts and ends one at a time, while with retries enabled, all of the streams start, and then all of the streams end. This means that in the "retries disabled" case, the remote end can finish processing the first stream before the last stream even starts, while in the "retries enabled" case, the remote end is guaranteed to see all of the streams open at once. So, the primary difference between the observable outcomes in these two tests is that the "retries enabled" test triggers the "max concurrent streams" limit, and the "retries disabled" test does not. If you instead forced the client to reach the "max concurrent streams" limit by opening streaming requests without closing them, you would see the same error in both cases.

The primary problem here, still, is that when the "max concurrent streams" limit is reached, Istio closes the stream in a way that is not valid within the gRPC protocol. Istio is what needs to be changed here. There may be a configuration option to change how it responds in this situation, but if not, this is a bug in Istio.

nooop3 commented 1 year ago

The max_concurrent_streams is the HTTP2 option that is included in the setting frame from the server(or Istio side). When the client receives the max_concurrent_streams setting, the client should follow this setting. Is there any config we can set on the NodeJS side?

murgatroid99 commented 1 year ago

That is a fair point. There should be client-side enforcement of the max concurrent streams limit, but as far as I know, there isn't.

devpira commented 1 year ago

Hello, we were also facing the exact issue with the same setup for some time now. After seeing this issue and pinning it to version 1.7.3, the errors have gone away. Ideally we don't want to keep it pinned 1.7.3. Just wanted to bring it up to see if there is anything that could be done? I can provide details if needed.

murgatroid99 commented 1 year ago

Please share whatever details you have. They can only help understand the problem.

murgatroid99 commented 12 months ago

I think it's possible that in this specific test scenario, the max concurrent streams setting may not be enforced on the client because all of the streams are started before the client receives the server's settings frame. That could be tested by making a request, and then once it succeeds synchronously making a number of requests greater than the max concurrent streams setting.

maylorsan commented 11 months ago

We've been facing this error in our production environment for the past 3 days and it's occurred roughly 15k times: Error: 13 INTERNAL: Received RST_STREAM with code 1

The error is triggered when executing the following code:

const writeResult = await admin
      .firestore()
      .collection(FirestoreCollections.Users)
      .doc(userID)
      .update(fieldsToUpdate); // This line throws the error

We've attempted several methods, but none have resolved the problem:

Switched the update method to use set and set with {merge: true}, but this didn't work.
Created a new cloud function dedicated solely to this method. Surprisingly, it didn't work either. However, it's noteworthy that we have several other cloud functions that update user data using the same method, and they function as expected. What's even more perplexing is that this issue arises in about 60% of our http cloud function calls, and it seems to occur randomly. We haven't been able to identify a consistent pattern.

Interestingly, everything operates flawlessly in our development project. The only difference is that the development project has a smaller User collection.

@murgatroid99 Is any updates on this issue? We using nodejs v16.20.0 and firebase-admin library

DaniyalCuboid19 commented 11 months ago

We've been facing this error in our production environment for the past 3 days and it's occurred roughly 15k times: Error: 13 INTERNAL: Received RST_STREAM with code 1

The error is triggered when executing the following code:
const writeResult = await admin
      .firestore()
      .collection(FirestoreCollections.Users)
      .doc(userID)
      .update(fieldsToUpdate); // This line throws the error
We've attempted several methods, but none have resolved the problem:

Switched the update method to use set and set with {merge: true}, but this didn't work.

Created a new cloud function dedicated solely to this method. Surprisingly, it didn't work either. However, it's noteworthy that we have several other cloud functions that update user data using the same method, and they function as expected. What's even more perplexing is that this issue arises in about 60% of our http cloud function calls, and it seems to occur randomly. We haven't been able to identify a consistent pattern.

Interestingly, everything operates flawlessly in our development project. The only difference is that the development project has a smaller User collection.

@murgatroid99 Is any updates on this issue? We using nodejs v16.20.0 and firebase-admin library

Same problem i am facing here. is there any update ?

murgatroid99 commented 11 months ago

@maylorsan Code 1 is different from code 0. Please file a separate issue.

maylorsan commented 11 months ago

@murgatroid99 Thanks for your assistance! I've documented the steps I took to resolve my issue in this comment. Hopefully, it can be of help to others facing a similar problem.

Cheers! 🚀

murgatroid99 commented 11 months ago

For future reference: the reported "RST_STREAM with code 1" errors appear to have been caused by a server problem. The client was behaving correctly by reporting that error.

ianebot commented 9 months ago

Hi @murgatroid99,

we are facing exactly _Error: 13 INTERNAL: Received RSTSTREAM with code 0 issue randomly. What I can confirm:

Istio 1.20.0 (envoy 580bf9cc561a012d4eb14af48d496d6748307e59/1.28.1-dev/Clean/RELEASE/BoringSSL) enabled, either on the client and/or server.
grpc-js > 1.7.3. (tested with 1.8.0, 1.9.0, 1.9.1, 1.9.2, 1.9.5 and 1.9.11)
our app failed with or without retry mechanism
A test of 100 request with 10 in paralells is enough to display 2-6 failed request with error 13 INTERNAL: Received RST_STREAM with code 0

Workaround:

Option A: Disable Istio (but this is not recommended due to mTLS and also for gRPC load balancing)
Option B: pin grpc-js to v.1.7.3 (but not ideal as this version is dated to 2022 and a bit far from the latest one 1.9.13, specially for sec patches)

Here you have the Envoy(Istio) logs for a connection with issues: gRPC NodeJS client (yyy-frontend-bff) towards gRPC golang server (xxx-backend)

2024-01-04T10:35:58.109302Z     debug   envoy router external/envoy/source/common/router/router.cc:732  [Tags: "ConnectionId":"849","StreamId":"11836221313583801119"] router decoding headers:
':authority', 'app.xxx-backend:6002'
':method', 'POST'
':path', '/api.xxx_backend.ReadDataService/GetDataList'
':scheme', 'http'
'traceparent', '00-00000000000000002d5b1c96203a505c-736baaeffcc52667-01'
'tracestate', 'dd=t.dm:-1;s:1'
'grpc-accept-encoding', 'identity,deflate,gzip'
'accept-encoding', 'identity'
'grpc-timeout', '996m'
'user-agent', 'grpc-node-js/1.9.11'
'content-type', 'application/grpc'
'te', 'trailers'
'x-forwarded-proto', 'http'
'x-request-id', '43c0d747-d726-4098-9676-c3439deddfec'
'x-envoy-decorator-operation', 'app.xxx-backend.svc.cluster.local:6002/*'
'x-envoy-peer-metadata-id', 'sidecar~10.zzz.zz.zz~yyy-frontend-bff-58b98b67b7-vzl4z.yyy-frontend~yyy-frontend.svc.cluster.local'
'x-envoy-expected-rq-timeout-ms', '996'
'x-envoy-attempt-count', '1'
        thread=46

2024-01-04T10:35:58.109362Z     debug   envoy router external/envoy/source/common/router/upstream_request.cc:563        [Tags: "ConnectionId":"849","StreamId":"11836221313583801119"] pool ready       thread=46

2024-01-04T10:35:58.123211Z     debug   envoy router external/envoy/source/common/router/router.cc:1479 [Tags: "ConnectionId":"849","StreamId":"11836221313583801119"] upstream headers complete: end_stream=false      thread=46

2024-01-04T10:35:58.123370Z     debug   envoy http external/envoy/source/common/http/conn_manager_impl.cc:1863  [Tags: "ConnectionId":"849","StreamId":"11836221313583801119"] encoding headers via codec (end_stream=false):
':status', '200'
'content-type', 'application/grpc'
'date', 'Thu, 04 Jan 2024 10:35:57 GMT'
'server', 'envoy'
        thread=46

2024-01-04T10:35:58.123731Z     debug   envoy router external/envoy/source/common/router/upstream_request.cc:470        [Tags: "ConnectionId":"849","StreamId":"11836221313583801119"] resetting pool request   thread=46

2024-01-04T10:35:58.123773Z     debug   envoy http external/envoy/source/common/http/conn_manager_impl.cc:1893  [Tags: "ConnectionId":"849","StreamId":"11836221313583801119"] encoding trailers via codec:
'grpc-status', '0'
'grpc-message', ''
        thread=46

2024-01-04T10:35:58.123813Z     debug   envoy http external/envoy/source/common/http/conn_manager_impl.cc:243   [Tags: "ConnectionId":"849","StreamId":"11836221313583801119"] doEndStream() resetting stream   thread=46

2024-01-04T10:35:58.123832Z     debug   envoy http external/envoy/source/common/http/conn_manager_impl.cc:1932  [Tags: "ConnectionId":"849","StreamId":"11836221313583801119"] stream reset: reset reason: local reset, response details: -     thread=46

Envoy doc: envoy uses nghttp2 and it sticks to the HTTP2 spec: https://www.envoyproxy.io/docs/envoy/v1.28.0/faq/debugging/why_is_envoy_sending_http2_resets

Envoy uses to frame HTTP/2, nghttp2. nghttp2 has extremely good adherence to the HTTP/2 spec, but as many clients are not exactly as compliant, this mismatch can cause unexpected resets.

Envoy code: https://github.com/envoyproxy/envoy/blob/release/v1.28/source/common/router/upstream_request.cc#L470 -> https://github.com/envoyproxy/envoy/blob/release/v1.28/source/common/http/conn_manager_impl.cc#L227 -> https://github.com/envoyproxy/envoy/blob/release/v1.28/source/common/http/conn_manager_impl.cc#L1930 From above 'local reset' as reason -> https://github.com/envoyproxy/envoy/blob/release/v1.28/contrib/generic_proxy/filters/network/source/router/router.h#L33-34 stream was locally reset due to connection termination

So what I observed for a good connection:

Send a H2 HEADER to the gRPC server
Receive ACK for the H2 HEADER
Receive x number of H2 DATA streams

And what I observed for a failed connection:

Send a H2 HEADER to the gRPC server
And before the header is ACK, there is a RST_STREAM stream (likely generated by Envoy)
App is failing with error 500

Also I found an issue at the envoy which could be related: https://github.com/envoyproxy/envoy/issues/30149

I noted that the issues are starting with the grpc-js v.1.8.0.: https://github.com/grpc/grpc-node/releases/tag/%40grpc%2Fgrpc-js%401.8.0

Any possible idea why the grpc-js version v.1.8.0 (with or without retries) could affect Istio(envoy)?

Thank you in advance.

murgatroid99 commented 9 months ago

@ianebot As in https://github.com/grpc/grpc-node/issues/2569#issuecomment-1719872033, I believe a likely explanation is that some internal changes in 1.8 that were part of the retry implementation changed how outgoing network operations were sequenced in concurrent requests, and that this different sequencing results in different behavior from Envoy.

The fundamental problem here is that while Envoy may be compliant with the HTTP/2 spec in this scenario (though it is unclear to me how interrupting a stream without sending the expected response qualifies as success), it is not compliant with the gRPC spec. An HTTP/2 RST_STREAM frame with code 0 indicates that the stream is completing successfully, but gRPC does not consider a stream successful unless it ends with trailers with the header grpc-status 0.

The "local reset" reason from Envoy also seems vague to me. I don't think we can meaningfully debug this further without knowing why exactly Envoy decided to reset the stream.

ianebot commented 9 months ago

Hi @murgatroid99 ,

... As in https://github.com/grpc/grpc-node/issues/2569#issuecomment-1719872033, I believe a likely explanation is that some internal changes in 1.8 that were part of the retry implementation changed how outgoing network operations were sequenced in concurrent requests, and that this different sequencing results in different behavior from Envoy.

What surprise me is that we are not using the retries, any idea why without retries we should face the issues?

The fundamental problem here is that while Envoy may be compliant with the HTTP/2 spec in this scenario (though it is unclear to me how interrupting a stream without sending the expected response qualifies as success), it is not compliant with the gRPC spec. An HTTP/2 RST_STREAM frame with code 0 indicates that the stream is completing successfully, but gRPC does not consider a stream successful unless it ends with trailers with the header grpc-status 0.

From the logs I posted (see above) I can see envoy should actually provide this:

2024-01-04T10:35:58.123773Z     debug   envoy http external/envoy/source/common/http/conn_manager_impl.cc:1893  [Tags: "ConnectionId":"849","StreamId":"11836221313583801119"] encoding trailers via codec:
'grpc-status', '0'
'grpc-message', ''
        thread=46

2024-01-04T10:35:58.123813Z     debug   envoy http external/envoy/source/common/http/conn_manager_impl.cc:243   [Tags: "ConnectionId":"849","StreamId":"11836221313583801119"] doEndStream() resetting stream   thread=46

2024-01-04T10:35:58.123832Z     debug   envoy http external/envoy/source/common/http/conn_manager_impl.cc:1932  [Tags: "ConnectionId":"849","StreamId":"11836221313583801119"] stream reset: reset reason: local reset, response details: -     thread=46

Here you can see more details from the traces (only grpc.client complained):

Server:

component: google.golang.org/grpc
language: go
rpc: grpc:
- full_method: /api.xxx_backend.ReadxxxDataService/GetxxxDataList
- kind: unary
- status_code: OK

Envoy:

component: proxy
language: cpp
node_id: sidecar~10.xxx.xx.104~yyy-frontend-bff~yyy-frontend.svc.cluster.local
grpc:
- content_type: application/grpc
- message: ""
- path: /api.xxx_backend.ReadxxxDataService/GetxxxDataList
- status_code: 0
http:
- host: app.xxx-backend
- method: POST
- path_group: /api.xxx_backend.ReadxxxDataService/GetxxxDataList
- protocol: HTTP/2
- status_code: 200
- useragent: grpc-node-js/1.9.11

Client: http.request:

component: http2
language: javascript
kind: client
http:
- host: app.xxx-backend
- method: POST
- path_group: /api.xxx_backend.ReadxxxDataService/GetxxxDataList
- status_code: 200
- url: http://app.xxx-backend:6002/api.xxx_backend.ReadxxxDataService/GetxxxDataList

Client: grpc.client:

component: grpc
language: javascript
kind: client
error message: 13 INTERNAL: Received RST_STREAM with code 0
rpc: grpc
kind: unary
path: /api.xxx_backend.ReadxxxDataService/GetxxxDataList
status_code: 13
method: GetxxxDataList
service: api.xxx_backend.ReadxxxDataService

murgatroid99 commented 9 months ago

You're right, I didn't notice that part of the Envoy log. So, according to Envoy, it is sending those trailers. Wireshark or something similar could show whether the trailers actually go out on the wire. If they do, then it is a client problem. Then my guess is that either the trailers are getting dropped somehow, or the events are getting reordered so that the client sees the RST_STREAM first. I don't think that reordering can be happening in the gRPC client code, but it could happen in the Node runtime.

What surprise me is that we are not using the retries, any idea why without retries we should face the issues?

I'm not saying that using retries causes this problem, I'm saying that the implementation of retries in 1.8.0 came with some internal changes that impacted how some operations are sequenced, and that that was my best guess for why this started with version 1.8.0.

niloc132 commented 8 months ago

Coming from https://github.com/envoyproxy/envoy/issues/30149, I can confirm that via wireshark, envoy is sometimes not sending the trailers (logs notwithstanding). We made a docker-compose sample with both python and java clients and servers try this out. Notably, we haven't reproduced the bug with the python grpc client, but my wild-guess there is that the cpp internals of the python client are letting envoy send payloads faster, so it doesn't get into this "bad state" where it stops forwarding enqueued messages. Note that the below observations are potentially specific to that linked issue, but it smells a whole lot like this issue, so possibly will help someone else out.

The larger the payload sent, the more likely this bug will happen - somewhere around 4MB is right to usually cause this on localhost. We sent payloads of increasing int32 values so that we could clearly see in wireshark where in the stream envoy stops forwarding data (confirmed by logging data on the client when things break). Note that we observed this with a python server as well, but didn't work too hard to test other clients after we got the wireshark data showing these results.

We have worked around this on the server by never half-closing data-heavy bidi streams until the client itself half-closes. This is naturally backwards incompatible (since an old client might believe it doesn't need to half-close first), but has prevented this issue for us. Server-streaming calls are unaffected, since the client always half-closes as soon as it sends its first message. In theory a client-streaming call could be affected, if the initial/only payload that the server sends was big enough? We didn't investigate further.

ianebot commented 7 months ago

Hi @murgatroid99,

I did some tests regarding your comments:

The fundamental problem here is that while Envoy may be compliant with the HTTP/2 spec in this scenario (though it is unclear to me how interrupting a stream without sending the expected response qualifies as success), it is not compliant with the gRPC spec. An HTTP/2 RST_STREAM frame with code 0 indicates that the stream is completing successfully, but gRPC does not consider a stream successful unless it ends with trailers with the header grpc-status 0. The "local reset" reason from Envoy also seems vague to me. I don't think we can meaningfully debug this further without knowing why exactly Envoy decided to reset the stream.

grpc-js v.1.7.3. with Istio

No issues, trailers were received and stream was closed successfully grpc-js-1-7-3-istio

grpc-js v.1.9.11 with Istio

Server provides all the data and right after that it sends a rst_stream. grpc-js-1-9-11-with-istio This rst_stream is actually triggering the "local reset" at the envoy: https://github.com/envoyproxy/envoy/blob/12e928ce06a8f0bf32b5e27ec1bda2368cddc920/source/common/router/upstream_request.cc#L475

And the client is confirming the rst_stream grpc-js-1-9-11-with-istio-2 My guess grpc-js will report the error 13 internal because of this: https://github.com/grpc/grpc-node/blob/f70c322e57a6aed4fea527d3268f6b736057e0f1/packages/grpc-js/src/subchannel-call.ts#L200

The server is a gRPC Go implementation, so I checked the grpc-go code and I found where it could reset: https://github.com/grpc/grpc-go/blob/76a23bf37a7c07f88ccb735e83b64ba7be4c14f1/internal/transport/http2_server.go#L1090

// Send a RST_STREAM after the trailers if the client has not already half-closed.

Any thought or idea what it could cause the reset?

murgatroid99 commented 7 months ago

@ianebot You mention that

This rst_stream is actually triggering the "local reset" at the envoy: https://github.com/envoyproxy/envoy/blob/12e928ce06a8f0bf32b5e27ec1bda2368cddc920/source/common/router/upstream_request.cc#L475

I assume this means that there is an Envoy proxy in between the client and the server. If I understand your pictures right, you show that the server sends trailers and a RST_STREAM to the Envoy proxy, and that the Envoy proxy sends the RST_STREAM to the client. If the Envoy proxy does not send the trailers, then that is the problem. This would match up with the information in https://github.com/grpc/grpc-node/issues/2569#issuecomment-1899554032, which points out that the existing Envoy bug envoyproxy/envoy#30149 shows a situation in which Envoy does not pass along trailers.

ianebot commented 7 months ago

Hi @murgatroid99,

thanks for a quick reply. Regarding your comments:

I assume this means that there is an Envoy proxy in between the client and the server.

That is correct. I did the tests under the same environment, setup and version of Istio(envoy) for both grpc-js versions 1.7.3 and 1.9.11.

If I understand your pictures right, you show that the server sends trailers and a RST_STREAM to the Envoy proxy, and that the Envoy proxy sends the RST_STREAM to the client. If the Envoy proxy does not send the trailers, then that is the problem.

An interesting fact is that I did also some load testing tests with go gRPC client (ghz), and I did not face any issue even with much higher load and using concurrency (i.e. 10000 req in grpc-go vs 10 req in grpc-js).

Any log suggestion to catch whether or not grpc-js received the trailers from Envoy?
Any log suggestion to display in which case the request status (http2Stream.rstCode) is falling into? _http2.constants.NGHTTP2_NOERROR vs default (https://github.com/grpc/grpc-node/blob/f70c322e57a6aed4fea527d3268f6b736057e0f1/packages/grpc-js/src/subchannel-call.ts#L199)

This would match up with the information in https://github.com/grpc/grpc-node/issues/2569#issuecomment-1899554032, which points out that the existing Envoy bug https://github.com/envoyproxy/envoy/issues/30149 shows a situation in which Envoy does not pass along trailers.

Any suggestion how 1.9.x could lead this random issues with envoy? It may help to understand in which scenario this is causing the issue.

murgatroid99 commented 7 months ago

The error this bug is about means that the client did not receive trailers and that it received an RST_STREAM with code 0 (no error).

Any suggestion how 1.9.x could lead this random issues with envoy? It may help to understand in which scenario this is causing the issue.

I presented a theory in a previous comment https://github.com/grpc/grpc-node/issues/2569#issuecomment-1883544814. The problem could also be related to https://github.com/envoyproxy/envoy/issues/30149 as mentioned in https://github.com/grpc/grpc-node/issues/2569#issuecomment-1899554032. But in general, I don't know why Envoy does what it does; the right people to ask about that are the Envoy maintainers.

yhpark commented 4 months ago

Seems like the issue is widely present on Envoy and Istio, is there any mitigation on the client side until the issue is resolved?

niloc132 commented 4 months ago

@yhpark from what we can tell, the only mitigation possible is for the server to avoid half-closing the stream from its side - just stop sending messages. This may break the semantics of your streams, though it did work for us.

grpc / grpc-node