grpc / grpc-node

gRPC for Node.js
https://grpc.io
Apache License 2.0

Error: 13 INTERNAL: Received RST_STREAM with code 2 triggered by internal client error: read ECONNRESET #1769

Open gcxfd opened 3 years ago

gcxfd commented 3 years ago

I reuse one connection for all requests. Sometimes the connection seems to disconnect after it has been idle for a long time (client error: read ECONNRESET), but the second attempt succeeds.

How can I configure grpc-js to reconnect automatically and avoid this error?

Error: 13 INTERNAL: Received RST_STREAM with code 2 triggered by internal client error: read ECONNRESET
    at Object.callErrorFromStatus (/api/node_modules/@grpc/grpc-js/build/src/call.js:31:26)
    at Object.onReceiveStatus (/api/node_modules/@grpc/grpc-js/build/src/client.js:176:52)
    at Object.onReceiveStatus (/api/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:336:141)
    at Object.onReceiveStatus (/api/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:299:181)
    at /api/node_modules/@grpc/grpc-js/build/src/call-stream.js:145:78
    at processTicksAndRejections (node:internal/process/task_queues:76:11)
 {
  code: 13,
  details: 'Received RST_STREAM with code 2 triggered by internal client error: read ECONNRESET',
  metadata: Metadata { internalRepr: Map(0) {}, options: {} }
}
murgatroid99 commented 3 years ago

What version of the library are you using?

gcxfd commented 3 years ago
    "@grpc/grpc-js": "^1.3.0",
    "google-protobuf": "^3.15.8"

As the screenshot below shows, I saw this error twice, after the connection had been idle for hours.

[screenshot]

murgatroid99 commented 3 years ago

Would you mind running your client with the environment variables GRPC_TRACE=call_stream and GRPC_VERBOSITY=DEBUG, and sharing the output from around when you get that error?
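
For reference, a minimal sketch of enabling that tracing (client.js is a placeholder for your entry point; setting the variables in code should also work as long as it happens before @grpc/grpc-js is first required):

// Either set the variables when launching the process:
//   GRPC_TRACE=call_stream GRPC_VERBOSITY=DEBUG node client.js
// or set them programmatically before grpc-js is loaded (sketch):
process.env.GRPC_TRACE = 'call_stream';
process.env.GRPC_VERBOSITY = 'DEBUG';
const grpc = require('@grpc/grpc-js');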

gcxfd commented 3 years ago
→ POST /xlsx
2021-05-08T07:45:59.154Z | call_stream | [6] Sending metadata
2021-05-08T07:45:59.155Z | call_stream | [6] write() called with message of length 29703
2021-05-08T07:45:59.155Z | call_stream | [6] end() called
2021-05-08T07:45:59.155Z | call_stream | Starting stream on subchannel 172.25.187.79:5301 with headers
                grpc-accept-encoding: identity,deflate,gzip
                accept-encoding: identity
                :authority: 172.25.187.79:5301
                user-agent: grpc-node-js/1.3.0
                content-type: application/grpc
                :method: POST
                :path: /office.Office/diff
                te: trailers

2021-05-08T07:45:59.155Z | call_stream | [6] attachHttp2Stream from subchannel 172.25.187.79:5301
2021-05-08T07:45:59.155Z | call_stream | [6] sending data chunk of length 29708
2021-05-08T07:45:59.156Z | call_stream | [6] calling end() on HTTP/2 stream
2021-05-08T07:45:59.160Z | call_stream | [5] Node error event: message=read ECONNRESET code=ECONNRESET errno=Unknown system error -104 syscall=read
2021-05-08T07:45:59.160Z | call_stream | [6] Node error event: message=read ECONNRESET code=ECONNRESET errno=Unknown system error -104 syscall=read
2021-05-08T07:45:59.161Z | call_stream | [5] HTTP/2 stream closed with code 2
2021-05-08T07:45:59.161Z | call_stream | [5] ended with status: code=13 details="Received RST_STREAM with code 2 triggered by internal client error: read ECONNRESET"
2021-05-08T07:45:59.161Z | call_stream | [6] HTTP/2 stream closed with code 2
2021-05-08T07:45:59.161Z | call_stream | [6] ended with status: code=13 details="Received RST_STREAM with code 2 triggered by internal client error: read ECONNRESET"
Error: 13 INTERNAL: Received RST_STREAM with code 2 triggered by internal client error: read ECONNRESET
    at Object.callErrorFromStatus (/api/node_modules/@grpc/grpc-js/build/src/call.js:31:26)
    at Object.onReceiveStatus (/api/node_modules/@grpc/grpc-js/build/src/client.js:176:52)
    at Object.onReceiveStatus (/api/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:336:141)
    at Object.onReceiveStatus (/api/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:299:181)
    at /api/node_modules/@grpc/grpc-js/build/src/call-stream.js:145:78
    at processTicksAndRejections (node:internal/process/task_queues:78:11)
 {
  code: 13,
  details: 'Received RST_STREAM with code 2 triggered by internal client error: read ECONNRESET',
  metadata: Metadata { internalRepr: Map(0) {}, options: {} }
}
Error: 13 INTERNAL: Received RST_STREAM with code 2 triggered by internal client error: read ECONNRESET
    at Object.callErrorFromStatus (/api/node_modules/@grpc/grpc-js/build/src/call.js:31:26)
    at Object.onReceiveStatus (/api/node_modules/@grpc/grpc-js/build/src/client.js:176:52)
    at Object.onReceiveStatus (/api/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:336:141)
    at Object.onReceiveStatus (/api/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:299:181)
    at /api/node_modules/@grpc/grpc-js/build/src/call-stream.js:145:78
    at processTicksAndRejections (node:internal/process/task_queues:78:11)
 {
  code: 13,
  details: 'Received RST_STREAM with code 2 triggered by internal client error: read ECONNRESET',
  metadata: Metadata { internalRepr: Map(0) {}, options: {} }
}
[object Object]
    at log_err (file:///api/koa.js:77:13)
    at file:///api/koa.js:201:7
    at processTicksAndRejections (node:internal/process/task_queues:96:5)

[object Object]
    at log_err (file:///api/koa.js:77:13)
    at file:///api/koa.js:201:7
    at processTicksAndRejections (node:internal/process/task_queues:96:5)

When I restart the server, the first gRPC call afterwards sometimes throws this error as well.

murgatroid99 commented 3 years ago

Thank you, that information is very helpful.

murgatroid99 commented 3 years ago

In the latest version, those errors should now be reported with the UNAVAILABLE error code.
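
Since the original question was about reconnecting automatically, a minimal retry sketch keyed on that error code may help (callWithRetry and the callback-style method lookup are assumptions for illustration, not part of grpc-js itself):

const grpc = require('@grpc/grpc-js');

// Hypothetical helper: retry a unary call a few times when the channel
// reports UNAVAILABLE (for example after an idle connection was reset);
// the channel reconnects transparently on the next attempt.
function callWithRetry(client, method, request, retries = 3) {
  return new Promise((resolve, reject) => {
    const attempt = remaining => {
      client[method](request, (err, response) => {
        if (err && err.code === grpc.status.UNAVAILABLE && remaining > 0) {
          attempt(remaining - 1);
        } else if (err) {
          reject(err);
        } else {
          resolve(response);
        }
      });
    };
    attempt(retries);
  });
}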

PatrikHysky commented 3 years ago

I think I'm having the same issue. Everything works OK as long as the connection is active. But after the connection's been idle for a while, all new requests hang for a long time (10+ minutes). Eventually this error is thrown:

DominoDbError: gRPC Client Error: INTERNAL Received RST_STREAM with code 2 triggered by internal client error: read ETIMEDOUT
    at new DominoDbError (/app/node_modules/@domino/domino-db/src/domino-db-error.js:6:16)
    at wrapError (/app/node_modules/@domino/domino-db/src/requests/grpc/utils/grpc-helpers.js:169:10)
    at Object.callback (/app/node_modules/@domino/domino-db/src/requests/grpc/utils/bulk-document.js:144:16)
    at Object.onReceiveStatus (/app/node_modules/@grpc/grpc-js/build/src/client.js:179:36)
    at Object.onReceiveStatus (/app/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:336:141)
    at Object.onReceiveStatus (/app/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:299:181)
    at /app/node_modules/@grpc/grpc-js/build/src/call-stream.js:145:78
    at processTicksAndRejections (internal/process/task_queues.js:77:11)

Using @grpc/grpc-js version 1.3.5. Does anyone know which older version (if any) behaves correctly without hanging for many minutes?

murgatroid99 commented 3 years ago

That hanging behavior seems different from other reports of similar problems, and the ETIMEDOUT error code is new. Can you run your client with the environment variables GRPC_TRACE=all and GRPC_VERBOSITY=DEBUG and share the log output of a client that encounters that error?

PatrikHysky commented 3 years ago

Here is the detailed log.

  1. The request timestamped 05:50:48 completed normally
  2. The next request timestamped 06:27:41 returned no response and left the client hanging indefinitely. No error was thrown and the application effectively froze for over 16 minutes.
  3. Finally at 06:44:35 the code threw an error in the console and the application came back to life.
  4. The application started working normally again, processing hundreds of requests, until the connection went idle and the cycle repeated.
2021-07-17T05:50:48.164Z | channel | dns:9.12.83.179:3003 createCall [3] method="/domino.Server/BulkNote", deadline=Infinity
2021-07-17T05:50:48.164Z | call_stream | [3] Sending metadata
2021-07-17T05:50:48.164Z | channel | Pick result: COMPLETE subchannel: 9.12.83.179:3003 status: undefined undefined
2021-07-17T05:50:48.165Z | call_stream | [3] write() called with message of length 159
2021-07-17T05:50:48.165Z | call_stream | [3] end() called
2021-07-17T05:50:48.165Z | call_stream | Starting stream on subchannel 9.12.83.179:3003 with headers
        proto-sqn: 15
        grpc-accept-encoding: identity,deflate,gzip
        accept-encoding: identity
        :authority: 9.12.83.179:3003
        user-agent: grpc-node-js/1.2.12
        content-type: application/grpc
        :method: POST
        :path: /domino.Server/BulkNote
        te: trailers

2021-07-17T05:50:48.165Z | call_stream | [3] attachHttp2Stream from subchannel 9.12.83.179:3003
2021-07-17T05:50:48.166Z | subchannel_refcount | 9.12.83.179:3003 callRefcount 0 -> 1
2021-07-17T05:50:48.166Z | call_stream | [3] sending data chunk of length 164
2021-07-17T05:50:48.166Z | call_stream | [3] calling end() on HTTP/2 stream
2021-07-17T05:50:48.254Z | call_stream | [3] Received server headers:
        :status: 200
        proto-sqn: 15
        content-type: application/grpc
        grpc-accept-encoding: identity,deflate,gzip
        accept-encoding: identity,gzip

2021-07-17T05:50:48.254Z | call_stream | [3] receive HTTP/2 data frame of length 616
2021-07-17T05:50:48.255Z | call_stream | [3] parsed message of length 616
2021-07-17T05:50:48.255Z | call_stream | [3] filterReceivedMessage of length 616
2021-07-17T05:50:48.255Z | call_stream | [3] pushing to reader message of length 611
2021-07-17T05:50:48.257Z | call_stream | [3] Received server trailers:
        grpc-status: 0

2021-07-17T05:50:48.257Z | call_stream | [3] received status code 0 from server
2021-07-17T05:50:48.257Z | call_stream | [3] ended with status: code=0 details=""
2021-07-17T05:50:48.257Z | call_stream | [3] close http2 stream with code 0
2021-07-17T05:50:48.258Z | subchannel_refcount | 9.12.83.179:3003 callRefcount 1 -> 0
2021-07-17T05:50:48.258Z | call_stream | [3] HTTP/2 stream closed with code 0
getServerSideProps redp5386.html
2021-07-17T06:27:41.273Z | resolving_load_balancer | dns:9.12.83.179:3003 IDLE -> IDLE
2021-07-17T06:27:41.273Z | connectivity_state | dns:9.12.83.179:3003 IDLE -> IDLE
2021-07-17T06:27:41.273Z | dns_resolver | Resolver constructed for target dns:9.12.83.179:3003
2021-07-17T06:27:41.274Z | channel | dns:9.12.83.179:3003 createCall [4] method="/domino.Server/BulkNote", deadline=Infinity
2021-07-17T06:27:41.274Z | call_stream | [4] Sending metadata
2021-07-17T06:27:41.274Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2021-07-17T06:27:41.278Z | call_stream | [4] write() called with message of length 478
2021-07-17T06:27:41.278Z | call_stream | [4] end() called
2021-07-17T06:27:41.278Z | call_stream | [4] deferring writing data chunk of length 483
2021-07-17T06:27:41.279Z | dns_resolver | Resolution update requested for target dns:9.12.83.179:3003
2021-07-17T06:27:41.279Z | dns_resolver | Returning IP address for target dns:9.12.83.179:3003
2021-07-17T06:27:41.279Z | resolving_load_balancer | dns:9.12.83.179:3003 IDLE -> CONNECTING
2021-07-17T06:27:41.279Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2021-07-17T06:27:41.279Z | connectivity_state | dns:9.12.83.179:3003 IDLE -> CONNECTING
2021-07-17T06:27:41.279Z | resolving_load_balancer | dns:9.12.83.179:3003 CONNECTING -> CONNECTING
2021-07-17T06:27:41.279Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2021-07-17T06:27:41.279Z | connectivity_state | dns:9.12.83.179:3003 CONNECTING -> CONNECTING
2021-07-17T06:27:41.280Z | pick_first | Connect to address list 9.12.83.179:3003
2021-07-17T06:27:41.280Z | subchannel_refcount | 9.12.83.179:3003 refcount 3 -> 4
2021-07-17T06:27:41.280Z | pick_first | Pick subchannel with address 9.12.83.179:3003
2021-07-17T06:27:41.280Z | pick_first | IDLE -> READY
2021-07-17T06:27:41.280Z | resolving_load_balancer | dns:9.12.83.179:3003 CONNECTING -> READY
2021-07-17T06:27:41.280Z | channel | Pick result: COMPLETE subchannel: 9.12.83.179:3003 status: undefined undefined
2021-07-17T06:27:41.280Z | connectivity_state | dns:9.12.83.179:3003 CONNECTING -> READY
2021-07-17T06:27:41.280Z | subchannel_refcount | 9.12.83.179:3003 refcount 4 -> 5
2021-07-17T06:27:41.280Z | subchannel_refcount | 9.12.83.179:3003 refcount 5 -> 4
2021-07-17T06:27:41.281Z | call_stream | Starting stream on subchannel 9.12.83.179:3003 with headers
        proto-sqn: 15
        grpc-accept-encoding: identity,deflate,gzip
        accept-encoding: identity
        :authority: 9.12.83.179:3003
        user-agent: grpc-node-js/1.2.12
        content-type: application/grpc
        :method: POST
        :path: /domino.Server/BulkNote
        te: trailers

2021-07-17T06:27:41.281Z | call_stream | [4] attachHttp2Stream from subchannel 9.12.83.179:3003
2021-07-17T06:27:41.281Z | subchannel_refcount | 9.12.83.179:3003 callRefcount 0 -> 1
2021-07-17T06:27:41.282Z | call_stream | [4] sending data chunk of length 483 (deferred)
2021-07-17T06:27:41.282Z | call_stream | [4] calling end() on HTTP/2 stream
2021-07-17T06:44:35.889Z | call_stream | [4] Node error event: message=read ETIMEDOUT code=ETIMEDOUT errno=Unknown system error -110 syscall=read
2021-07-17T06:44:35.889Z | subchannel | 9.12.83.179:3003 connection closed with error read ETIMEDOUT
2021-07-17T06:44:35.890Z | subchannel | 9.12.83.179:3003 READY -> IDLE
2021-07-17T06:44:35.890Z | subchannel_refcount | 9.12.83.179:3003 refcount 4 -> 3
2021-07-17T06:44:35.890Z | pick_first | READY -> IDLE
2021-07-17T06:44:35.890Z | resolving_load_balancer | dns:9.12.83.179:3003 READY -> IDLE
2021-07-17T06:44:35.890Z | connectivity_state | dns:9.12.83.179:3003 READY -> IDLE
2021-07-17T06:44:35.890Z | subchannel_refcount | 9.12.83.179:3003 refcount 3 -> 2
2021-07-17T06:44:35.890Z | pick_first | READY -> IDLE
2021-07-17T06:44:35.890Z | resolving_load_balancer | dns:9.12.83.179:3003 READY -> IDLE
2021-07-17T06:44:35.890Z | connectivity_state | dns:9.12.83.179:3003 READY -> IDLE
2021-07-17T06:44:35.890Z | subchannel_refcount | 9.12.83.179:3003 refcount 2 -> 1
2021-07-17T06:44:35.890Z | pick_first | READY -> IDLE
2021-07-17T06:44:35.890Z | resolving_load_balancer | dns:9.12.83.179:3003 READY -> IDLE
2021-07-17T06:44:35.890Z | connectivity_state | dns:9.12.83.179:3003 READY -> IDLE
2021-07-17T06:44:35.891Z | call_stream | [4] HTTP/2 stream closed with code 2
2021-07-17T06:44:35.891Z | call_stream | [4] ended with status: code=13 details="Received RST_STREAM with code 2 triggered by internal client error: read ETIMEDOUT"
2021-07-17T06:44:35.891Z | subchannel_refcount | 9.12.83.179:3003 callRefcount 1 -> 0
caught bulkRead await error: DominoDbError: gRPC Client Error: INTERNAL Received RST_STREAM with code 2 triggered by internal client error: read ETIMEDOUT
caught bulkRead error stack: Error
    at new DominoDbError (/app/node_modules/@domino/domino-db/src/domino-db-error.js:6:16)
    at wrapError (/app/node_modules/@domino/domino-db/src/requests/grpc/utils/grpc-helpers.js:169:10)
    at Object.callback (/app/node_modules/@domino/domino-db/src/requests/grpc/utils/bulk-document.js:144:16)
    at Object.onReceiveStatus (/app/node_modules/@grpc/grpc-js/build/src/client.js:176:36)
    at Object.onReceiveStatus (/app/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:336:141)
    at Object.onReceiveStatus (/app/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:299:181)
    at /app/node_modules/@grpc/grpc-js/build/src/call-stream.js:145:78
    at processTicksAndRejections (internal/process/task_queues.js:77:11)
2021-07-17T06:44:45.075Z | subchannel_refcount | 9.12.83.179:3003 refcount 1 -> 0
2021-07-17T06:44:45.075Z | subchannel | 9.12.83.179:3003 IDLE -> TRANSIENT_FAILURE
2021-07-17T06:44:45.075Z | subchannel | 9.12.83.179:3003 TRANSIENT_FAILURE -> IDLE
murgatroid99 commented 2 years ago

First, I want to note that the user-agent header in the logs shows that you are actually using grpc-js version 1.2.12, not 1.3.5. I can't think of any change in between those that would impact your issue, but it would still probably simplify things if you update.

Are you running that code on a serverless platform? That would explain this behavior, but it will make it tricky to address the problem.

If you're not on serverless, you could also enable keepalives. You can do that by setting the channel option grpc.keepalive_time_ms. That controls the frequency with which the client sends pings to keep the connection alive. You would also want to enable the grpc.keepalive_permit_without_calls option by setting it to 1.
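
For illustration, a minimal sketch of passing those channel options (the address and the 60-second value are placeholders, not recommendations from this thread):

const grpc = require('@grpc/grpc-js');

// Channel options go in the third constructor argument; generated service
// clients accept the same argument.
const client = new grpc.Client(
  'my.grpc.server:5301',
  grpc.credentials.createInsecure(),
  {
    'grpc.keepalive_time_ms': 60000,          // send a keepalive ping every 60 s
    'grpc.keepalive_permit_without_calls': 1, // keep pinging while no call is active
  }
);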

PatrikHysky commented 2 years ago

Thanks for the update. Yes, after I posted the log I realized I had already downgraded to grpc-js 1.2.12. I have tried various versions from 1.2.1 through 1.3.5, and the behavior is exactly the same in all of them.

This is a Node.js application that uses the @domino/domino-db package by HCL to connect to a Domino database; grpc-js is a dependency of that package. Can I enable the keepalives via NODE_OPTIONS? Perhaps I need to contact the HCL team directly.

murgatroid99 commented 2 years ago

I'm not surprised that downgrading the dependency didn't help. I don't see any reason to think that this is a regression. This is just a fundamental problem where the client thinks the connection is still good until it tries to make a request and doesn't get a response.

These options are per-channel, so you can't set them with an environment variable. You probably will have to contact that other team so that they can allow you to set that option.

pROFESOR11 commented 2 years ago

We have been facing the same issue quite often while using Firestore, and the severity is very high because it usually ends in data loss or unacceptable latency. You can find detailed logs in googleapis/nodejs-firestore#1023 if it would help: https://github.com/googleapis/nodejs-firestore/issues/1023#issuecomment-894821062

Pixelatex commented 2 years ago

I was also getting this issue, and in my case it seemed to be a potential memory leak. I optimized my code so that less data is kept in memory, and the errors seem to stay away for now.

pROFESOR11 commented 2 years ago

@Pixelatex Thanks for sharing. Did you get any error messages or other indication that the root cause was memory-related when you were hitting this issue?

pROFESOR11 commented 2 years ago

Hi @murgatroid99, we noticed that @grpc/grpc-js 1.3.7 includes improvements to the handling of closed streams. Do you think upgrading to 1.3.7 will solve this issue as well?

murgatroid99 commented 2 years ago

The change I made in 1.3.7 just changes the error code that that particular error is surfaced with. It won't change the latency you're seeing because that's a much lower-level issue, where the client just isn't getting responses from the remote end, and it's waiting for the TCP timeout.

pROFESOR11 commented 2 years ago

@murgatroid99 Thanks for your comment. As I mentioned above, we're getting this error with Firestore and Firebase Cloud Functions. Do you think the root cause is something Firebase should fix, a grpc-js issue that might be fixed in an upcoming version, or something we should fix in our client / Cloud Functions code? This issue has become a blocker for us, so any help would be much appreciated. Thanks.

roscioli commented 2 years ago

If you're not on serverless, you could also enable keepalives. You can do that by setting the channel option grpc.keepalive_time_ms. That controls the frequency with which the client sends pings to keep the connection alive. You would also want to enable the grpc.keepalive_permit_without_calls option by setting it to 1.

@murgatroid99 what do you recommend setting grpc.keepalive_time_ms to?

This same issue is happening with us running Node v12.14 and grpc-node-js/1.3.7 :/

murgatroid99 commented 2 years ago

@pROFESOR11 This problem is likely a bad interaction between how gRPC handles connections and how Firebase Cloud Functions networking works. I don't expect to have a fix in the near future, but I can recommend the following to mitigate the problem:

It seems that people commonly set grpc.keepalive_time_ms to 10000 (10 seconds) and grpc.keepalive_timeout_ms to 5000 (5 seconds), but the best values may change depending on the server you are talking to. For this mitigation to work optimally, the timeout you use to set the deadline should be at least slightly greater than the sum of those two values. One thing to watch for is if you start getting log lines that say "Connection to [address] rejected by server because of excess pings", that means that your setting for grpc.keepalive_time_ms is too low.
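
A sketch of those suggested values, plus a per-call deadline slightly larger than their sum (the address, the insecure credentials, and the 20-second deadline are placeholders):

const grpc = require('@grpc/grpc-js');

// 10 s between keepalive pings, 5 s to wait for each ping response.
const options = {
  'grpc.keepalive_time_ms': 10000,
  'grpc.keepalive_timeout_ms': 5000,
  'grpc.keepalive_permit_without_calls': 1,
};
const client = new grpc.Client('my.grpc.server:443', grpc.credentials.createInsecure(), options);

// Per-call deadline comfortably above 10 s + 5 s; generated clients accept it
// as a call option, e.g. someClient.someMethod(request, { deadline }, callback).
const deadline = new Date(Date.now() + 20000);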

hfossli-agens commented 2 years ago

I had the exact same issue.

D 2022-05-25T09:19:37.476Z | call_stream | [14] ended with status: code=13 details="Received RST_STREAM with code 2 triggered by internal client error: Protocol error" 

I didn't think the contents of the .env file were deployed when I deployed, but it turns out Firebase picks them up:

FIRESTORE_EMULATOR_HOST="localhost:8080"
PUBSUB_EMULATOR_HOST="localhost:8085"
PUBSUB_PROJECT_ID="myrandomproject-local"

😱 These env vars certainly shouldn't have been deployed, and the error makes sense: they are why the Functions timed out when trying to connect to Firestore in my case. It is interesting that the calls time out rather than failing promptly.
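
For anyone hitting the same trap, a defensive sketch (the NODE_ENV check is an assumption about the deploy setup; the variable names match the .env contents above):

// Drop emulator settings outside local development so the firebase-admin SDK
// talks to the real services instead of localhost.
if (process.env.NODE_ENV === 'production') {
  delete process.env.FIRESTORE_EMULATOR_HOST;
  delete process.env.PUBSUB_EMULATOR_HOST;
}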

johnnyoshika commented 10 months ago

I had the same issue as @hfossli-agens. In my Google Cloud Run service, process.env.NODE_ENV wasn't being set to production, so the emulator env variables were being applied and the firebase-admin SDK couldn't connect to the cloud services. 🤦‍♂️

BrodaNoel commented 2 months ago

Same thing here... const isDev = process.env.ENVIRONMENT === 'development'; was not being set in the local environment, so it was tracking the errors from dev locally.