gcxfd opened this issue 3 years ago
What version of the library are you using?
"@grpc/grpc-js": "^1.3.0",
"google-protobuf": "^3.15.8"
As shown in the screenshot below, I saw this error twice, when the connection had been idle for hours.
Would you mind running your client with the environment variables GRPC_TRACE=call_stream and GRPC_VERBOSITY=DEBUG, and sharing the output from around when you get that error?
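For reference, a minimal sketch of one way to set those variables; they are normally just set in the shell that launches the client, and setting them from code is assumed to work only when it happens before grpc-js is first loaded:

// Normally set in the shell that starts the client, e.g.:
//   GRPC_TRACE=call_stream GRPC_VERBOSITY=DEBUG node client.js
// Setting them from code only takes effect if it runs before
// @grpc/grpc-js is first require()d, since the library reads them at load time.
process.env.GRPC_VERBOSITY = 'DEBUG';
process.env.GRPC_TRACE = 'call_stream';
const grpc = require('@grpc/grpc-js');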
→ POST /xlsx
2021-05-08T07:45:59.154Z | call_stream | [6] Sending metadata
2021-05-08T07:45:59.155Z | call_stream | [6] write() called with message of length 29703
2021-05-08T07:45:59.155Z | call_stream | [6] end() called
2021-05-08T07:45:59.155Z | call_stream | Starting stream on subchannel 172.25.187.79:5301 with headers
grpc-accept-encoding: identity,deflate,gzip
accept-encoding: identity
:authority: 172.25.187.79:5301
user-agent: grpc-node-js/1.3.0
content-type: application/grpc
:method: POST
:path: /office.Office/diff
te: trailers
2021-05-08T07:45:59.155Z | call_stream | [6] attachHttp2Stream from subchannel 172.25.187.79:5301
2021-05-08T07:45:59.155Z | call_stream | [6] sending data chunk of length 29708
2021-05-08T07:45:59.156Z | call_stream | [6] calling end() on HTTP/2 stream
2021-05-08T07:45:59.160Z | call_stream | [5] Node error event: message=read ECONNRESET code=ECONNRESET errno=Unknown system error -104 syscall=read
2021-05-08T07:45:59.160Z | call_stream | [6] Node error event: message=read ECONNRESET code=ECONNRESET errno=Unknown system error -104 syscall=read
2021-05-08T07:45:59.161Z | call_stream | [5] HTTP/2 stream closed with code 2
2021-05-08T07:45:59.161Z | call_stream | [5] ended with status: code=13 details="Received RST_STREAM with code 2 triggered by internal client error: read ECONNRESET"
2021-05-08T07:45:59.161Z | call_stream | [6] HTTP/2 stream closed with code 2
2021-05-08T07:45:59.161Z | call_stream | [6] ended with status: code=13 details="Received RST_STREAM with code 2 triggered by internal client error: read ECONNRESET"
Error: 13 INTERNAL: Received RST_STREAM with code 2 triggered by internal client error: read ECONNRESET
at Object.callErrorFromStatus (/api/node_modules/@grpc/grpc-js/build/src/call.js:31:26)
at Object.onReceiveStatus (/api/node_modules/@grpc/grpc-js/build/src/client.js:176:52)
at Object.onReceiveStatus (/api/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:336:141)
at Object.onReceiveStatus (/api/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:299:181)
at /api/node_modules/@grpc/grpc-js/build/src/call-stream.js:145:78
at processTicksAndRejections (node:internal/process/task_queues:78:11)
{
code: 13,
details: 'Received RST_STREAM with code 2 triggered by internal client error: read ECONNRESET',
metadata: Metadata { internalRepr: Map(0) {}, options: {} }
}
Error: 13 INTERNAL: Received RST_STREAM with code 2 triggered by internal client error: read ECONNRESET
at Object.callErrorFromStatus (/api/node_modules/@grpc/grpc-js/build/src/call.js:31:26)
at Object.onReceiveStatus (/api/node_modules/@grpc/grpc-js/build/src/client.js:176:52)
at Object.onReceiveStatus (/api/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:336:141)
at Object.onReceiveStatus (/api/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:299:181)
at /api/node_modules/@grpc/grpc-js/build/src/call-stream.js:145:78
at processTicksAndRejections (node:internal/process/task_queues:78:11)
{
code: 13,
details: 'Received RST_STREAM with code 2 triggered by internal client error: read ECONNRESET',
metadata: Metadata { internalRepr: Map(0) {}, options: {} }
}
[object Object]
at log_err (file:///api/koa.js:77:13)
at file:///api/koa.js:201:7
at processTicksAndRejections (node:internal/process/task_queues:96:5)
[object Object]
at log_err (file:///api/koa.js:77:13)
at file:///api/koa.js:201:7
at processTicksAndRejections (node:internal/process/task_queues:96:5)
After I restart the server, the first gRPC request sometimes throws this error.
Thank you, that information is very helpful. In the latest version, those errors should now be reported with the UNAVAILABLE error code.
I think I'm having the same issue. Everything works OK as long as the connection is active. But after the connection's been idle for a while, all new requests hang for a long time (10+ minutes). Eventually this error is thrown:
DominoDbError: gRPC Client Error: INTERNAL Received RST_STREAM with code 2 triggered by internal client error: read ETIMEDOUT
at new DominoDbError (/app/node_modules/@domino/domino-db/src/domino-db-error.js:6:16)
at wrapError (/app/node_modules/@domino/domino-db/src/requests/grpc/utils/grpc-helpers.js:169:10)
at Object.callback (/app/node_modules/@domino/domino-db/src/requests/grpc/utils/bulk-document.js:144:16)
at Object.onReceiveStatus (/app/node_modules/@grpc/grpc-js/build/src/client.js:179:36)
at Object.onReceiveStatus (/app/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:336:141)
at Object.onReceiveStatus (/app/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:299:181)
at /app/node_modules/@grpc/grpc-js/build/src/call-stream.js:145:78
at processTicksAndRejections (internal/process/task_queues.js:77:11)
Using @grpc/grpc-js version 1.3.5. Does anyone know which older version (if any) behaves correctly without hanging for many minutes?
That hanging behavior seems different from other reports of similar problems, and the ETIMEDOUT error code is new. Can you run your client with the environment variables GRPC_TRACE=all and GRPC_VERBOSITY=DEBUG and share the log output of a client that encounters that error?
Here is the detailed log.
2021-07-17T05:50:48.164Z | channel | dns:9.12.83.179:3003 createCall [3] method="/domino.Server/BulkNote", deadline=Infinity
2021-07-17T05:50:48.164Z | call_stream | [3] Sending metadata
2021-07-17T05:50:48.164Z | channel | Pick result: COMPLETE subchannel: 9.12.83.179:3003 status: undefined undefined
2021-07-17T05:50:48.165Z | call_stream | [3] write() called with message of length 159
2021-07-17T05:50:48.165Z | call_stream | [3] end() called
2021-07-17T05:50:48.165Z | call_stream | Starting stream on subchannel 9.12.83.179:3003 with headers
proto-sqn: 15
grpc-accept-encoding: identity,deflate,gzip
accept-encoding: identity
:authority: 9.12.83.179:3003
user-agent: grpc-node-js/1.2.12
content-type: application/grpc
:method: POST
:path: /domino.Server/BulkNote
te: trailers
2021-07-17T05:50:48.165Z | call_stream | [3] attachHttp2Stream from subchannel 9.12.83.179:3003
2021-07-17T05:50:48.166Z | subchannel_refcount | 9.12.83.179:3003 callRefcount 0 -> 1
2021-07-17T05:50:48.166Z | call_stream | [3] sending data chunk of length 164
2021-07-17T05:50:48.166Z | call_stream | [3] calling end() on HTTP/2 stream
2021-07-17T05:50:48.254Z | call_stream | [3] Received server headers:
:status: 200
proto-sqn: 15
content-type: application/grpc
grpc-accept-encoding: identity,deflate,gzip
accept-encoding: identity,gzip
2021-07-17T05:50:48.254Z | call_stream | [3] receive HTTP/2 data frame of length 616
2021-07-17T05:50:48.255Z | call_stream | [3] parsed message of length 616
2021-07-17T05:50:48.255Z | call_stream | [3] filterReceivedMessage of length 616
2021-07-17T05:50:48.255Z | call_stream | [3] pushing to reader message of length 611
2021-07-17T05:50:48.257Z | call_stream | [3] Received server trailers:
grpc-status: 0
2021-07-17T05:50:48.257Z | call_stream | [3] received status code 0 from server
2021-07-17T05:50:48.257Z | call_stream | [3] ended with status: code=0 details=""
2021-07-17T05:50:48.257Z | call_stream | [3] close http2 stream with code 0
2021-07-17T05:50:48.258Z | subchannel_refcount | 9.12.83.179:3003 callRefcount 1 -> 0
2021-07-17T05:50:48.258Z | call_stream | [3] HTTP/2 stream closed with code 0
getServerSideProps redp5386.html
2021-07-17T06:27:41.273Z | resolving_load_balancer | dns:9.12.83.179:3003 IDLE -> IDLE
2021-07-17T06:27:41.273Z | connectivity_state | dns:9.12.83.179:3003 IDLE -> IDLE
2021-07-17T06:27:41.273Z | dns_resolver | Resolver constructed for target dns:9.12.83.179:3003
2021-07-17T06:27:41.274Z | channel | dns:9.12.83.179:3003 createCall [4] method="/domino.Server/BulkNote", deadline=Infinity
2021-07-17T06:27:41.274Z | call_stream | [4] Sending metadata
2021-07-17T06:27:41.274Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2021-07-17T06:27:41.278Z | call_stream | [4] write() called with message of length 478
2021-07-17T06:27:41.278Z | call_stream | [4] end() called
2021-07-17T06:27:41.278Z | call_stream | [4] deferring writing data chunk of length 483
2021-07-17T06:27:41.279Z | dns_resolver | Resolution update requested for target dns:9.12.83.179:3003
2021-07-17T06:27:41.279Z | dns_resolver | Returning IP address for target dns:9.12.83.179:3003
2021-07-17T06:27:41.279Z | resolving_load_balancer | dns:9.12.83.179:3003 IDLE -> CONNECTING
2021-07-17T06:27:41.279Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2021-07-17T06:27:41.279Z | connectivity_state | dns:9.12.83.179:3003 IDLE -> CONNECTING
2021-07-17T06:27:41.279Z | resolving_load_balancer | dns:9.12.83.179:3003 CONNECTING -> CONNECTING
2021-07-17T06:27:41.279Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2021-07-17T06:27:41.279Z | connectivity_state | dns:9.12.83.179:3003 CONNECTING -> CONNECTING
2021-07-17T06:27:41.280Z | pick_first | Connect to address list 9.12.83.179:3003
2021-07-17T06:27:41.280Z | subchannel_refcount | 9.12.83.179:3003 refcount 3 -> 4
2021-07-17T06:27:41.280Z | pick_first | Pick subchannel with address 9.12.83.179:3003
2021-07-17T06:27:41.280Z | pick_first | IDLE -> READY
2021-07-17T06:27:41.280Z | resolving_load_balancer | dns:9.12.83.179:3003 CONNECTING -> READY
2021-07-17T06:27:41.280Z | channel | Pick result: COMPLETE subchannel: 9.12.83.179:3003 status: undefined undefined
2021-07-17T06:27:41.280Z | connectivity_state | dns:9.12.83.179:3003 CONNECTING -> READY
2021-07-17T06:27:41.280Z | subchannel_refcount | 9.12.83.179:3003 refcount 4 -> 5
2021-07-17T06:27:41.280Z | subchannel_refcount | 9.12.83.179:3003 refcount 5 -> 4
2021-07-17T06:27:41.281Z | call_stream | Starting stream on subchannel 9.12.83.179:3003 with headers
proto-sqn: 15
grpc-accept-encoding: identity,deflate,gzip
accept-encoding: identity
:authority: 9.12.83.179:3003
user-agent: grpc-node-js/1.2.12
content-type: application/grpc
:method: POST
:path: /domino.Server/BulkNote
te: trailers
2021-07-17T06:27:41.281Z | call_stream | [4] attachHttp2Stream from subchannel 9.12.83.179:3003
2021-07-17T06:27:41.281Z | subchannel_refcount | 9.12.83.179:3003 callRefcount 0 -> 1
2021-07-17T06:27:41.282Z | call_stream | [4] sending data chunk of length 483 (deferred)
2021-07-17T06:27:41.282Z | call_stream | [4] calling end() on HTTP/2 stream
2021-07-17T06:44:35.889Z | call_stream | [4] Node error event: message=read ETIMEDOUT code=ETIMEDOUT errno=Unknown system error -110 syscall=read
2021-07-17T06:44:35.889Z | subchannel | 9.12.83.179:3003 connection closed with error read ETIMEDOUT
2021-07-17T06:44:35.890Z | subchannel | 9.12.83.179:3003 READY -> IDLE
2021-07-17T06:44:35.890Z | subchannel_refcount | 9.12.83.179:3003 refcount 4 -> 3
2021-07-17T06:44:35.890Z | pick_first | READY -> IDLE
2021-07-17T06:44:35.890Z | resolving_load_balancer | dns:9.12.83.179:3003 READY -> IDLE
2021-07-17T06:44:35.890Z | connectivity_state | dns:9.12.83.179:3003 READY -> IDLE
2021-07-17T06:44:35.890Z | subchannel_refcount | 9.12.83.179:3003 refcount 3 -> 2
2021-07-17T06:44:35.890Z | pick_first | READY -> IDLE
2021-07-17T06:44:35.890Z | resolving_load_balancer | dns:9.12.83.179:3003 READY -> IDLE
2021-07-17T06:44:35.890Z | connectivity_state | dns:9.12.83.179:3003 READY -> IDLE
2021-07-17T06:44:35.890Z | subchannel_refcount | 9.12.83.179:3003 refcount 2 -> 1
2021-07-17T06:44:35.890Z | pick_first | READY -> IDLE
2021-07-17T06:44:35.890Z | resolving_load_balancer | dns:9.12.83.179:3003 READY -> IDLE
2021-07-17T06:44:35.890Z | connectivity_state | dns:9.12.83.179:3003 READY -> IDLE
2021-07-17T06:44:35.891Z | call_stream | [4] HTTP/2 stream closed with code 2
2021-07-17T06:44:35.891Z | call_stream | [4] ended with status: code=13 details="Received RST_STREAM with code 2 triggered by internal client error: read ETIMEDOUT"
2021-07-17T06:44:35.891Z | subchannel_refcount | 9.12.83.179:3003 callRefcount 1 -> 0
caught bulkRead await error: DominoDbError: gRPC Client Error: INTERNAL Received RST_STREAM with code 2 triggered by internal client error: read ETIMEDOUT
caught bulkRead error stack: Error
at new DominoDbError (/app/node_modules/@domino/domino-db/src/domino-db-error.js:6:16)
at wrapError (/app/node_modules/@domino/domino-db/src/requests/grpc/utils/grpc-helpers.js:169:10)
at Object.callback (/app/node_modules/@domino/domino-db/src/requests/grpc/utils/bulk-document.js:144:16)
at Object.onReceiveStatus (/app/node_modules/@grpc/grpc-js/build/src/client.js:176:36)
at Object.onReceiveStatus (/app/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:336:141)
at Object.onReceiveStatus (/app/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:299:181)
at /app/node_modules/@grpc/grpc-js/build/src/call-stream.js:145:78
at processTicksAndRejections (internal/process/task_queues.js:77:11)
2021-07-17T06:44:45.075Z | subchannel_refcount | 9.12.83.179:3003 refcount 1 -> 0
2021-07-17T06:44:45.075Z | subchannel | 9.12.83.179:3003 IDLE -> TRANSIENT_FAILURE
2021-07-17T06:44:45.075Z | subchannel | 9.12.83.179:3003 TRANSIENT_FAILURE -> IDLE
First, I want to note that the user-agent header in the logs shows that you are actually using grpc-js version 1.2.12, not 1.3.5. I can't think of any change in between those that would impact your issue, but it would still probably simplify things if you update.
Are you running that code on a serverless platform? That would explain this behavior, but it will make it tricky to address the problem.
If you're not on serverless, you could also enable keepalives. You can do that by setting the channel option grpc.keepalive_time_ms, which controls how frequently the client sends pings to keep the connection alive. You would also want to enable the grpc.keepalive_permit_without_calls option by setting it to 1.
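For anyone who constructs the grpc-js client directly, a minimal sketch of passing those channel options; the office.proto file name is an assumption based on the /office.Office/diff path in the logs, and the address comes from the same logs:

const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');

// office.proto is a placeholder matching the office.Office service seen in the logs.
const packageDefinition = protoLoader.loadSync('office.proto');
const proto = grpc.loadPackageDefinition(packageDefinition);

const client = new proto.office.Office(
  '172.25.187.79:5301',
  grpc.credentials.createInsecure(),
  {
    // Ping the server every 60 seconds to keep the connection from going stale.
    'grpc.keepalive_time_ms': 60000,
    // Allow keepalive pings even when no calls are in flight (not for serverless).
    'grpc.keepalive_permit_without_calls': 1,
  }
);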
Thanks for the update. Yes after I posted the log I realized I had already downgraded to grpc-js 1.2.12. I have been trying different versions from 1.2.1 through 1.3.5 and the behavior is exactly the same in all of them.
This is a Node.js application using the @domino/domino-db package by HCL to connect to a Domino database. grpc-js is a dependency of that package. Can I enable the keepalives via NODE_OPTIONS? Perhaps I need to contact the HCL team directly.
I'm not surprised that downgrading the dependency didn't help. I don't see any reason to think that this is a regression. This is just a fundamental problem where the client thinks the connection is still good until it tries to make a request and doesn't get a response.
These options are per-channel, so you can't set them with an environment variable. You probably will have to contact that other team so that they can allow you to set that option.
We have been facing the same issue quite often while using Firestore, and the severity is very high, as it ends up with data loss or unacceptable latency most of the time. You can find detailed logs here if it helps: https://github.com/googleapis/nodejs-firestore/issues/1023#issuecomment-894821062
So I was also getting this issue, and mine seemed to be a potential memory leak. I optimised my code so less is stored in memory, and the errors seem to stay away for now.
@Pixelatex Thanks for sharing. Did you get any error messages or any other indicator that the root cause was memory-related when you were getting this issue?
Hi @murgatroid99, we noticed that there are improvements to handling closed streams in @grpc/grpc-js 1.3.7. Do you think upgrading to 1.3.7 will solve this issue as well?
The change I made in 1.3.7 just changes the error code that that particular error is surfaced with. It won't change the latency you're seeing because that's a much lower-level issue, where the client just isn't getting responses from the remote end, and it's waiting for the TCP timeout.
@murgatroid99 Thanks for your comment. As I mentioned in my previous comment, we're getting this error on Firestore and Firebase Cloud Functions. Do you think the root cause is something Firebase should fix, a grpc-js issue that might be fixed in one of the next versions, or something we should fix in our client / Cloud Functions code? This issue has become a blocker for us, so any help would be much appreciated, thanks.
If you're not on serverless, you could also enable keepalives. You can do that by setting the channel option grpc.keepalive_time_ms. That controls the frequency with which the client sends pings to keep the connection alive. You would also want to enable the grpc.keepalive_permit_without_calls option by setting it to 1.
@murgatroid99 what do you recommend setting grpc.keepalive_time_ms to?
This same issue is happening with us running Node v12.14 and grpc-node-js/1.3.7 :/
@pROFESOR11 This problem is likely a bad interaction between how gRPC handles connections and how Firebase Cloud Functions networking works. I don't expect to have a fix in the near future, but I can recommend the following to mitigate the problem: set the grpc.keepalive_time_ms and grpc.keepalive_timeout_ms options to enable keepalive functionality. This will help the client detect that the server is unavailable more quickly. Importantly, on a serverless system like Firebase Cloud Functions, you should not set the option grpc.keepalive_permit_without_calls.
It seems that people commonly set grpc.keepalive_time_ms to 10000 (10 seconds) and grpc.keepalive_timeout_ms to 5000 (5 seconds), but the best values may change depending on the server you are talking to. For this mitigation to work optimally, the timeout you use to set the deadline should be at least slightly greater than the sum of those two values. One thing to watch for: if you start getting log lines that say "Connection to [address] rejected by server because of excess pings", that means your setting for grpc.keepalive_time_ms is too low.
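As a rough sketch of that combination for anyone who builds the client themselves; domino.proto is a placeholder inferred from the /domino.Server/BulkNote path in the logs above, and the request body is omitted:

const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');

// domino.proto is a placeholder; the service and method names follow the logs above.
const packageDefinition = protoLoader.loadSync('domino.proto');
const proto = grpc.loadPackageDefinition(packageDefinition);

const client = new proto.domino.Server(
  '9.12.83.179:3003',
  grpc.credentials.createInsecure(),
  {
    'grpc.keepalive_time_ms': 10000,   // ping every 10 seconds during active calls
    'grpc.keepalive_timeout_ms': 5000, // treat the connection as dead if no ack within 5 seconds
    // grpc.keepalive_permit_without_calls is deliberately left unset on serverless
  }
);

// Deadline slightly greater than 10s + 5s, so the keepalive can surface a dead
// connection before the call itself gives up.
const deadline = new Date(Date.now() + 20000);
client.BulkNote({ /* request fields */ }, { deadline }, (err, response) => {
  if (err) {
    console.error(err.code, err.details);
    return;
  }
  console.log(response);
});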
I had the exact same issue.
D 2022-05-25T09:19:37.476Z | call_stream | [14] ended with status: code=13 details="Received RST_STREAM with code 2 triggered by internal client error: Protocol error"
I didn't think the contents of the .env file were deployed when I deployed, but it turns out Firebase picks them up:
FIRESTORE_EMULATOR_HOST="localhost:8080"
PUBSUB_EMULATOR_HOST="localhost:8085"
PUBSUB_PROJECT_ID="myrandomproject-local"
😱 These env-vars certainly shouldn't have been deployed and the error makes sense. That's why the Functions timed out when trying to connect with Firestore in my case. It is interesting that it times out rather than promptly failing.
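For anyone hit by the same thing, a small defensive sketch; the variable names mirror the .env contents above, and whether to delete them at runtime (rather than fixing the deploy itself) is a judgment call:

// Guard against emulator settings leaking into a deployed environment.
// FIRESTORE_EMULATOR_HOST and PUBSUB_EMULATOR_HOST are read by the Google
// client libraries; if they are present in production, the SDKs try to reach
// a localhost emulator and requests can hang or time out.
// This must run before any Firebase/Firestore client is initialized.
const emulatorVars = ['FIRESTORE_EMULATOR_HOST', 'PUBSUB_EMULATOR_HOST'];

if (process.env.NODE_ENV === 'production') {
  for (const name of emulatorVars) {
    if (process.env[name]) {
      console.warn(`Removing unexpected ${name}=${process.env[name]} in production`);
      delete process.env[name];
    }
  }
}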
I had the same issue as @hfossli-agens. In my Google Cloud Run service, process.env.NODE_ENV wasn't being set to production, so the emulator env variables were being set and the firebase-admin SDK couldn't connect to the cloud services. 🤦♂️
Same thing here...
const isDev = process.env.ENVIRONMENT === 'development';
ENVIRONMENT was not being set in the local environment, so it was tracking the local errors from dev.
I reuse one connection for all requests. Sometimes the connection seems to get disconnected after being idle for a long time (client error: read ECONNRESET), but the second request works fine.
How can I configure grpc-js to reconnect automatically and avoid this error?