grpc / grpc-node

gRPC for Node.js
https://grpc.io
Apache License 2.0

Error: 14 UNAVAILABLE: TCP Read failed #692

Open simoncirstoiu opened 5 years ago

simoncirstoiu commented 5 years ago

Problem description

Conversation Error: { Error: 14 UNAVAILABLE: TCP Read failed
    at Object.exports.createStatusError (/relay/node_modules/grpc/src/common.js:91:15)
    at ClientDuplexStream._emitStatusIfDone (/relay/node_modules/grpc/src/client.js:233:26)
    at ClientDuplexStream._receiveStatus (/relay/node_modules/grpc/src/client.js:211:8)
    at Object.onReceiveStatus (/relay/node_modules/grpc/src/client_interceptors.js:1306:15)
    at InterceptingListener._callNext (/relay/node_modules/grpc/src/client_interceptors.js:568:42)
    at InterceptingListener.onReceiveStatus (/relay/node_modules/grpc/src/client_interceptors.js:618:8)
    at /relay/node_modules/grpc/src/client_interceptors.js:1123:18
  code: 14,
  metadata: Metadata { _internal_repr: {} },
  details: 'TCP Read failed' }

Environment

Additional context

It happens randomly, but it started doing this about a month ago.

kyle-mccarthy commented 5 years ago

I am having this same issue on Google App Engine standard, node 10.

I am using @google-cloud/logging-winston, which depends on @google-cloud/logging, which in turn depends on @google-cloud/common-grpc, which uses this package. So I can't say exactly what is causing this issue.

Is this a problem with this package or how other packages implement it?

Error: 14 UNAVAILABLE: TCP Read failed
    at Object.exports.createStatusError (/srv/node_modules/grpc/src/common.js:87:15)
    at Object.onReceiveStatus (/srv/node_modules/grpc/src/client_interceptors.js:1188:28)
    at InterceptingListener._callNext (/srv/node_modules/grpc/src/client_interceptors.js:564:42)
    at InterceptingListener.onReceiveStatus (/srv/node_modules/grpc/src/client_interceptors.js:614:8)
    at callback (/srv/node_modules/grpc/src/client_interceptors.js:841:24)

sergioregueira commented 5 years ago

Related issue: https://github.com/googleapis/google-cloud-node/issues/2438

murgatroid99 commented 5 years ago

The status responses described here are expected behavior of gRPC itself. Generally, we expect that some responses, including UNAVAILABLE, should be retried. In these cases the library that uses gRPC should perform the retry logic, as described in that linked issue.

sergioregueira commented 5 years ago

@murgatroid99 the problem is that the gRPC library does not reconnect after that error, so applications stop sending logs to Stackdriver. When this happens, we have to replace the affected instance with a new one.

murgatroid99 commented 5 years ago

First, to make sure that we're on the same page, the way gRPC reconnection is supposed to work is that if a connection fails, then the next time the user starts a call the library starts trying to reestablish the connection. A user that retries failed UNAVAILABLE requests should trigger that reconnection.

So, are you saying that the Stackdriver library is retrying the failed request after the connection drops, and when it does so the gRPC library never reestablishes the connection? If so, that is a major bug that we are otherwise unaware of, so please file a full bug report with reproduction steps.

nicolasnoble commented 5 years ago

Emphasis on "if the library is retrying the failed request". The initial bug report seems to indicate this isn't the case.

alexander-fenster commented 5 years ago

To add some details here. Client libraries use gRPC via the gax layer, and that gax layer can retry calls if the library is configured to do so. By 'configured' I mean the JSON configuration file that is passed to the client library constructor. If this file marks a given method as idempotent - i.e. safe to retry - the gax layer will do the retries. If it's non-idempotent, just like all POST methods, it won't be retried automatically, and retries should be taken care of by the caller.

In the logging library, the configuration file is here. @crwilcox made a change 2 weeks ago to treat all the logging methods as idempotent, which means that you might get a duplicated log line in case of a retry, but you should not get failures: https://github.com/googleapis/nodejs-logging/pull/366 It does not fix the TCP error that makes gRPC fail, but at least it should make gax retry the failed call (we are talking here only about the logging libraries).

Let us make a new release of the logging library; it should fix most of the problems related to the logging libraries. For other libraries, let's go one by one and see what's going on. In general, though, the caller code should be ready to get UNAVAILABLE (it's a network, so any call can technically fail), and retrying UNAVAILABLE for non-idempotent methods is something your code should take care of.
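To make the idempotency configuration more concrete, here is a minimal sketch of what such a client_config JSON looks like (this is not the actual shipped file; the retry parameter values are example numbers only):

    {
      "interfaces": {
        "google.logging.v2.LoggingServiceV2": {
          "retry_codes": {
            "idempotent": ["DEADLINE_EXCEEDED", "UNAVAILABLE"],
            "non_idempotent": []
          },
          "retry_params": {
            "default": {
              "initial_retry_delay_millis": 100,
              "retry_delay_multiplier": 1.3,
              "max_retry_delay_millis": 60000,
              "total_timeout_millis": 600000
            }
          },
          "methods": {
            "WriteLogEntries": {
              "retry_codes_name": "idempotent",
              "retry_params_name": "default"
            }
          }
        }
      }
    }

Pointing a method at the "idempotent" retry codes group is what tells gax to retry it on UNAVAILABLE.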

sergioregueira commented 5 years ago

@murgatroid99 @nicolasnoble I am sorry for the delay, I was a bit busy these past few days.

The same day you asked me for more information, I enabled gRPC debugging by setting some environment variables (GRPC_TRACE=tcp GRPC_VERBOSITY=DEBUG) and monitored the logs for several days.

I could confirm that, as you mentioned, nodejs-logging tries to reconnect automatically when something fails. Here is an example where it reconnects successfully (and logs keep being persisted):

I0120 19:12:12.874089209       7 tcp_custom.cc:128]          TCP:0x79c4ea52120 call_cb 0x79c4e995158 0x1537f52a69d0:0x79c4e995100
I0120 19:12:12.874262453       7 tcp_custom.cc:286]          TCP 0x79c4ea52120 shutdown why={"created":"@1548011532.874204613","description":"Endpoint read failed","file":"../deps/grpc/src/core/ext/transport/chttp2/transport/chttp2_transport.cc","file_line":2510,"occurred_during_write":0,"referenced_errors":[{"created":"@1548011532.874198969","description":"Secure read failed","file":"../deps/grpc/src/core/lib/security/transport/secure_endpoint.cc","file_line":183,"referenced_errors":[{"created":"@1548011532.874048422","description":"TCP Read failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"connection reset by peer"}]}]}
D0120 19:12:12.874357814       7 dns_resolver.cc:279]        Start resolving.
I0120 19:13:20.597892586       7 tcp_client_custom.cc:150]   CLIENT_CONNECT: 0x79c4e86a890 ipv6:[2a00:1450:4013:c01::5f]:443: asynchronously connecting
I0120 19:13:20.698567576       7 tcp_custom.cc:348]          Creating TCP endpoint 0x79c4e86a890
I0120 19:13:20.698936427       7 tcp_custom.cc:217]          write complete on 0x79c4e86a890: error="No Error"

Nevertheless, I found some cases where the reconnection was not completed (and logs stopped):

I0120 23:47:28.792244320       7 tcp_custom.cc:128]          TCP:0x79c4e33d0a0 call_cb 0x79c4da4be78 0x1537f52a69d0:0x79c4da4be20
I0120 23:47:28.792345598       7 tcp_custom.cc:132]          read: error={"created":"@1548028048.792215883","description":"TCP Read failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"connection reset by peer"}
I0120 23:47:28.792489454       7 tcp_custom.cc:286]          TCP 0x79c4e33d0a0 shutdown why={"created":"@1548028048.792444724","description":"Endpoint read failed","file":"../deps/grpc/src/core/ext/transport/chttp2/transport/chttp2_transport.cc","file_line":2510,"occurred_during_write":0,"referenced_errors":[{"created":"@1548028048.792438609","description":"Secure read failed","file":"../deps/grpc/src/core/lib/security/transport/secure_endpoint.cc","file_line":183,"referenced_errors":[{"created":"@1548028048.792215883","description":"TCP Read failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"connection reset by peer"}]}]}
D0120 23:47:28.792592164       7 dns_resolver.cc:279]        Start resolving.
***LOGS CEASED***

I think the change made by @crwilcox will reduce the probability of losing logs due to request errors, but the error I have just detailed will keep happening.

By the way, thanks @alexander-fenster for your clear explanation!

sergioregueira commented 5 years ago

More examples where it does not reconnect:

I0118 20:49:56.469767934       6 tcp_custom.cc:128]          TCP:0x6714db57150 call_cb 0x6714dbdfb78 0x15de0f0a69d0:0x6714dbdfb20
I0118 20:49:56.469810638       6 tcp_custom.cc:132]          read: error={"created":"@1547844596.469749556","description":"TCP Read failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"connection reset by peer"} 
I0118 20:49:56.469885894       6 tcp_custom.cc:234]          WRITE 0x6714db57150 (peer=ipv6:[2a00:1450:4013:c03::5f]:443): ***TRUNCATED***
I0118 20:49:56.469918688       6 tcp_custom.cc:217]          write complete on 0x6714db57150: error={"created":"@1547844596.469907967","description":"TCP Write failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"connection reset by peer"}
I0118 20:49:56.469978028       6 tcp_custom.cc:286]          TCP 0x6714db57150 shutdown why={"created":"@1547844596.469848883","description":"Delayed close due to in-progress write","file":"../deps/grpc/src/core/ext/transport/chttp2/transport/chttp2_transport.cc","file_line":609,"referenced_errors":[{"created":"@1547844596.469834452","description":"Endpoint read failed","file":"../deps/grpc/src/core/ext/transport/chttp2/transport/chttp2_transport.cc","file_line":2510,"occurred_during_write":0,"referenced_errors":[{"created":"@1547844596.469830392","description":"Secure read failed","file":"../deps/grpc/src/core/lib/security/transport/secure_endpoint.cc","file_line":183,"referenced_errors":[{"created":"@1547844596.469749556","description":"TCP Read failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"connection reset by peer"}]}]},{"created":"@1547844596.469907967","description":"TCP Write failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"connection reset by peer"}]}
D0118 20:49:56.470028601       6 dns_resolver.cc:279]        Start resolving.
***LOGS CEASED***
I0119 06:29:15.447665078       6 tcp_custom.cc:128]          TCP:0x710b1824f20 call_cb 0x710b19157d8 0x15fcf80a69d0:0x710b1915780
I0119 06:29:15.452015545       6 tcp_custom.cc:132]          read: error={"created":"@1547879355.447619730","description":"TCP Read failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"connection reset by peer"}
I0119 06:29:15.452104651       6 tcp_custom.cc:286]          TCP 0x710b1824f20 shutdown why={"created":"@1547879355.452064405","description":"Endpoint read failed","file":"../deps/grpc/src/core/ext/transport/chttp2/transport/chttp2_transport.cc","file_line":2510,"occurred_during_write":0,"referenced_errors":[{"created":"@1547879355.452060353","description":"Secure read failed","file":"../deps/grpc/src/core/lib/security/transport/secure_endpoint.cc","file_line":183,"referenced_errors":[{"created":"@1547879355.447619730","description":"TCP Read failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"connection reset by peer"}]}]}
D0119 06:29:15.452271642       6 dns_resolver.cc:279]        Start resolving.
***LOGS CEASED***
I0121 06:29:17.716285929       5 tcp_custom.cc:128]          TCP:0x7a5ecde6af0 call_cb 0x7a5eced6498 0x153cd0ca69d0:0x7a5eced6440
I0121 06:29:17.716381358       5 tcp_custom.cc:132]          read: error={"created":"@1548052157.716232153","description":"TCP Read failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"connection reset by peer"}
I0121 06:29:17.716458612       5 tcp_custom.cc:286]          TCP 0x7a5ecde6af0 shutdown why={"created":"@1548052157.716404989","description":"Endpoint read failed","file":"../deps/grpc/src/core/ext/transport/chttp2/transport/chttp2_transport.cc","file_line":2510,"occurred_during_write":0,"referenced_errors":[{"created":"@1548052157.716401286","description":"Secure read failed","file":"../deps/grpc/src/core/lib/security/transport/secure_endpoint.cc","file_line":183,"referenced_errors":[{"created":"@1548052157.716232153","description":"TCP Read failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"connection reset by peer"}]}]}
D0121 06:29:17.716564239       5 dns_resolver.cc:279]        Start resolving.
***LOGS CEASED***
crwilcox commented 5 years ago

I released a new version of the @google-cloud/logging package that will now retry when writing log entries on UNAVAILABLE. https://www.npmjs.com/package/@google-cloud/logging/v/4.3.0

For operations deemed non-idempotent, as is the case with WriteLogEntries (the RPC call that underlies many logging calls), it isn't necessarily safe for the client to retry. In this case the decision was made that duplicated logs are preferable to not logging at all. Unfortunately, there will be other non-idempotent RPCs where the situation is more complicated.

murgatroid99 commented 5 years ago

@sergioregueira When you say that the logs ceased, do you mean that in each case, that is the last log line output from that process, and that the other logs come from different processes? If so, what is the behavior of those processes after they stop outputting logs?

Also, I don't know exactly what behavior of the gRPC library we are seeing here, but the expected behavior is not that the library always reconnects; it generally only reconnects if there is an existing pending request. Do any of the logs you have show that it is failing to reconnect while there is a pending request? If not, you may need to get more trace information. The api tracer should help, so you can modify the environment variables to GRPC_TRACE=api,tcp. The full list of available tracers is in this document.

sergioregueira commented 5 years ago

@murgatroid99 I mean the application continues working properly, but Stackdriver does not process any more logs from the affected instance (including the gRPC traces). Logs generated automatically (e.g. appengine.googleapis.com/request_log) are not interrupted.

When that error happens I have to delete the affected instance because the logs generated by that instance after the error are not registered.

The logs I attached in my previous post are all I could obtain after the errors. I will update that environment variable right now to get more information. If you have additional suggestions to debug the problem, just let me know.

murgatroid99 commented 5 years ago

I just want to make sure I am understanding this correctly. You are using the Stackdriver logger to save the gRPC logs generated from that same Stackdriver logger, which you are using to debug why the logger is failing?

sergioregueira commented 5 years ago

@murgatroid99

On the one hand, our application logs are sent to Stackdriver via nodejs-logging (which uses gRPC internally) in a structured way (using the LogEntry.jsonPayload object).

On the other hand, the gRPC logs are printed by the library itself to stdout/stderr (we did not have to code anything) and collected automatically by Stackdriver (as a simple LogEntry.textPayload string).

In conclusion, we use logs that are NOT sent via gRPC to debug the gRPC protocol.

murgatroid99 commented 5 years ago

Actually, it would probably help the most to run with GRPC_TRACE=all. That will tell us with the greatest granularity possible the very last things the gRPC library does before we stop getting logs.

sergioregueira commented 5 years ago

Good news! I updated the @google-cloud/logging package to version 4.3.0 one week ago and no errors have been detected in our integration, staging, or production environments since then. Thanks everyone :)

P.S.: I will keep collecting all gRPC logs for a few more days just in case.

Jeff-Tian commented 5 years ago

Met the same issue with the latest grpc npm package (npm install grpc@1.21.1).

[image attached]

fajardm commented 5 years ago

same issue

RomanVarhola commented 5 years ago

I have the same issue on AWS.

QuestionAndAnswer commented 5 years ago

Same issue on GKE

Sieabah commented 5 years ago

We started seeing this issue across our applications last Friday. It's as if once the connection drops, the client never reconnects and keeps trying to use the dead connection.

jacohend commented 4 years ago

Looks like this often occurs when a connection drops in Docker Swarm. You can use endpoint-mode: dnsrr to fix this, but apparently this will nuke all published ports except those in host mode.

JasperWoo commented 4 years ago

We are seeing this issue when using the Firestore libraries on GCP.

Sieabah commented 4 years ago

We ended up having to fix this at the network level to just drop the connection.

stuckj commented 4 years ago

I'm seeing a similar, though not identical, error from grpc while using the firebase-admin JS library (which uses google/firestore). The exact error I'm seeing is:

12:17:53 PM: 2020-01-07T17:17:53.280Z e5cf6caf-cbc5-4cd8-a1fa-a73df0be9983 INFO Error processing request { Error: 14 UNAVAILABLE: Stream refused by server
    at Object.callErrorFromStatus (/var/task/node_modules/@grpc/grpc-js/build/src/call.js:30:26)
    at Http2CallStream.call.on (/var/task/node_modules/@grpc/grpc-js/build/src/client.js:96:33)
    at Http2CallStream.emit (events.js:203:15)
    at process.nextTick (/var/task/node_modules/@grpc/grpc-js/build/src/call-stream.js:75:22)
    at process._tickCallback (internal/process/next_tick.js:61:11)
  code: 14,
  details: 'Stream refused by server',
  metadata: Metadata { internalRepr: Map {}, options: {} } }

So, stream refused by server. I would chalk this up to intermittent Firestore errors, but we see this a few times an hour and currently have a pretty limited number of requests (only about 10 an hour or so), so this is a pretty high error rate.

This is happening in a backend cloud function on AWS infrastructure. Specifically, we're using Netlify, which uses AWS Lambda as its cloud function provider.

For now, I'm going to implement some retry logic on our cloud function to see if that resolves the issue or if the issue persists on any lambda instances in which it shows up.

I'll also enable the GRPC_TRACE=all GRPC_VERBOSITY=DEBUG environment variables to see if that yields any more detail.

I'll report back with whatever I find.

iamdhrooov commented 4 years ago

Also facing the same issue in production code. We have set up a Go gRPC server and a Node gRPC client. This happens intermittently, which makes it even more of a pain to debug. I also tried setting keepalive params and timeouts on the server and client. Please suggest a solution for this. I will also update if I find another workaround.
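For reference, this is roughly how the client-side keepalive options can be set (a sketch only: the proto path, package name, service name and endpoint are placeholders, and the values are just the ones we experimented with):

    const grpc = require('grpc');
    const protoLoader = require('@grpc/proto-loader');

    // Placeholder proto/package/service names.
    const packageDefinition = protoLoader.loadSync('service.proto');
    const proto = grpc.loadPackageDefinition(packageDefinition).mypackage;

    // Channel arguments that control client-side keepalive pings.
    const channelOptions = {
      'grpc.keepalive_time_ms': 30000,          // send a keepalive ping every 30s
      'grpc.keepalive_timeout_ms': 10000,       // wait up to 10s for the ping ack
      'grpc.keepalive_permit_without_calls': 1, // ping even when there are no active calls
    };

    const client = new proto.MyService(
      'my-grpc-server.example.com:443',
      grpc.credentials.createSsl(),
      channelOptions
    );

Note that the server side has to be configured to allow pings this frequent (its keepalive enforcement policy), otherwise it can answer with GOAWAY ("too_many_pings").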

wdalmut commented 4 years ago

:+1: same issue on production code with Hyperledger Fabric v1.2

stuckj commented 4 years ago

I don't have useful logging to provide (complications with getting the debug logs sent to our log service). But I can say that after implementing retry logic (up to 5 retries with a fixed 100 ms delay rather than exponential backoff), the problem has completely gone away.

So I think, in my case at least, these errors are considered normal by the Firestore library and the user is expected to retry. /shrug
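For reference, the shape of that retry logic is roughly the following (a simplified sketch, not our exact code; writeToFirestore and payload are placeholders for whatever call was failing):

    // Retry up to 5 times with a fixed 100 ms delay when a call fails with
    // gRPC status code 14 (UNAVAILABLE); rethrow anything else immediately.
    async function retryOnUnavailable(fn, attempts = 5, delayMs = 100) {
      for (let i = 1; ; i++) {
        try {
          return await fn();
        } catch (err) {
          if (err.code !== 14 || i >= attempts) throw err;
          await new Promise((resolve) => setTimeout(resolve, delayMs));
        }
      }
    }

    // Usage (writeToFirestore is a placeholder for the failing call):
    // await retryOnUnavailable(() => writeToFirestore(payload));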

rodrigoreis22 commented 4 years ago

@stuckj I'm seeing the same pattern on our AWS Lambda functions using the firestore client (latest version).

I was able to reproduce almost 100% of the cases in our dev environment.

I think this is related to how the container works in AWS Lambda. For the first call (cold-start Lambda) the Firestore client is initialized and the data is written properly; if I do a second try right after, it works fine because the Lambda is "warm". If I then wait ~5 minutes and try to post again, I get an error almost 100% of the time, and that was while I was the only one using the dev environment.

It looks like the persisted connection/connection pool from Firestore is not being reused, or cannot be reused, after the Lambda function has had no activity for a few minutes.

Then I tried setting provisioned concurrency for the Lambda to try to keep it "warm". After waiting 5 minutes and trying again I got a different problem: a timeout.

It looks like the connection pooling doesn't work very well in a serverless environment with ephemeral containers.

murgatroid99 commented 4 years ago

For anyone who is experiencing this problem: if you are using the @grpc/grpc-js library (this is probably the case if you are using one of the Google Cloud or Firebase API client libraries), I suggest updating your dependencies to pick up the newest version of the library, published a couple of days ago. That version has some changes to some of the logic that emits this kind of error, so you should no longer see it in some cases.

orgads commented 4 years ago

I suppose you refer to #1199? It did not fix this issue.

murgatroid99 commented 4 years ago

@orgads I think I was actually referring to #1251. I don't think #1199 is related to the problems reported here at all.

stefanotauriello commented 4 years ago

I'm hitting the same pattern @rodrigoreis22 describes above, also in AWS Lambda. How can I fix it?

stuckj commented 4 years ago

Also, you definitely cannot create the Firestore client outside of your Lambda handler if you're using AWS Lambda. The client must be created fresh for every request. The way Lambda "freeze-dries" your functions when they go into a cold state breaks all socket connections, which messes up the state of the Firestore client (in my experience). I've had similar problems with Winston for logging, and the solution is to initialize whatever object holds the network connections inside the Lambda handler.
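For example, something along these lines (a rough sketch using firebase-admin's namespaced API; the collection name and event fields are placeholders):

    const admin = require('firebase-admin');

    exports.handler = async (event) => {
      // Initialize inside the handler so a thawed container does not keep using
      // sockets that died while it was frozen; the guard avoids the
      // "app already exists" error on warm invocations. If a reused instance
      // still hits dead sockets, you may need to delete and re-create the app
      // per request instead.
      if (!admin.apps.length) {
        admin.initializeApp();
      }
      const db = admin.firestore();

      // Placeholder write; replace with your own logic.
      await db.collection('requests').add({ receivedAt: Date.now(), body: event.body });
      return { statusCode: 200 };
    };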

stefanotauriello commented 4 years ago

Thanks


ming2020 commented 4 years ago

Try downgrading grpc to 1.22.2.

raghuchahar007 commented 4 years ago

I am also having the same issue while doing a simple query on Firestore:

{ Error: 14 UNAVAILABLE: DNS resolution failed
    at Object.exports.createStatusError (/node_modules/grpc/src/common.js:91:15)
    at ClientReadableStream._emitStatusIfDone (/node_modules/grpc/src/client.js:233:26)
    at ClientReadableStream._readsDone (/node_modules/grpc/src/client.js:199:8)
    at /node_modules/grpc/src/client_interceptors.js:683:15

Versions: firebase-admin 7.0.0, Node 6.x

Please help!

alexander-fenster commented 4 years ago

@raghuchahar007 what are your gRPC versions? Please post the output of npm ls grpc @grpc/grpc-js google-gax.

grahamjenson commented 4 years ago

We have a similar issue; we worked around it by putting the following in .npmrc:

build_from_source=grpc
grpc_alpn=false

I am hoping to find a way to do this without rebuilding, but this is my best guess.

dooleyb1 commented 4 years ago

Resolved personally by updating node modules FROM:

    "firebase-admin": "^8.7.0",
    "googleapis": "^42.0.0",

TO:

    "firebase-admin": "^9.1.1",
    "googleapis": "^59.0.0",
mayankpunetha007 commented 2 years ago

Having the same problem with: "@grpc/grpc-js": "^1.3.7", "@grpc/proto-loader": "^0.6.5"

Setup is: Local Node Server ---HTTPS---> Google Load Balancer ---HTTP/2---> gRPC backend.
Always getting: Error: 14 UNAVAILABLE: Connection dropped

Any luck with this for anyone else?