grpc / grpc-node

gRPC for Node.js
https://grpc.io
Apache License 2.0
4.49k stars 649 forks source link

ECONNRESET [{"message":"14 UNAVAILABLE: read ECONNRESET"}] #1907

Open Arun-KumarH opened 3 years ago

Arun-KumarH commented 3 years ago

Problem description

We experience intermittent connection reset errors migrating to grpc-js version "1.3.2". We are creating the Grpc client instance every time when message is processed. On the client side we see the stack trace ECONNRESET [{"message":"14 UNAVAILABLE: read ECONNRESET"}] and no log messages on the server side although GRPC_TRACE=all and GRPC_VERBOSITY=DEBUG on server side as well. We have attached the debug logs seen on the client side. This issue is seen in our production k8s environment and it is intermittent.

Environment

Additional context

Client Logging:

2021-09-19T16:53:33.253Z | call_stream | [79] write() called with message of length 34
2021-09-19T16:53:33.255Z | call_stream | [79] end() called
2021-09-19T16:53:33.258Z | call_stream | [79] deferring writing data chunk of length 39
2021-09-19T16:53:33.263Z | dns_resolver | Resolved addresses for target dns:x-des:50051: [x.x.x.x:50051]
2021-09-19T16:53:33.264Z | pick_first | Connect to address list x.x.x.x:50051
2021-09-19T16:53:33.264Z | subchannel_refcount | x.x.x.x:50051 refcount 4 -> 5
2021-09-19T16:53:33.264Z | pick_first | Pick subchannel with address x.x.x.x:50051
2021-09-19T16:53:33.264Z | pick_first | IDLE -> READY
2021-09-19T16:53:33.264Z | resolving_load_balancer | dns:x-des:50051 CONNECTING -> READY
2021-09-19T16:53:33.264Z | channel | callRefTimer.unref | configSelectionQueue.length=1 pickQueue.length=0
2021-09-19T16:53:33.265Z | connectivity_state | dns:x-des:50051 CONNECTING -> READY
2021-09-19T16:53:33.265Z | subchannel_refcount | x.x.x.x:50051 refcount 5 -> 6
2021-09-19T16:53:33.265Z | subchannel_refcount | x.x.x.x:50051 refcount 6 -> 5
2021-09-19T16:53:33.265Z | channel | Pick result: COMPLETE subchannel: x.x.x.x:50051 status: undefined undefined
2021-09-19T16:53:33.266Z | call_stream | Starting stream on subchannel x.x.x.x:50051 with headers
                grpc-timeout: 99946m
                grpc-accept-encoding: identity,deflate,gzip
                accept-encoding: identity
                :authority: x-des:50051
                user-agent: grpc-node-js/1.3.7
                content-type: application/grpc
                :method: POST
                :path: /io.xingular.device.Service/Read
                te: trailers

2021-09-19T16:53:33.266Z | call_stream | [79] attachHttp2Stream from subchannel x.x.x.x:50051
2021-09-19T16:53:33.266Z | subchannel_refcount | x.x.x.x:50051 callRefcount 0 -> 1
2021-09-19T16:53:33.266Z | call_stream | [79] sending data chunk of length 39 (deferred)
2021-09-19T16:53:33.266Z | call_stream | [79] calling end() on HTTP/2 stream
2021-09-19T16:53:33.273Z | call_stream | [79] Node error event: message=read ECONNRESET code=ECONNRESET errno=Unknown system error -104 syscall=read
2021-09-19T16:53:33.273Z | subchannel | x.x.x.x:50051 connection closed with error read ECONNRESET
2021-09-19T16:53:33.274Z | subchannel | x.x.x.x:50051 connection closed
2021-09-19T16:53:33.274Z | subchannel | x.x.x.x:50051 READY -> IDLE
2021-09-19T16:53:33.274Z | subchannel_refcount | x.x.x.x:50051 refcount 5 -> 4
2021-09-19T16:53:33.274Z | pick_first | READY -> IDLE
2021-09-19T16:53:33.275Z | resolving_load_balancer | dns:x-des:50051 READY -> IDLE
2021-09-19T16:53:33.275Z | connectivity_state | dns:x-des:50051 READY -> IDLE
2021-09-19T16:53:33.275Z | subchannel_refcount | x.x.x.x:50051 refcount 4 -> 3
2021-09-19T16:53:33.275Z | pick_first | READY -> IDLE
2021-09-19T16:53:33.275Z | resolving_load_balancer | dns:x-des:50051 READY -> IDLE
2021-09-19T16:53:33.275Z | connectivity_state | dns:x-des:50051 READY -> IDLE
2021-09-19T16:53:33.275Z | subchannel_refcount | x.x.x.x:50051 refcount 3 -> 2
2021-09-19T16:53:33.276Z | pick_first | READY -> IDLE
2021-09-19T16:53:33.276Z | resolving_load_balancer | dns:x-des:50051 READY -> IDLE
2021-09-19T16:53:33.276Z | connectivity_state | dns:x-des:50051 READY -> IDLE
2021-09-19T16:53:33.276Z | subchannel_refcount | x.x.x.x:50051 refcount 2 -> 1
2021-09-19T16:53:33.276Z | pick_first | READY -> IDLE
2021-09-19T16:53:33.276Z | resolving_load_balancer | dns:x-des:50051 READY -> IDLE
2021-09-19T16:53:33.276Z | connectivity_state | dns:x-des:50051 READY -> IDLE
2021-09-19T16:53:33.277Z | call_stream | [79] HTTP/2 stream closed with code 2
2021-09-19T16:53:33.277Z | call_stream | [79] ended with status: code=14 details="read ECONNRESET"
2021-09-19T16:53:33.278Z | subchannel_refcount | x.x.x.x:50051 callRefcount 1 -> 0
^[[31merror^[[39m: 2021-09-19T16:53:33.278Z: Error serving unary request 14 UNAVAILABLE: read ECONNRESET [{"message":"14 UNAVAILABLE: read ECONNRESET"}]
murgatroid99 commented 3 years ago

Sometimes connections get dropped for reasons outside of the library's control. Just retry the call when that happens. The client will reconnect when you do.

In addition, you say "We are creating the Grpc client instance every time when message is processed." We recommend not doing that. You can just use one client for all of your requests. Using multiple clients isn't even accomplishing anything for you right now. They're all using the same underlying connection anyway.

gillsoftab commented 3 years ago

Try to downgrade to Node version 16.8.0 and see if that helps. We tested Node version 16.9.0 and 16.9.1 with no luck. Downgrading to Node version 16.8.0 works without any problems.

vanthome commented 3 years ago

@gillsoftab what exact issues did you have with recent node versions?

gillsoftab commented 3 years ago

Node version 16.9.0 and 16.9.1 same as described in https://github.com/nodejs/node/issues/39683 we also detected issues with an app that is using template (node add-on in c++).

Jaypov commented 3 years ago

Sometimes connections get dropped for reasons outside of the library's control. Just retry the call when that happens. The client will reconnect when you do.

In addition, you say "We are creating the Grpc client instance every time when message is processed." We recommend not doing that. You can just use one client for all of your requests. Using multiple clients isn't even accomplishing anything for you right now. They're all using the same underlying connection anyway.

I am having the same issue as this, i am in the process of downgrading my docker containers to use node 16.8.0 as suggested. However it is probably a good idea to have retires.

could anyone let me know if there is a way to do this inside @grpc/grpc-js config options. looking a GRPC .net library i can see there is a MaxAttempts = 5 param. does anyone know if there is something similar in @grpc/grpc-js library?

all i can see in this library is "grpc.initial_reconnect_backoff_ms" under channelOptions so my assumption would be it does retries by default, however it does not seem like this is the case

murgatroid99 commented 3 years ago

all i can see in this library is "grpc.initial_reconnect_backoff_ms" under channelOptions so my assumption would be it does retries by default, however it does not seem like this is the case

Connections are separate things for requests. This is one of the options that controls how the channel re-establishes connections after they are dropped. The @grpc/grpc-js library does not currently have any built-in support for doing requests automatically, so you will need to do that yourself. For example, you could use an interceptor, as described in the interceptor spec's examples section.

Jaypov commented 3 years ago

Thanks for the info and the link. Just an FYI to the rest of thee thread, downgrading node to 16.8.0 did not resolve the ECONNRESET [{"message":"14 UNAVAILABLE: read ECONNRESET"}] issue for me. It is still happening sporadically. What i am finding odd is when developing locally, i never get this error (i develop locally with with node 16.8.0). When i deploy my services with docker and docker swarm, this is when i get the issue. I have 9 services and i only get this issue with 2 of them, i am yet to see it happen with any of the others

i am using@grpc/grpc-js: 1.3.7

Jaypov commented 3 years ago

I am using node 16.8.0 and i have downgrade too

    "@grpc/grpc-js": "1.2.1",
    "@grpc/proto-loader": "0.5.1",

and now the ECONNRESET [{"message":"14 UNAVAILABLE: read ECONNRESET"}] has seemed to have disappear for me.

murgatroid99 commented 3 years ago

One of the 1.3.x releases changed how ECONNRESET errors were reported. In 1.2.x, they had the error code INTERNAL instead of UNAVAILABLE. Are you getting any INTERNAL errors with your new setup?

Jaypov commented 3 years ago

Ah damn, yes i am, it has just happend sporadically again. do you think it could be an issue with NestJs (the framework i use). It uses grpc-js under the hood. https://docs.nestjs.com/microservices/grpc

I manage the implement a retry interceptor as shown HERE However it hangs. The if statement condition will resolve to true and retry will be called, but then it will just hang. no response from onReceiveMessage or onReceiveStatus

          if (response.code !== status.OK) {
            console.log('CONNECTION FAILED: retry... ' + retries);
            retry(savedSendMessage, savedMetadata);
          } else {
            console.log('ALL GOOD');
            savedMessageNext(savedReceiveMessage);
            next(response);
          } 

At this point, there seems to be no quick fix solution. not sure if it is a library issue or a much deeper issue.

murgatroid99 commented 3 years ago

ECONNRESET is a relatively low-level network error. It's unlikely that it's caused by the framework.

You might want to try this version of the retry interceptor, which was actually tested. Just remove the code that references the registry, because that's there for the test.

Jaypov commented 3 years ago

That makes sense, thanks i'll give it a try. Thanks for all the help, appreciate it, been pulling my hair out.

Jaypov commented 3 years ago

Hey, so i have been poking around a lot today and whilst i do not have an answer i have some findings.

I use docker-compose to bring up all the microservice inside docker containers. When I run these containers on my local machine all the microservices work fine (never get [{"message":"14 UNAVAILABLE: read ECONNRESET"}])`

However when running them on my development network (AWS EC2 instance with all the containers on the same 1 VM) I do sporadically get the [{"message":"14 UNAVAILABLE: read ECONNRESET"}])`. It seems that I get the error randomly, then when i make a call again it will work multiple times. Then if i wait some time I will get the error again.

I enabled grpc-js debugging on the development network and i noticed this SSL error 150] Node error event: message=read ECONNRESET code=ECONNRESET errno=Unknown system error -104 syscall=read

This lead me to ssh'ing into the docker containers on my development network. and using this command line tool to try and connect to one microservice from another. https://github.com/fullstorydev/grpcurl

I found that if i send a message using -plaintext it successfully connects to the microservice every single time. If I use TLS I get first record does not look like a TLS handshake (which is an error from the command line tool).

just out of curiosity do we know if grpc-js has some sort of fallback for if TLS fails?

murgatroid99 commented 3 years ago

just out of curiosity do we know if grpc-js has some sort of fallback for if TLS fails?

That's not a thing, in general. If you instruct the client to connect over TLS, it will connect over TLS, or not all.

If the problem is that the response you're getting is not a TLS handshake, are you sure that you are connecting to a TLS server? The port that is serving the plaintext service can't also be serving TLS; are you sure you are connecting to the correct TLS server port when trying to make a request using TLS?

Jaypov commented 3 years ago

I am unsure, the instantiation of the grpc-js client happens under the hood in the nest-js framework. ill dig into the code and take a look.

on another note, i think this issue may possibly be cause by docker swarm, at this point i am just going through a process of elimination. I have booted a new EC2 instance and installed all of my containers with just docker-compose and the error has disappeared. I am just in the process of booting up another instance, installing with docker compose and then putting a load balancer in front of the 2 instances (this is my current setup but without swarm) to see if the issues comes back.

murgatroid99 commented 3 years ago

The questions I asked are not even necessarily about grpc-js. You said

This lead me to ssh'ing into the docker containers on my development network. and using this command line tool to try and connect to one microservice from another. https://github.com/fullstorydev/grpcurl

I found that if i send a message using -plaintext it successfully connects to the microservice every single time. If I use TLS I get first record does not look like a TLS handshake (which is an error from the command line tool).

What ports did you connect to for these two tests? On the server you connected to, what port is serving in plaintext and what port is serving in TLS?

brunor2 commented 3 years ago

Hey @Jaypov, I just want to know if you were able to fix it with that change in docker.

Thanks!

Siddhesh-Swami commented 2 years ago

We also face the same issue. Sparingly getting the "message":"14 UNAVAILABLE: read ECONNRESET".

some observations by me : Client Env: Nodejs version:14.16.1 @grpc/grpc-js: 1.3.8 and @grpc/proto-loader: 0.5.6

in some clients: Nodejs version:14.16.1 @grpc/grpc-js: 1.4.4 and @grpc/proto-loader: 0.5.6

Server Env: Nodejs version:14.16.1 @grpc/grpc-js: 1.4.4 and @grpc/proto-loader: 0.5.6

experience the "message":"14 UNAVAILABLE: read ECONNRESET". issue sparingly.


But in one microservice client: the client we use nestjs : Nodejs version:14.16.1 +-- @grpc/proto-loader@0.5.6 +-- grpc@1.24.6

Server: Nodejs version:14.16.1 @grpc/grpc-js: 1.4.4 and @grpc/proto-loader: 0.5.6

never seen read ECONNRESET once in the logs.


Siddhesh-Swami commented 2 years ago

migrated all nodejs services to: @grpc/grpc-js: 1.2.4 and @grpc/proto-loader: 0.5.6 but still face this issue

murgatroid99 commented 2 years ago

To new commenters, see my first response on this issue:

Sometimes connections get dropped for reasons outside of the library's control. Just retry the call when that happens. The client will reconnect when you do.

tomasholicky commented 2 years ago

We have same problem.

@murgatroid99 Do you have plan to implement interceptor (or something different) in next releases to grpc-js?

murgatroid99 commented 2 years ago

The client interceptor API has been available in grpc-js since before the 1.0 release.

Siddhesh-Swami commented 2 years ago

Hello, opened a new issue for read ECONNRESET. #1994

Arun-KumarH commented 2 years ago

2115 we are using the retry interceptor and it works for Unary requests, but for client streaming after the first chunk is sent the connection is closed and we receive back grpc response.

We followed this version of interceptor

Arun-KumarH commented 2 years ago

2115 This is a pressing issue for us, any updates how we can use the retry interceptor for streaming use cases ?

bastienlemaitre commented 2 years ago

Still no answer ?

railsonluna commented 2 years ago

Still no answer ?

felipe-huszar commented 2 years ago

Still no answer? We are also having this issue, its quite frustrating because we cant replicate it

rochecompaan commented 1 year ago

I'm running into the same issue with streaming requests, but I have additional info that might help. I'm specifically using the retry interceptor which closely matches the one from the tests in implementation.

I see the following traceback for streaming requests:

Trace: Request path: /object_attributes.object_attributes_grpc/ListAssortmentObjectAttributes; response status code: 0; not eligible for retry.
    at Object.onReceiveStatus (/workspace/atlas/common/lib/client/interceptors.ts:135:41)
    at InterceptingListenerImpl.onReceiveStatus (/workspace/atlas/node_modules/@grpc/grpc-js/src/call-stream.ts:206:19)
    at Object.onReceiveStatus (/workspace/atlas/node_modules/@grpc/grpc-js/src/client-interceptors.ts:424:48)
    at /workspace/atlas/node_modules/@grpc/grpc-js/src/call-stream.ts:334:24
    at processTicksAndRejections (node:internal/process/task_queues:78:11)
savedMessageNext: undefined
TypeError: savedMessageNext is not a function
    at Object.onReceiveStatus (/workspace/atlas/common/lib/client/interceptors.ts:141:33)
    at InterceptingListenerImpl.onReceiveStatus (/workspace/atlas/node_modules/@grpc/grpc-js/src/call-stream.ts:206:19)
    at Object.onReceiveStatus (/workspace/atlas/node_modules/@grpc/grpc-js/src/client-interceptors.ts:424:48)
    at /workspace/atlas/node_modules/@grpc/grpc-js/src/call-stream.ts:334:24
    at processTicksAndRejections (node:internal/process/task_queues:78:11)
[ERROR] 10:16:28 TypeError: savedMessageNext is not a function

Looking at InterceptingListenerImpl, it seems that a next method is not passed to nextListener which would result in savedMessageNext being undefined.

I had to link to the 1.7 branch above. I was surprised to see call-stream.ts is not on master anymore, and then I noticed work on the retry implementation by @murgatroid99. I would gladly try this out on our stack once it is ready and thanks for the great work!

HofmannZ commented 1 year ago

This works for us:

const channelOptions: ChannelOptions = {
  // Send keepalive pings every 10 seconds, default is 2 hours.
  'grpc.keepalive_time_ms': 10 * 1000,
  // Keepalive ping timeout after 5 seconds, default is 20 seconds.
  'grpc.keepalive_timeout_ms': 5 * 1000,
  // Allow keepalive pings when there are no gRPC calls.
  'grpc.keepalive_permit_without_calls': 1,
};

✌️

ANANDD18 commented 1 year ago

This is my output when I try to install ESP32 library by Espressif systems. PLEASE HELP!


Downloading packages esp32:xtensa-esp32-elf-gcc@esp-2021r2-patch5-8.4.0 esp32:xtensa-esp32s2-elf-gcc@esp-2021r2-patch5-8.4.0 esp32:xtensa-esp32s3-elf-gcc@esp-2021r2-patch5-8.4.0 esp32:riscv32-esp-elf-gcc@esp-2021r2-patch5-8.4.0 esp32:openocd-esp32@v0.11.0-esp32-20221026 esp32:esptool_py@4.5.1 esp32:mkspiffs@0.2.3 esp32:mklittlefs@3.0.0-gnu12-dc7f933 esp32:esp32@2.0.7 Installing esp32:xtensa-esp32-elf-gcc@esp-2021r2-patch5-8.4.0 Configuring tool. esp32:xtensa-esp32-elf-gcc@esp-2021r2-patch5-8.4.0 installed Installing esp32:xtensa-esp32s2-elf-gcc@esp-2021r2-patch5-8.4.0 Configuring tool. esp32:xtensa-esp32s2-elf-gcc@esp-2021r2-patch5-8.4.0 installed Installing esp32:xtensa-esp32s3-elf-gcc@esp-2021r2-patch5-8.4.0 Configuring tool. esp32:xtensa-esp32s3-elf-gcc@esp-2021r2-patch5-8.4.0 installed Installing esp32:riscv32-esp-elf-gcc@esp-2021r2-patch5-8.4.0 Configuring tool. esp32:riscv32-esp-elf-gcc@esp-2021r2-patch5-8.4.0 installed Installing esp32:openocd-esp32@v0.11.0-esp32-20221026 Configuring tool. esp32:openocd-esp32@v0.11.0-esp32-20221026 installed Installing esp32:esptool_py@4.5.1 Configuring tool. esp32:esptool_py@4.5.1 installed Installing esp32:mkspiffs@0.2.3 Configuring tool. esp32:mkspiffs@0.2.3 installed Installing esp32:mklittlefs@3.0.0-gnu12-dc7f933 Configuring tool. esp32:mklittlefs@3.0.0-gnu12-dc7f933 installed Installing platform esp32:esp32@2.0.7 Failed to install platform: esp32:esp32. Error: 14 UNAVAILABLE: read ECONNRESET


nixjke commented 1 year ago

I set the url parameters to localhost:3000 And it works for me

export const grpcClientOptions: ClientOptions = {
  transport: Transport.GRPC,
  options: {
    package: 'hero',
    protoPath: join(__dirname, './hero/hero.proto'),
    url: 'localhost:3000',
  },
}

image

oddcitizenape commented 6 months ago

This works for us:

const channelOptions: ChannelOptions = {
  // Send keepalive pings every 10 seconds, default is 2 hours.
  'grpc.keepalive_time_ms': 10 * 1000,
  // Keepalive ping timeout after 5 seconds, default is 20 seconds.
  'grpc.keepalive_timeout_ms': 5 * 1000,
  // Allow keepalive pings when there are no gRPC calls.
  'grpc.keepalive_permit_without_calls': 1,
};

✌️

apply this configuration changed the error meesage I receive to: Error: 14 UNAVAILABLE: Connection dropped, but didn't resolve it. I'm getting it on a client which is streaming values from server after approx. 25 min. Client is implemented in grpc-js, server is implemneted in golang

Client: "@grpc/grpc-js": "^1.10.1", "@grpc/proto-loader": "^0.7.10", nodejs version: 20.5.0

Server: go 1.21.1 google.golang.org/grpc v1.58.1 google.golang.org/protobuf v1.31.0

UPDATE: after settings the server keepalive policy and settings, everything seems to work fine:

kaPolicy := grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
        MinTime:             10 * time.Second,
        PermitWithoutStream: true, // Allow ping without stream
})

kaParams := grpc.KeepaliveParams(keepalive.ServerParameters{
        Time: 10 * time.Second, // The default value is 2 hours.
        // After having pinged for keepalive check, the server waits for a duration
        // of Timeout and if no activity is seen even after that the connection is
        // closed.
        Timeout: 5 * time.Second, // The default value is 20 seconds.
})

UPDATE2: nope, didn't help: Error: 14 UNAVAILABLE: Connection dropped

alvincrisuy commented 4 months ago
2024-07-17T14:07:34.294Z    38bfa3f1-4ebf-5d89-b784-fce09cdf9fa5    INFO    error Error: 14 UNAVAILABLE: read ECONNRESET
    at callErrorFromStatus (/var/task/node_modules/@grpc/grpc-js/build/src/call.js:31:19)
    at Object.onReceiveStatus (/var/task/node_modules/@grpc/grpc-js/build/src/client.js:193:76)
    at Object.onReceiveStatus (/var/task/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:360:141)
    at Object.onReceiveStatus (/var/task/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:323:181)
    at /var/task/node_modules/@grpc/grpc-js/build/src/resolving-call.js:129:78
    at process.processTicksAndRejections (node:internal/process/task_queues:77:11)
for call at
    at ServiceClientImpl.makeUnaryRequest (/var/task/node_modules/@grpc/grpc-js/build/src/client.js:161:32)
    at ServiceClientImpl.<anonymous> (/var/task/node_modules/@grpc/grpc-js/build/src/make-client.js:105:19)
    at Observable._subscribe (/var/task/node_modules/@nestjs/microservices/client/client-grpc.js:195:48)
    at Observable._trySubscribe (/var/task/node_modules/rxjs/dist/cjs/internal/Observable.js:41:25)
    at /var/task/node_modules/rxjs/dist/cjs/internal/Observable.js:35:31
    at Object.errorContext (/var/task/node_modules/rxjs/dist/cjs/internal/util/errorContext.js:22:9)
    at Observable.subscribe (/var/task/node_modules/rxjs/dist/cjs/internal/Observable.js:26:24)
    at /var/task/src/main.js:1:39406
    at new Promise (<anonymous>)
    at lastValueFrom (/var/task/src/main.js:1:39367) {
  code: 14,
  details: 'read ECONNRESET',
  metadata: Metadata { internalRepr: Map(0) {}, options: {} }
}

Still seeing this issue intermittent

AllanOliveiraM commented 2 days ago

Hi everyone,

I created a discussion on the google-cloud-node project because I am facing a similar issue with Google Cloud Tasks and App Engine in NestJS.

Can anyone help me resolve it?