Requests seem not to respond anymore after upgrading to 1.10.2

julienfouilhe commented 8 months ago

Problem description

I have a microservice that uses grpc-js both to serve requests and to make requests to other services. After upgrading from 1.10.1 to 1.10.2, we noticed that a lot of requests were no longer going through. The request latency graph showed a spike shortly after the grpc-js upgrade was released, so we downgraded immediately and the service went back to normal.

Reproduction steps

I haven't tried to reproduce it locally yet, as it just occurred and I thought it would be best to report it immediately. It also seems hard to reproduce. Our pre-production environment did not encounter any issues so I guess it needs to reach a certain number of requests/second before it happens.

Environment

OS name, version and architecture: docker node:20.8.1-alpine image
Node version: 20.8.1
Node installation method: docker
Package name and version: gRPC@1.10.2

Additional context

Client and server libraries are generated using protobuf-ts@2.9.3

murgatroid99 commented 8 months ago

Can you narrow down if this is a problem on the handling requests side or the making requests side?

julienfouilhe commented 8 months ago

It seems to be on the "making requests" side, as I can see logs coming in for the grpc-js server, but the service it's making requests to does not receive the requests (this other microservice is written in Rust and therefore does not run grpc-js).

acdcjunior commented 8 months ago

Hey, we faced a similar issue after upgrading to 1.10.2. The best we know so far is that after some time (not a lot), new requests just hang without getting any data from the server. We have many instances of this service that use grpc-js, and eventually all of them begin to fail and never recover. This could be related to idleness, since they wouldn't all start to fail at the same time.

Reverting the version made the issue go away.

Ganitzsh commented 8 months ago

I can confirm this, we had the exact same issue with multiple services running on GCP. Downgrading was the fix, hopefully this gets addressed soon.

udnes99 commented 8 months ago

Confirmed. We experienced the exact same issue, also for multiple services running on GCP. It especially affected the Datastore Node client, where few requests succeeded because transactions would time out.

julienfouilhe commented 8 months ago

Yep I forgot to mention it but my services are also running on GCP. More specifically on Cloud Run, which does not allocate CPU outside requests (maybe that's a hint?).

murgatroid99 commented 8 months ago

I just published version 1.10.3 with a change that reverts my best guess for the cause of this problem. Please try it out.

jeffijoe commented 8 months ago

We're experiencing a similar issue using Google Cloud PubSub, which is gRPC-based. The symptoms are the same: we're either getting DEADLINE_EXCEEDED or the client says it waited too long for response data.

Pushing an update to our backend now with 1.10.3 and hope that fixes it. We noticed that it happened during times of inactivity, so https://github.com/grpc/grpc-node/pull/2677 could also be a culprit, but we'll know more in the next few hours/days as it was very random.
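For anyone trying to tell a hang apart from a slow response: a call with an explicit deadline should fail with DEADLINE_EXCEEDED rather than wait indefinitely. Below is a minimal sketch, assuming a grpc-js client whose methods accept call options with a `deadline` field. `deadlineFromNow`, `client`, and `someMethod` are hypothetical names for illustration, not part of any library.

```typescript
// Hypothetical helper (not part of grpc-js): builds an absolute deadline
// that lies `timeoutMs` milliseconds after `now`.
function deadlineFromNow(timeoutMs: number, now: Date = new Date()): Date {
  if (!Number.isFinite(timeoutMs) || timeoutMs <= 0) {
    throw new RangeError("timeoutMs must be a positive, finite number");
  }
  return new Date(now.getTime() + timeoutMs);
}

// Assumed usage with a generated grpc-js client (names are placeholders):
//   client.someMethod(request, { deadline: deadlineFromNow(5000) }, callback);
```

Clients that wrap grpc-js (PubSub, Firestore, protobuf-ts) may expose timeouts through their own options instead, so check the wrapper's documentation before relying on this shape.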

murgatroid99 commented 8 months ago

@jeffijoe can you be more specific about what part of #2677 you think might cause this problem? If you're talking about the session idle timeout change, that shouldn't be relevant here, because this bug is on the client side and that was a server-side change.

jeffijoe commented 8 months ago

@murgatroid99 I was mostly skimming that one, I saw "idle" mentioned there and figured it could be related considering how we've observed this often during downtime (overnight), so don't mind me 😅

michaelAtCoalesce commented 8 months ago

Is this issue the same as what we are seeing? We are seeing hangs in our usage of Firestore, and after turning on the grpc traces, we see v1.10.2 in the logs.

https://github.com/firebase/firebase-admin-node/issues/2495
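For others who want to collect the same traces: grpc-js reads the `GRPC_TRACE` and `GRPC_VERBOSITY` environment variables, which are usually set in the shell or container config. A rough sketch of setting them from code instead, assuming this runs before `@grpc/grpc-js` is first loaded. `enableGrpcTracing` is a hypothetical helper name, not a library function.

```typescript
// Hypothetical helper (not part of grpc-js): sets the environment variables
// that grpc-js consults for logging.
// Assumption: this is called before '@grpc/grpc-js' is first loaded,
// otherwise the library may already have read its logging configuration.
type Env = Record<string, string | undefined>;

function enableGrpcTracing(env: Env, tracers: string = "all"): Env {
  env["GRPC_TRACE"] = tracers;       // which tracers to enable, e.g. "all"
  env["GRPC_VERBOSITY"] = "DEBUG";   // log level
  return env;
}

// Assumed usage at the very top of the entry point:
//   enableGrpcTracing(process.env);
```

Tracing everything is very noisy, so it is probably best enabled only on an instance that is being used to reproduce the hang.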

julienfouilhe commented 8 months ago

@michaelAtCoalesce it seems to be the same issue yes.

FredrikAugust commented 7 months ago

Is this fixed by 1.10.3?

jeffijoe commented 7 months ago

We haven't seen the same issue reoccurring since we upgraded to 1.10.3.
