julienfouilhe closed this issue 7 months ago
Can you narrow down if this is a problem on the handling requests side or the making requests side?
It seems to be on the "making requests" side: I can see logs coming in for the grpc-js server, but the service it makes requests to does not receive them (that other microservice is written in Rust and therefore does not run grpc-js).
Hey, we faced a similar issue after upgrading to 1.10.2. The best we know so far is that after some time (not a lot), new requests just hang without getting any data from the server. We have many instances of this service that use grpc-js, and eventually they all begin to fail and never recover. This could be related to idleness, since they did not all start failing at the same time.
Reverting the version made the issue go away.
I can confirm this. We had the exact same issue with multiple services running on GCP. Downgrading was the fix; hopefully this gets addressed soon.
Confirmed. We experienced the exact same issue, also for multiple services running on GCP. It especially affected the Datastore Node client, where few requests succeeded because transactions would time out.
Yep, I forgot to mention it, but my services are also running on GCP. More specifically on Cloud Run, which does not allocate CPU outside requests (maybe that's a hint?).
I just published version 1.10.3 with a change that reverts my best guess for the cause of this problem. Please try it out.
We're experiencing a similar issue using Google Cloud PubSub, which is gRPC-based. Same symptoms: we're getting DEADLINE_EXCEEDED, or the client says it waited too long for response data.
Pushing an update to our backend now with 1.10.3 and hope that fixes it. We noticed that it happened during times of inactivity, so https://github.com/grpc/grpc-node/pull/2677 could also be a culprit, but we'll know more in the next few hours/days as it was very random.
@jeffijoe can you be more specific about what part of #2677 you think might cause this problem? If you're talking about the session idle timeout change, that shouldn't be relevant here, because this bug is on the client side and that was a server-side change.
@murgatroid99 I was mostly skimming that one, I saw "idle" mentioned there and figured it could be related considering how we've observed this often during downtime (overnight), so don't mind me 😅
Is this issue the same as what we are seeing? We are seeing hangs in our usage of Firestore; after turning on the grpc traces, we see v1.10.2 in the logs.
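For anyone else who wants to check their own traces: a minimal sketch of the environment needed to turn on grpc-js logging, assuming the documented `GRPC_VERBOSITY` and `GRPC_TRACE` variables. `buildTraceEnv` is a hypothetical helper written for this example, not part of grpc-js. The variables need to be in place before the library is loaded.

```typescript
// Assumption: GRPC_VERBOSITY and GRPC_TRACE are the variables grpc-js reads
// for logging. buildTraceEnv is a hypothetical helper, not a library API.
function buildTraceEnv(tracers: string[] = ["all"]): Record<string, string> {
  return {
    GRPC_VERBOSITY: "DEBUG",
    // Comma-separated tracer names; "all" enables every tracer.
    GRPC_TRACE: tracers.join(","),
  };
}

// Example: pass the result as `env` when spawning the service, or export the
// same two variables in the shell before starting Node.
const traceEnv = buildTraceEnv();
console.log(traceEnv);
```

Enabling every tracer is noisy, so passing a shorter list of tracer names may be more practical on a busy service.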
@michaelAtCoalesce it seems to be the same issue, yes.
Is this fixed by 1.10.3?
We haven't seen the same issue reoccurring since we upgraded to 1.10.3.
Problem description
I have a microservice that uses grpc-js both to serve requests and to make requests to other services. After upgrading from 1.10.1 to 1.10.2, we noticed that a lot of requests were not going through anymore. Looking at the request latency graph, we saw a spike shortly after the grpc-js upgrade was released, so we downgraded immediately, and the service went back to normal.
Reproduction steps
I haven't tried to reproduce it locally yet, as it just occurred and I thought it would be best to report it immediately. It also seems hard to reproduce. Our pre-production environment did not encounter any issues so I guess it needs to reach a certain number of requests/second before it happens.
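For anyone attempting a reproduction, one way to make a hung call visible is to race it against a timer. The following is a minimal sketch under assumptions: `withTimeout` and the `CallTimeoutError` name are hypothetical and written for this example, not part of grpc-js, and the call is assumed to be wrapped in a Promise. It does not cancel the underlying request, it only reports that the request never settled.

```typescript
// Hypothetical helper, not part of grpc-js: rejects if `call` has not
// settled within `timeoutMs` milliseconds.
function withTimeout<T>(call: Promise<T>, timeoutMs: number, label = "call"): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      const err = new Error(`${label} did not settle within ${timeoutMs} ms`);
      err.name = "CallTimeoutError"; // name chosen for this sketch
      reject(err);
    }, timeoutMs);

    call.then(
      (value) => {
        clearTimeout(timer); // settled in time, so stop the timer
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}

// Usage: count how many outgoing requests hang while driving load.
async function probe(requests: Array<() => Promise<unknown>>, timeoutMs: number): Promise<number> {
  let hung = 0;
  for (const makeRequest of requests) {
    try {
      await withTimeout(makeRequest(), timeoutMs, "probe request");
    } catch (error) {
      if ((error as Error).name === "CallTimeoutError") hung += 1;
    }
  }
  return hung;
}
```

Counting timeouts separately from ordinary errors helps distinguish the hang described here from requests that fail fast for unrelated reasons.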
Environment
Additional context
Client and server libraries are generated using protobuf-ts@2.9.3