googleapis / nodejs-firestore

Node.js client for Google Cloud Firestore: a NoSQL document database built for automatic scaling, high performance, and ease of application development.
https://cloud.google.com/firestore/
Apache License 2.0

RST_STREAM error keeps showing up #1023

Open jakeleventhal opened 4 years ago

jakeleventhal commented 4 years ago

Environment details

Steps to reproduce

This error keeps appearing over and over in my logs (not regularly reproducible):

Error: 13 INTERNAL: Received RST_STREAM with code 2
at Object.callErrorFromStatus (/api/node_modules/@grpc/grpc-js/src/call.ts:81)
at Object.onReceiveStatus (/api/node_modules/@grpc/grpc-js/src/client.ts:324)
at Object.onReceiveStatus (/api/node_modules/@grpc/grpc-js/src/client-interceptors.ts:439)
at Object.onReceiveStatus (/api/node_modules/@grpc/grpc-js/src/client-interceptors.ts:402)
at Http2CallStream.outputStatus (/api/node_modules/@grpc/grpc-js/src/call-stream.ts:228)
at Http2CallStream.maybeOutputStatus (/api/node_modules/@grpc/grpc-js/src/call-stream.ts:278)
at Http2CallStream.endCall (/api/node_modules/@grpc/grpc-js/src/call-stream.ts:262)
at ClientHttp2Stream.<anonymous> (/api/node_modules/@grpc/grpc-js/src/call-stream.ts:532)
at ClientHttp2Stream.emit (events.js:315)
at ClientHttp2Stream.EventEmitter.emit (domain.js:485)
at emitErrorCloseNT (internal/streams/destroy.js:76)
at processTicksAndRejections (internal/process/task_queues.js:84)

schmidt-sebastian commented 4 years ago

@jakeleventhal Thanks for filing this. Do you know at all what type of request/API call this is related to?

jakeleventhal commented 4 years ago

The only place I use gRPC is via Firestore, so it must be something from there. I don't really have more info, unfortunately.

whyvrafvr commented 4 years ago

I also got this one with firebase-admin 8.12.1:


{"error":"Error: 13 INTERNAL: Received RST_STREAM with code 2
    at Object.callErrorFromStatus (/usr/src/app/node_modules/@grpc/grpc-js/build/src/call.js:30:26)
    at Object.onReceiveStatus (/usr/src/app/node_modules/@grpc/grpc-js/build/src/client.js:175:52)
    at Object.onReceiveStatus ...

whyvrafvr commented 4 years ago

This is a weird, erratic issue, thrown during delete as well as create requests, or whatever the request type...

schmidt-sebastian commented 4 years ago

All document modifications (e.g. delete(), create()) use the same underlying request type. This request type is not safe for automatic retry, so unfortunately we cannot simply retry these errors.
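
For anyone who needs a stopgap at the application level in the meantime, here is a minimal sketch of a retry wrapper. It is not part of the SDK and the helper name is made up; it only retries when the caller asserts the write is idempotent (e.g. a set() that overwrites a document with the same data), since the SDK itself cannot make that call.

// Minimal sketch (not part of the SDK): retry a write only when the caller
// knows it is idempotent. Blind retries of create() or FieldValue.increment()
// could double-apply or mask real failures.
async function retryIfIdempotent(writeFn, { idempotent = false, attempts = 3 } = {}) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await writeFn();
    } catch (err) {
      const isInternal = err.code === 13; // gRPC INTERNAL, e.g. "Received RST_STREAM with code 2"
      if (!idempotent || !isInternal || attempt >= attempts) throw err;
    }
  }
}

// Usage: overwriting a document with a fixed payload is safe to retry.
// await retryIfIdempotent(() => docRef.set({ status: 'done' }), { idempotent: true });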

whyvrafvr commented 4 years ago

Hi @schmidt-sebastian. I can confirm the request type is not safe. Retrying the same call many times (and so the same request, I guess) has no significant effect. But there is no error once the Node.js instance is restarted: I guess a gRPC subroutine is hanging.

whyvrafvr commented 4 years ago

After a while, errors, as mentioned by @jakeleventhal, are thrown and it's not possible to persist data in Firestore; an instance restart is required. That's a real problem folks 😄 FYI, I've no issue using the Firestore delete() and create() methods with Go...

schmidt-sebastian commented 4 years ago

@skonx To rule out that this is a problem with our backend, can you send me your project ID and the time at which you saw these repeated failures? My email is mrschmidt<at>google.com.

If it is not a backend issue, we likely need gRPC logs to further diagnose these failures. These can be obtained by setting two environment variables: GRPC_TRACE=all, GRPC_VERBOSITY=DEBUG

Note that this creates a lot of logs, so hopefully this is not something we have to look at.
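
For anyone collecting these logs: the variables are read by @grpc/grpc-js from the environment, so they are normally set when launching the process. A sketch of setting them programmatically is below; this should work as long as it runs before the Firestore client is loaded, but treat that as an assumption rather than a documented API.

// Sketch: enable verbose gRPC tracing before the client library is required.
// Warning: this produces a very large volume of logs; only enable it while reproducing.
process.env.GRPC_TRACE = 'all';
process.env.GRPC_VERBOSITY = 'DEBUG';

const { Firestore } = require('@google-cloud/firestore');
const firestore = new Firestore();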

whyvrafvr commented 4 years ago

Yep. I've replied to your google.com email address. I will enable the environment variables ASAP and keep you posted.

ConorEB commented 4 years ago

Hello! We also experience the same error usually coupled with Error: 8 RESOURCE_EXHAUSTED: Bandwidth exhausted

We noticed that it usually fails all at once and then recovers very quickly.

sk- commented 4 years ago

We are also affected by this error. A month ago it was a sporadic error, maybe once a week, and now we are seeing it many times per day.

merlinnot commented 4 years ago

I've also been experiencing the issue for over a month now, with a couple hundred to a few thousand errors per day:

@schmidt-sebastian I opened a support case if you need a project ID or other information: 23689958.

zenati commented 4 years ago

Same problem here: Received RST_STREAM with code 2

ConorEB commented 4 years ago

We were able to resolve the Error: 8 RESOURCE_EXHAUSTED: Bandwidth exhausted by upgrading to the nodejs12 runtime on App Engine and updating Electron to the newly released v9. So far we have not seen it again in production, since the root cause was related to this Node.js patch: https://nodejs.org/en/blog/release/v12.14.1/

[9fd6b5e98b] - http2: fix session memory accounting after pausing (Michael Lehenbauer) #30684

After the update, the various other DB related errors have also gone away, including Received RST_STREAM with code 2.

We currently use Electron in a Google App Engine project and work in the standard runtime. Google has not updated its images to include some necessary libs for the new version of Electron. We were able to work around this as Google supports these libs in Puppeteer, so we installed Puppeteer and sent a deploy off to our production server (on a no-promote flag). After doing so, our best guess is that Google rebuilt our server images with the needed libs to run Puppeteer, which in turn allowed us to run the new version of Electron.

I hope this information helps! We spent a lot of time diving into this, so if you have any questions, feel free to respond below.

schmidt-sebastian commented 4 years ago

Our backend team has looked at the projects that were sent to us. There are no indications that your errors correlate with errors on the backend, which makes it likely that this is a client side issue. I will continue to investigate possible root causes.

schmidt-sebastian commented 4 years ago

As of v3.8.5, we are now retrying RunQuery requests that fail with RST_STREAM. If this error shows up again, we can evaluate expanding our retry to other SDK methods as well.

merlinnot commented 4 years ago

@schmidt-sebastian I've been running on v3.8.5 for the entire day and I still see RST_STREAM.

I checked my code to see for which usage patterns it occurs:

Can we please reopen this issue for visibility?

schmidt-sebastian commented 4 years ago

@merlinnot We continue to receive pushback on retrying RST_STREAM/INTERNAL for Commit RPCs. I hope your comment is enough to convince the respective folks that we should retry.

merlinnot commented 4 years ago

Thank you. If needed, I can provide a histogram or any other details, just let me know.

ururk commented 4 years ago

This is happening to a project I'm working on as well, using nodejs 10 runtime in Google Cloud functions (via firebase). I have four scripts that fail at least once a day - they make await reference.set() calls to firestore as the script executes. At the moment I don't have any retry logic. The scripts run every two minutes. They make at most 20 set calls.

schmidt-sebastian commented 4 years ago

I was able to reproduce this with GRPC trace logging enabled. I hope this will allow us to make progress here:

2020-06-12T21:42:31.665Z | call_stream | [1518132] attachHttp2Stream from subchannel 172.217.214.95:443
2020-06-12T21:42:31.665Z | subchannel | 172.217.214.95:443 callRefcount 99 -> 100
2020-06-12T21:42:31.665Z | call_stream | [1518132] sending data chunk of length 130 (deferred)
2020-06-12T21:42:31.665Z | call_stream | [1518132] calling end() on HTTP/2 stream
Firestore (3.8.1) 2020-06-12T21:42:31.714Z bH7ni [WriteBatch.commit]: Sending 1 writes
Firestore (3.8.1) 2020-06-12T21:42:31.714Z bH7ni [ClientPool.acquire]: Creating a new client
Firestore (3.8.1) 2020-06-12T21:42:32.284Z ##### [Firestore]: Initialized Firestore GAPIC Client
Firestore (3.8.1) 2020-06-12T21:42:32.284Z bH7ni [Firestore.request]: Sending request: {"database":"projects/xyz/databases/(default)","writes":[{"update":{"name":"projects/xyz/databases/(default)/documents/a/DpuiiezGapQVLrrdqFfo","fields":{}},"currentDocument":{"exists":false}}]}
Firestore (3.8.1) 2020-06-12T21:42:32.548Z G31QV [WriteBatch.commit]: Sending 1 writes
Firestore (3.8.1) 2020-06-12T21:42:32.548Z G31QV [ClientPool.acquire]: Re-using existing client with 99 remaining operations
Firestore (3.8.1) 2020-06-12T21:42:32.548Z G31QV [Firestore.request]: Sending request: {"database":"projects/xyz/databases/(default)","writes":[{"update":{"name":"projects/xyz/databases/(default)/documents/a/Ua5mu1wRhNRExZLrX6K5","fields":{}},"currentDocument":{"exists":false}}]}
Firestore (3.8.1) 2020-06-12T21:42:32.599Z YsE7h [WriteBatch.commit]: Sending 1 writes
Firestore (3.8.1) 2020-06-12T21:42:32.599Z YsE7h [ClientPool.acquire]: Re-using existing client with 98 remaining operations
Firestore (3.8.1) 2020-06-12T21:42:32.599Z YsE7h [Firestore.request]: Sending request: {"database":"projects/xyz/databases/(default)","writes":[{"update":{"name":"projects/xyz/databases/(default)/documents/a/EqrauWpiU7wXdJWnFJyb","fields":{}},"currentDocument":{"exists":false}}]}
...
2020-06-12T21:42:33.833Z | call_stream | [1518033] HTTP/2 stream closed with code 2
2020-06-12T21:42:33.833Z | call_stream | [1518033] ended with status: code=13 details="Received RST_STREAM with code 2"
2020-06-12T21:42:33.833Z | subchannel | 172.217.214.95:443 callRefcount 100 -> 99
2020-06-12T21:42:33.833Z | call_stream | [1518034] HTTP/2 stream closed with code 2
2020-06-12T21:42:33.833Z | call_stream | [1518034] ended with status: code=13 details="Received RST_STREAM with code 2"
2020-06-12T21:42:33.833Z | subchannel | 172.217.214.95:443 callRefcount 99 -> 98
2020-06-12T21:42:33.833Z | call_stream | [1518035] HTTP/2 stream closed with code 2
2020-06-12T21:42:33.833Z | call_stream | [1518035] ended with status: code=13 details="Received RST_STREAM with code 2"
2020-06-12T21:42:33.833Z | subchannel | 172.217.214.95:443 callRefcount 98 -> 97
2020-06-12T21:42:33.833Z | call_stream | [1518036] HTTP/2 stream closed with code 2
2020-06-12T21:42:33.833Z | call_stream | [1518036] ended with status: code=13 details="Received RST_STREAM with code 2"
2020-06-12T21:42:33.833Z | subchannel | 172.217.214.95:443 callRefcount 97 -> 96
...
2020-06-12T21:42:33.838Z | subchannel | 172.217.214.95:443 callRefcount 1 -> 0
2020-06-12T21:42:33.839Z | subchannel | 172.217.214.95:443 connection closed with error read ECONNRESET
2020-06-12T21:42:33.839Z | subchannel | 172.217.214.95:443 READY -> IDLE
2020-06-12T21:42:33.839Z | subchannel | 172.217.214.95:443 refcount 2 -> 1
2020-06-12T21:42:33.839Z | pick_first | READY -> IDLE
2020-06-12T21:42:33.839Z | resolving_load_balancer | dns:firestore.googleapis.com:443 READY -> IDLE
2020-06-12T21:42:33.839Z | connectivity_state | dns:firestore.googleapis.com:443 READY -> IDLE
Firestore (3.8.1) 2020-06-12T21:42:33.842Z mQ1lw [Firestore.request]: Received error: { Error: 13 INTERNAL: Received RST_STREAM with code 2
    at Object.callErrorFromStatus (/home/mrschmidt/test/node_modules/@grpc/grpc-js/build/src/call.js:30:26)
    at Object.onReceiveStatus (/home/mrschmidt/test/node_modules/@grpc/grpc-js/build/src/client.js:175:52)
    at Object.onReceiveStatus (/home/mrschmidt/test/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:341:141)
    at Object.onReceiveStatus (/home/mrschmidt/test/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:304:181)
    at Http2CallStream.outputStatus (/home/mrschmidt/test/node_modules/@grpc/grpc-js/build/src/call-stream.js:116:74)
    at Http2CallStream.maybeOutputStatus (/home/mrschmidt/test/node_modules/@grpc/grpc-js/build/src/call-stream.js:155:22)
    at Http2CallStream.endCall (/home/mrschmidt/test/node_modules/@grpc/grpc-js/build/src/call-stream.js:141:18)
    at ClientHttp2Stream.stream.on (/home/mrschmidt/test/node_modules/@grpc/grpc-js/build/src/call-stream.js:403:22)
    at emitOne (events.js:116:13)
    at ClientHttp2Stream.emit (events.js:211:7)
  code: 13,
  details: 'Received RST_STREAM with code 2',
  metadata: Metadata { internalRepr: Map {}, options: {} } }

leandroz commented 4 years ago

I got the error once today, maybe it helps:

Error: 13 INTERNAL: Received RST_STREAM with code 2
    at Object.callErrorFromStatus (/srv/node_modules/@grpc/grpc-js/build/src/call.js:30:26)
    at Object.onReceiveStatus (/srv/node_modules/@grpc/grpc-js/build/src/client.js:175:52)
    at Object.onReceiveStatus (/srv/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:341:141)
    at Object.onReceiveStatus (/srv/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:304:181)
    at Http2CallStream.outputStatus (/srv/node_modules/@grpc/grpc-js/build/src/call-stream.js:115:74)
    at Http2CallStream.maybeOutputStatus (/srv/node_modules/@grpc/grpc-js/build/src/call-stream.js:154:22)
    at Http2CallStream.endCall (/srv/node_modules/@grpc/grpc-js/build/src/call-stream.js:140:18)
    at ClientHttp2Stream.stream.on (/srv/node_modules/@grpc/grpc-js/build/src/call-stream.js:396:22)
    at emitOne (events.js:116:13)
    at ClientHttp2Stream.emit (events.js:211:7)
    at emit (internal/http2/core.js:177:8)
    at _combinedTickCallback (internal/process/next_tick.js:142:11)
    at process._tickDomainCallback (internal/process/next_tick.js:219:9)
const functions = require('firebase-functions');
const admin = require('firebase-admin');

// Assumption: `increment` is a FieldValue transform defined elsewhere, e.g.:
const increment = admin.firestore.FieldValue.increment(1);

exports.onTicketMessageCreate = functions.firestore
  .document('tickets/{ticketId}/messages/{messageId}')
  .onCreate(async (snap, context) => {
    const message = snap.data();
    await snap.ref.parent.parent?.update({
      messages_count: increment,
      'dates.last_message': message?.dates.created,
    });
  });

tommyjs007 commented 4 years ago

Might be a problem in @grpc/grpc-js, as I don't use nodejs-firestore and still face this error once in a while. It happens with stale connections.

danopia commented 4 years ago

I'm also seeing very intermittent Received RST_STREAM with code 2 errors, which in my case are followed by an endless pile of Failed to start HTTP/2 stream and Error: Exceeded maximum number of retries allowed. errors until the server process is terminated. I'm running Node.js 14 in a GKE pod. I'm not sure why the client can't automatically recover; it feels gross to set up a liveness probe to kill the process just because of this issue.

schmidt-sebastian commented 4 years ago

I filed b/160000476 (Google internal). I do hope that we can get some traction.

merlinnot commented 4 years ago

@schmidt-sebastian Do you have some updates on the issue? On my side, just regular outages, nothing new 😂

whyvrafvr commented 4 years ago

@schmidt-sebastian Do you have some updates on the issue? On my side, just regular outages, nothing new 😂

Same...

fsikansi commented 4 years ago

Had the same issue today: Received RST_STREAM with code 2 followed by Failed to start HTTP/2 stream. Running on GKE; we had to recycle the pod to fix the outage.

@google-cloud/firestore version 4.0.0

planetsanzio commented 4 years ago

Ran into the same Error: 13 INTERNAL: Received RST_STREAM with code 2 running Node.js 10 in Google Cloud Functions. I was trying to delete a document from Firestore. If it helps, the function is a cron job called via HTTP. The only thing in the document was an array containing fifty elements. I have redeployed to the Node.js 8 environment and have not seen the issue since. Has anyone else tried different Node versions and experienced any success?

merlinnot commented 4 years ago

@JustinBeckwith @murgatroid99 I saw your involvement in https://github.com/nodejs/help/issues/2105, which seems related at first glance. Thought I'd tag you here in case it might be useful, especially if it's an issue with Node.js or @grpc/grpc-js.

I did more research on this topic with trace logs enabled for gRPC channels (don't blame me for killing Google Cloud Logging 😉).

What I see in my logs is:

  1. First, I see an abnormally high number of Pick result: QUEUE subchannel: undefined status: undefined undefined. Normally, when the application is healthy, I usually see that next to dns:firestore.googleapis.com:443 createCall [251818] method="/google.firestore.v1.Firestore/Listen". Please see the two screenshots below: the first one is "regular" (application works fine), the second one is from right before further errors happen.

    [Screenshots: 2020-07-02 09:49:42 and 09:50:46]
  2. Then I see an error thrown a couple of times: Error: 13 INTERNAL: Received RST_STREAM with code 2. FYI, "disconnecting" on the screenshot is my application's log statement, not related to the issue.

    [Screenshot: 2020-07-02 09:52:50]
  3. Then I can still see a couple (not a lot, maybe 20?) of logs without errors:

    [Screenshot: 2020-07-02 09:50:46]
  4. Then I can see failures due to the session being destroyed: Failed to start call on picked subchannel 216.58.205.42:443 with error The session has been destroyed. Ending call. That error is logged ~200 times.

    [Screenshot: 2020-07-02 09:57:09]
  5. Next, it goes back to Pick result: QUEUE subchannel: undefined status: undefined undefined next to createCall.

    [Screenshot: 2020-07-02 09:58:43]
  6. Then again Pick result: QUEUE subchannel: undefined status: undefined undefined, but this time with no createCall. That happens a couple hundred times.

    [Screenshot: 2020-07-02 10:01:02]
  7. I can then see some Pick result: COMPLETE subchannel: 216.58.198.170:443 status: undefined undefined, maybe a few dozen.

    [Screenshot: 2020-07-02 10:03:26]
  8. It then starts happening next to createCall:

    [Screenshot: 2020-07-02 10:05:31]
  9. Next, it starts alternating between QUEUE/COMPLETE:

    [Screenshot: 2020-07-02 10:07:11]
  10. And then the worst part starts: Failed to start call on picked subchannel 216.58.204.42:443 with error New streams cannot be created after receiving a GOAWAY. Retrying pick. It goes on forever until the node is restarted.

    [Screenshot: 2020-07-02 10:20:28]

For any Googler who'd like to examine the logs themselves:

murgatroid99 commented 4 years ago

@merlinnot I'm sorry to say that most of what you have there is not meaningful patterns, just varied interleavings of gRPC API calls and internal state changes. Still, there are a couple of interesting things here.

The error in your final log is in fact my workaround for the issue you linked to. That error means that the underlying http2 session seems to be healthy right up to the point where gRPC tries to start a stream, then starting the stream throws an error saying "New streams cannot be created after receiving a GOAWAY". The expectation is that this error raced with the session emitting a "goaway" event, and that by the time the retry goes through that connection will have been torn down and a new one will be started. If that error just keeps happening forever, it might imply that the http2 session is throwing that error without ever emitting an event indicating that it can no longer be used, which would be pretty bad.

The error in number 4 follows the same code path, but that is not a recognized error so it just ends the call with an error instead of automatically retrying. Again, if that error is happening repeatedly, that might imply that there is an unreasonably long time between "The http2 session becomes unusable for starting streams" and "the http2 session emits an event indicating that it cannot be used for starting streams". There is actually a simple way to handle those errors better, if I just assume that I cannot rely on the http2 session to emit disconnection events when it becomes disconnected.

The error in number 2 is one that has been reported several times in various contexts, but it really just means "Internal server error". I'm planning to add that information to that error message.

schmidt-sebastian commented 4 years ago

Since all my other attempts of getting this resolved have stalled, I have started an internal discussion on whether we can retry writes that we deem to be safe to retry (e.g. no FieldValue.increment() transforms). A modified client that I have running with retries enabled has not seen a single failed write in over a week, whereas before it failed with a RST_STREAM error after 2 days.

merlinnot commented 4 years ago

@murgatroid99 That's also what I understood from the code; let me rephrase it to make sure I understand correctly: when we receive ERR_HTTP2_GOAWAY_SESSION, we retry immediately, as the expectation is that a goaway event will be emitted and, when retrying, we won't bump into that again.

However, @JustinBeckwith reported that he receives ERR_HTTP2_GOAWAY_SESSION without ever receiving goaway.

Is it then an idea to also close the session when we receive ERR_HTTP2_GOAWAY_SESSION?

From what I can see, transitionToState takes the current expected state as a precondition, so it could easily be called in an idempotent way?

murgatroid99 commented 4 years ago

That's pretty much the idea, yes. Based on some separate conversations I had with Justin, I thought the error he was encountering was unrelated, but maybe gRPC can get to the same bad state.

merlinnot commented 4 years ago

@schmidt-sebastian In my case, when the New streams cannot be created after receiving a GOAWAY error starts happening, it never resolves - workers go to 100% CPU and the thread must be killed and restarted. I've seen it spinning for hours.

This always comes after Error: 13 INTERNAL: Received RST_STREAM with code 2. Previously I didn't have gRPC trace logs enabled, but the symptoms were exactly the same: a couple of RST_STREAM with code 2 and then 100% CPU straight away until crashes (OOM, manual restarts, ...).

So it looks to me that yes, we need to find a solution for RST_STREAM with code 2 (e.g. retries you mentioned) but there's also a separate issue with ERR_HTTP2_GOAWAY_SESSION which is somehow triggered by this error.

Nowadays I have very eager alerting for this issue, so I can try to trigger it once we have a potential solution. I noticed a correlation between the number of VMs that run my workloads and the frequency of these errors. I overprovisioned my infrastructure 6x and the errors basically went away; yesterday we deployed a change that makes 2x the number of requests to Firestore, and this morning 3 workers on 2 separate VMs had the issue. So I doubled the number of VMs. I guess if I reduce the number of VMs again, one would fail sooner or later.

I hope it's not Firestore's denial-of-service prevention kicking in and resetting the stream; do you think that's a possibility? Do you have logs we could check?

GyuriMajercsik commented 4 years ago

I'm not sure if what I experience is related but I'm posting it here.

I'm trying to write some tests for my cloud functions, but I'm receiving this error:

13 INTERNAL: Received RST_STREAM with code 2

      48 |         };
      49 | 
    > 50 |         await userRef.create(user);
         |                       ^
      53 | 

      at Object.callErrorFromStatus (functions/node_modules/@grpc/grpc-js/src/call.ts:81:24)
      at Object.onReceiveStatus (functions/node_modules/@grpc/grpc-js/src/client.ts:334:36)
      at Object.onReceiveStatus (functions/node_modules/@grpc/grpc-js/src/client-interceptors.ts:434:34)
      at Object.onReceiveStatus (functions/node_modules/@grpc/grpc-js/src/client-interceptors.ts:397:48)
      at Http2CallStream.outputStatus (functions/node_modules/@grpc/grpc-js/src/call-stream.ts:230:22)
      at Http2CallStream.maybeOutputStatus (functions/node_modules/@grpc/grpc-js/src/call-stream.ts:280:14)
      at Http2CallStream.endCall (functions/node_modules/@grpc/grpc-js/src/call-stream.ts:263:12)
      at ClientHttp2Stream.<anonymous> (functions/node_modules/@grpc/grpc-js/src/call-stream.ts:552:14)
      Caused by: Error: 
      at WriteBatch.commit (functions/node_modules/@google-cloud/firestore/build/src/write-batch.js:415:23)
      at DocumentReference.create (functions/node_modules/@google-cloud/firestore/build/src/reference.js:283:14)
      at Object.<anonymous> (functions/spec/users.spec.js:50:23)

Does somebody know if there is a workaround for this?

schmidt-sebastian commented 4 years ago

@GyuriMajercsik Since we currently cannot tell whether the write you issued is safe to retry, you need to manually retry the write. Note that Received RST_STREAM with code 2 translates to "Internal Error". This means that your write could have succeeded, in which case the next create() invocation will fail with FAILED_PRECONDITION, as create() enforces that no document exists at the path. We don't yet have a better story than this.
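
A rough sketch of such a manual retry is below. The helper name is made up, and the exact error code returned when the document already exists (ALREADY_EXISTS vs. FAILED_PRECONDITION) is treated as an assumption; the key point from the comment above is that an "already exists" failure on a retry likely means the first attempt actually succeeded.

// Sketch: manually retry a create() that failed with INTERNAL (code 13).
async function createWithRetry(docRef, data, attempts = 3) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await docRef.create(data);
    } catch (err) {
      // On a retry, "document already exists" (ALREADY_EXISTS = 6 or
      // FAILED_PRECONDITION = 9) likely means the first attempt went through.
      if (attempt > 1 && (err.code === 6 || err.code === 9)) return;
      // Only retry the INTERNAL (13) errors discussed in this thread.
      if (err.code !== 13 || attempt >= attempts) throw err;
    }
  }
}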

GyuriMajercsik commented 4 years ago

This happens in my unit tests every time I run them, not in production. I might not have been clear enough in my previous comment... Is there a way I can debug this or help narrow down the issue?

GyuriMajercsik commented 4 years ago

I found the cause of my problem: I had the wrong port (8081 instead of 8080) set for FIRESTORE_EMULATOR_HOST.
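
For reference, the Firestore emulator listens on port 8080 by default, so the setting would typically look like the sketch below (adjust the host/port to your emulator configuration):

// Point the client at the locally running emulator (default port 8080).
process.env.FIRESTORE_EMULATOR_HOST = 'localhost:8080';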

murgatroid99 commented 3 years ago

I have published @grpc/grpc-js version 1.1.2 with a change that should stop the spinning on the "New streams cannot be created after receiving a GOAWAY" problem.

I also modified the original error reported here to say "Received RST_STREAM with code 2 (Internal server error)". That should help clarify what causes the error for people who see it in the future.

merlinnot commented 3 years ago

@murgatroid99 I put the new version on production and significantly decreased the number of instances, which was previously triggering the error consistently. I'll observe it for 24h and report back if the spinning issue is fixed. Thanks for picking this up!

merlinnot commented 3 years ago

@murgatroid99 I got the RST_STREAM error a couple of times overnight, but subsequent calls were successful! So at least I can actually retry operations that fail. Thank you!

@schmidt-sebastian With half of the problem fixed, what are our next steps for the RST_STREAM error? I'm sure you already thought about it, but I'm wondering why we need retries in the client: can't it be fixed on the backend side? It happens to me every few hours; maybe there are some Firestore logs that we can check? If it's an internal error, I'd expect Google to have some logs for it?

merlinnot commented 3 years ago

I don't know if it's important, but I'm running Firestore in multi-region eur3.

radvansky-tomas commented 3 years ago

Me too, constantly now: https://github.com/googleapis/nodejs-datastore/issues/679#issuecomment-656573765

merlinnot commented 3 years ago

So up until now we have reports for multiple environments:

And at least three Node.js versions:

There's still some probability that all Node.js versions are broken, or networking in GCP (affecting all compute offerings) is misbehaving, or simply there's an issue with Firestore itself or client libraries.

I'm wondering how we can narrow the scope; I have two ideas:

schmidt-sebastian commented 3 years ago

The issue does reproduce for other client libraries and languages, but it seems to be at a much lower rate. We have gotten a few reports from Python users, but our user base is also heavily skewed towards Node.

The RST_STREAM errors that I have seen in my tests ended up not hitting the Firestore backend, and as such we need to look elsewhere for the culprit. With the new fixes in @grpc/grpc-js released, I am currently trying to reproduce it one more time. We also just obtained access to Google's frontend logs, which covers a part of our infrastructure that we haven't yet looked at. We might be able to speed this up if you have an exact time where you saw an error (with @grpc/grpc-js 1.1.2, as we don't want to look into errors that this release fixed), otherwise we have to wait until my repro triggers one.

If that doesn't work, the next culprit might be the network layer that GCP uses (the Remote Socket Gateway). In the early stages of our product development, most of our problems originated here, but as of late we have seen little evidence that the Remote Socket Gateway negatively impacts our networking.

merlinnot commented 3 years ago

Here are the timestamps since I deployed @grpc/grpc-js v1.1.2:

I can also provide IP addresses of VMs for each of those if that helps, you can also access logs I mentioned in one of the messages above.

Since you mentioned GCP networking stack, I also observed a similar issue recently where our mobile applications are not able to invoke Cloud Functions. Similarly, an internal error is returned, but there's no evidence of Cloud Functions being called. It might be completely unrelated, but sounds almost the same 🤷‍♂️ If you need more details, the case number is 24108749.

schmidt-sebastian commented 3 years ago

@merlinnot We looked at your logs. We see some failed Listen requests for this time window and a fair stream of ABORTED errors for writes. There are no other errors, so this means:

To get to the bottom of this, I need to figure out how to reproduce this in a relatively short amount of time, which would allow us to look at all internal network requests that lead to RST_STREAM.

Do you see this error more often on Cloud Functions than in other environments?

rdev-software commented 3 years ago

Resolved in my case. It was caused by the presence of:

// process.env.FIRESTORE_EMULATOR_HOST = 'localhost:8080';
// process.env.FIREBASE_FIRESTORE_EMULATOR_ADDRESS = 'localhost:8080';

which was causing an invalid Firestore connection.

merlinnot commented 3 years ago

Do you see this error more often on Cloud Functions than in other environments?

Not really, I don't see any pattern here. The issue occurs in Cloud Run, Cloud Functions and raw VMs. My Firestore usage is higher in Cloud Functions and VMs, so this error happens more frequently in these environments.