googleapis / google-cloud-node

Google Cloud Client Library for Node.js
https://cloud.google.com/nodejs
Apache License 2.0
2.92k stars 591 forks source link

Should retry on ECONNRESET #2254

Closed inlined closed 7 years ago

inlined commented 7 years ago

edit by @stephenplusplus

If you're just joining the conversation...

Here is where we stand: We have disabled the forever agent by default in Cloud Function environments. If you un- and re-install @google-cloud/storage, you will pick up the new behavior automatically.

Original Post

Environment details

Steps to reproduce

  1. git clone git@github.com:firebase/functions-samples.git
  2. cd generate-thumbnail
  3. firebase init
  4. firebase deploy
  5. play with app for a while and observe that it gets into a mode that always fails with ECONNRESET

Theory

I've noticed this in my own Cloud Functions and only in those that use google-cloud-node. I think the socket is dying between requests when the Cloud Functions environment is throttled. This should be OK, but I suspect the google-cloud-node module needs to add an extra error case here. I'll write a pull request to that effect.

stephenplusplus commented 7 years ago

Continuing from https://github.com/GoogleCloudPlatform/google-cloud-node/pull/2255#issuecomment-299329661 (cc @s2young)

I'm not able to get the process mentioned in this issue to fail. Does anyone have a fool proof way I can get this script or another to break? I've been uploading 24 images at a time and the thumbnails are all generated successfully.

s2young commented 7 years ago

My app is doing quite a bit more than the example, and I'm stripping some of this out to see if I can avoid the problem. But all is just standard Storage api stuff:

  1. Download the source file to tmp directory.
  2. Call getMetadata (so I can get the byte size)- this I'm stripping out.
  3. Resize the file (3 different sizes for different screens).
  4. Upload the resized images to Storage.
  5. Delete the source file from Cloud Storage.
  6. Update firebase real-time db with info about the resized images.

The application is a mobile app, with a content management piece that allows users to upload photos when building a layout. Images are resized for display on Tablets, Large/Small phones, etc. @stephenplusplus

s2young commented 7 years ago

One more caveat @stephenplusplus: If you follow the pattern I've given, make sure you ignore the upload of the copies. In my case, when I upload a resized copy I am adding some metadata to the image so that the imageListener function can ignore those storage events. I got myself into an infinite loop early on!

stephenplusplus commented 7 years ago

Haha, thanks for the tip! I was able to reproduce the error (ECONNRESET and ESOCKETTIMEDOUT) but it was only when I was deleting the file before it was done being used. I'm not sure why that triggered those errors, but that's the only thing I can pin-point as the cause. I'll continue to play around.

Here's my script-- https://gist.github.com/stephenplusplus/1ac55732adadc1562737cf903307a98d. It takes an upload and creates 10 thumbnails. I'm testing with 25 at a time, and I'm consistently ending up with the correct total of images in the bucket, 250.

inlined commented 7 years ago

My theory theory is this happens between periods of inactivity with the Cloud Function instance is throttled. I think there's a cached socket that dies--possibly from being too CPU starved to acknowledge any keep-alive fast enough. Try the same script now (hopefully your instance is still alive) or a 20m break after that attempt. Once you see it you should see the error be pretty fatal until a redeploy (which forces a reallocation of your Cloud Function instance)

s2young commented 7 years ago

Another update, in case it helps. I have another Firebase Function that is listening to the RealTime DB for changes and then deleting the related image from Google Cloud Storage. Basically, in my CMS when a user deletes an image he's deleting the data representation of it in Firebase DB. The listener then deletes the actual image from Storage.

I am experiencing the ECONNRESET error in this db listener function. The result is I have orphaned images that are attached to object ids that no longer exist in my db.

stephenplusplus commented 7 years ago

@inlined I ran the script successfully, waited 20 minutes, and it ran successfully again without errors.

Would anyone be able to give me a repo I can clone, deploy, and then steps I can follow to reproduce the error?

s2young commented 7 years ago

Hey all, I forked your gist and add a couple of things to hopefully help - though I didn't test this so be warned!

1) Added a step to the thumbnail sequence that writes a node to the real-time db referencing the image's path. 2) Added a firebase db listener that is triggered when you delete the node created in step 1. It then removes the file from Cloud Storage (or at least is supposed to!).

My recommendation is run the script, wait a few minutes, and then manually delete some of the nodes created in the real-time db and see if the ECONNRESET happens. I hope this helps @stephenplusplus !

https://gist.github.com/s2young/068f082fd30d17f0a1f2146d953106de

s2young commented 7 years ago

Hey @stephenplusplus I've added you as a collaborator to my 'app-server' project. You should be able to deploy this to your firebase project, and then locally run 'jasmine' command to trigger the /spec/dev/mage.spec.js test. In this test's current state, I'm seeing tons of ECONNRESETs in my Firebase Functions logs. I hope it helps.

https://github.com/GobaLLC/app-server/invitations

s2young commented 7 years ago

Also, if it helps, I noticed a big jump in ECONNRESETs when I used a singleton instance of a google-cloud/storage object rather than one instance per function.

s2young commented 7 years ago

And yet another thing: I just updated my firebase-tools, and now the firebase.config().firebase call no longer works, so the code that names the bucket is broken. You'll need to tweak that.

Do you know how to programmatically grab the current-context projectId from either firebase or googe-cloud objects? It seems to me that if I'm authed on my local machine my sdk should be able to know the context it's running in. Or do I have to have a config file?? Thanks @stephenplusplus

stephenplusplus commented 7 years ago

I did some digging and found a post on StackOverflow from someone who ran into this (cc @DrPaulBrewer). He opened an issue on the official issue tracker, "cloud storage frequently throws ECONNRESET, read or write, from cloud functions environment." It seems the community has found a couple solutions:

@inlined @DrPaulBrewer @hayanmind can you check your apps using the newest @google-cloud/storage official module? Theoretically, these issues should be fixed, since we released the logic that retries on ECONNRESET errors yesterday. Also, any other information that could help us fix this once and for all would be very welcome.

As a side note, you should be able to set forever: false without forking:

var gcs = require('@google-cloud/storage')(...)
gcs.interceptors.push({
  request: function(reqOpts) {
    reqOpts.forever = false
    return reqOpts
  }
})

@s2young thank you for that gist -- I set it up and ran it. I've been doing this for the last several hours, waiting anywhere from 3, 5, 10, 25, and 60 minutes between events and monitoring the logs. Still, everything is working consistently for me.

Do you know how to programmatically grab the current-context projectId from either firebase or googe-cloud objects?

The libraries that I help maintain are the @google-cloud/* ones, so in this case @google-cloud/storage. With that library, if we're not given a projectId argument explicitly, we will use the getProjectId() function from google-auto-auth internally:

var authClient = require('google-auto-auth')()
authClient.getProjectId(function(err, projectId) {})

That will pick it up from the GCLOUD_PROJECT environment variable, the gcloud SDK, a service account JSON key file, and Google Cloud environments (App Engine, Compute Engine, Cloud Functions).

bogacg commented 7 years ago

@s2young I'm doing exact same steps (except augmenting metadata) in my app, and it ate my brain for a week thinking why using async/await causes this problem. After changing back my code to promises I kept getting error again and then I've figured it wasn't my code but Cloud Functions'.

@stephenplusplus as I stated above, my code is doing exactly same things w. @s2young's and after seeing your reply, I've deleted node_modules, npm install, recompiled (typescript/webpack) and deployed my functions. Not getting errors so far. I did get error again (after few tries within 5-7 minutes). So I'll try forever: false solution now.

However I didn't see any version number change in google-cloud-node installed from last time I was getting errors (yesterday), so is the change you made internal?

Update

After forever: false applied, I didn't get any errors, will update here if anything happens.

stephenplusplus commented 7 years ago

However I didn't see any version number change in google-cloud-node installed from last time I was getting errors (yesterday), so is the change you made internal?

Yes, I added the retry solution in a dependency, retry-request, so new and re-installations of @google-cloud/storage will pick up the change. The all-in-one package, google-cloud will not have it until our next release.

Thanks for trying these solutions out. Please keep letting us know the results you get.

stephenplusplus commented 7 years ago

If everyone can confirm:

we can automatically set forever to false when we see that we're in the Cloud Functions environment.

s2young commented 7 years ago

So far, that is true for me. I haven't experienced any further ECONNRESETs since making the forever:false change. Thanks so much!

stephenplusplus commented 7 years ago

We have disabled the forever agent by default in Cloud Function environments. If you un- and re-install @google-cloud/storage, you will pick up the new behavior automatically. Thanks for all of the helpful debugging, everyone!

DrPaulBrewer commented 7 years ago

@stephenplusplus Thanks for your efforts. Other tasks here took priority. I updated my SO posting to reflect the above changes but have not had time to test.

bobocandys commented 6 years ago

I am seeing ECONNRESET in my Google Cloud Functions now. I am using @google-cloud/bigquery: 1.3.0, @google-cloud/storage: 1.7.0

Unfortunately, even explicitly setting the reqOpts.forever = false does not solve the issue for me. Any new things changed this behavior?

atlanteh commented 6 years ago

Could this be due to the fact that network and CPU are throttled down to 0 between function invocations, so the connection is lost on the new invocation?

timscottbell commented 2 years ago

I get this error 100% of the time when trying to upload files in parallel. Adding

storage.interceptors.push({
  request(reqOpts: DecorateRequestOptions) {
    reqOpts.forever = false;
    return reqOpts;
  },
});

did nothing to resolve the issue. Our use case is using the client library to upload files in parallel, a fairly straightforward use case.