googleapis / nodejs-storage

Node.js client for Google Cloud Storage: unified object storage for developers and enterprises, from live data serving to data analytics/ML to data archiving.
https://cloud.google.com/storage/
Apache License 2.0

File.save sometimes returns a Promise that never rejects or resolves #2476

Closed sidsethupathi closed 2 months ago

sidsethupathi commented 4 months ago

Steps to reproduce

This is a transient issue we are seeing, so we are unsure how to reliably reproduce it.

  1. Call File.save with no additional options
  2. Sometimes observe that the promise never resolves or rejects; it just hangs forever.

Our code path that hits this issue sets no extra options (specifically, we do not set resumable and do not change the timeout from the default 60000). I would have expected the Promise to reject after 60 seconds based on those defaults, but we never see it reject, so it appears whatever condition triggers the 60-second timeout is not being met.
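For reference, here is a minimal sketch of the call pattern described above (bucket and object names are hypothetical; no options are passed to save, so the library defaults apply):

```js
// Minimal sketch of the reported call pattern; bucket/object names are
// hypothetical. No options are passed to save(), so the defaults apply
// (resumable upload, 60000 ms timeout).
const {Storage} = require('@google-cloud/storage');

const storage = new Storage();

async function upload(contents) {
  const file = storage.bucket('example-output-bucket').file('example-object.json');
  // Expected: resolve on success or reject after the default timeout.
  // Observed (intermittently): the promise never settles.
  await file.save(contents);
}
```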

The frequency at which we see this issue makes it feel environmental to GCP, so I have already submitted a Support case (51576880 for any Google folks). I'm opening this GH issue in case this is an edge case in the Promise handling, or in case other people are seeing something similar. We saw the issue from May 17th, 2024 to May 19th, 2024, then had a quiet period, and then started seeing it again from May 31st, 2024 to now. Between May 17th and today, we haven't made any code changes.

ddelgrosso1 commented 4 months ago

Hi @sidsethupathi by chance are you running on GKE? Any particular size of data that you see this issue more with? How often are you seeing this issue when calling save?

ddelgrosso1 commented 3 months ago

Possibly related: https://github.com/googleapis/nodejs-storage/issues/2367

sidsethupathi commented 3 months ago

@ddelgrosso1 no, not running on GKE. The process runs in a container, but on a container-optimized GCE instance.

Also, no real pattern in the size of data that triggers the issue. The files we upload are between 500KB-3MB and we have seen this hang happen across that range. The frequency of our uploads is also pretty low: we upload 2 files every 5 seconds in the process that is running into this issue. We do typically run ~10-15 processes concurrently (all on different VMs), but they write to the same output bucket. The processes do not run 24/7, but rather for a couple of hours and then stop.

Looking at our logs since we've wrapped the File.save call inside our own timeout, in the past week we've seen 6 instances of File.save calls timing out after 20 seconds (usually the upload is extremely fast, so we felt comfortable with the 20-second limit). We made an estimated ~360,000 File.save calls in that same period. Between May 19th and May 31st, 2024, we made ~650,000 File.save calls and saw none of those requests time out. We made no code changes during that time period.
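For context, a sketch of what that timeout wrapper might look like (the helper name is hypothetical; the 20-second default matches the limit mentioned above):

```js
// Sketch of an external timeout around File.save (helper name is hypothetical,
// not the library's API). Rejects if the save promise has not settled in time.
function saveWithTimeout(file, contents, ms = 20000) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`File.save did not settle within ${ms} ms`)),
      ms
    );
  });
  return Promise.race([file.save(contents), timeout]).finally(() => clearTimeout(timer));
}
```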

ddelgrosso1 commented 3 months ago

@sidsethupathi going to add the same comments here as I gave to the rep on the internal issue.

My guess is that a network issue is causing a connection severance during the upload process. Now obviously this shouldn't result in a promise hanging. We have made several bug fixes related to hanging streams since 6.9.3 and I noted an issue above that sounds similar.

My advice would be the following (if possible):

  1. Upgrade the library to a more recent version.
  2. If the files are small (< 10 MiB), try setting resumable: false. We default to using resumable uploads and there is additional overhead associated with doing so. For a small upload, a single-shot upload might be a better choice.
  3. Configure retries. Uploads are conditionally idempotent and won't be retried by default (https://cloud.google.com/storage/docs/retry-strategy). Obviously this won't help if the upload hangs, but I think this should be combined with number 2 (see the sketch after this list).
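Roughly, 2 and 3 together would look something like the sketch below (bucket/object names are hypothetical; this assumes the retryOptions/IdempotencyStrategy surface available in recent library versions):

```js
// Sketch of suggestions 2 and 3 combined. resumable: false switches small
// uploads to a single-shot request; retryOptions force retries even though
// uploads are only conditionally idempotent.
const {Storage, IdempotencyStrategy} = require('@google-cloud/storage');

const storage = new Storage({
  retryOptions: {
    // Retry regardless of idempotency; consider RetryConditional with
    // preconditions if blind re-uploads are not acceptable.
    idempotencyStrategy: IdempotencyStrategy.RetryAlways,
  },
});

async function uploadSmallFile(contents) {
  await storage
    .bucket('example-output-bucket')   // hypothetical bucket
    .file('example-object.json')       // hypothetical object
    .save(contents, {resumable: false});
}
```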

sidsethupathi commented 3 months ago

Thanks @ddelgrosso1, we've done 1 and 2 and are testing that in our dev environment.

We'll take a look at 3. We have an external "retry" right now based on the Promise rejecting or the Promise hanging indefinitely, so we may keep that around until we confirm that we aren't seeing this issue anymore.
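Conceptually, that external retry looks something like the following sketch (the helper name, attempt count, and timeout are illustrative, not our exact code):

```js
// Sketch of an external retry around File.save: retry when the promise
// rejects or when it hangs past a caller-supplied timeout.
async function saveWithRetry(file, contents, {attempts = 3, timeoutMs = 20000} = {}) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    let timer;
    try {
      await Promise.race([
        file.save(contents, {resumable: false}),
        new Promise((_, reject) => {
          timer = setTimeout(() => reject(new Error('File.save hung')), timeoutMs);
        }),
      ]);
      return; // success
    } catch (err) {
      if (attempt === attempts) throw err; // out of attempts, surface the error
    } finally {
      clearTimeout(timer);
    }
  }
}
```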

ddelgrosso1 commented 3 months ago

@sidsethupathi if you still see this issue after upgrading and going to resumable false, any logs or recreation steps would be greatly appreciated to help me debug further.

ddelgrosso1 commented 3 months ago

@sidsethupathi I wanted to circle back and see how the testing had been going in dev? Has the issue subsided?

sidsethupathi commented 3 months ago

@ddelgrosso1 in our dev environment, we saw a couple of instances of the timeout on June 7th and June 8th (2024), which was after we deployed our change that bumped the library up to 7.11.1 and set resumable: false. Since June 8th, we have not seen any instances of the timeout.

In our production environment (that is still running the version of code in the original post of this issue), the last time we saw a daily occurrence of the issue was June 9th. We then saw one more occurrence on June 13th.

The frequency of the issue is so low that it's difficult to say what impact upgrading the version and setting resumable: false may have had, but I think the June 7th and 8th occurrences on the new version of the code indicate that it was not the ultimate fix. I also still believe there might be something environmental happening within GCP that is the root cause of the issue, and it's possible that the root cause started ~May 17th, stopped ~May 19th, started again ~June 4th, and then stopped again ~June 9th.

In any case, our issue is mitigated with our external timeout logic so I'll defer to your judgement whether you want to keep this issue open. If there's anything we can instrument in our dev environment to try to gather more data for you, please let me know!

ddelgrosso1 commented 3 months ago

I'll leave this open in case any further debugging info becomes available in the near future. I am however going to drop the priority down to p3.

ddelgrosso1 commented 2 months ago

Going to close this out. If more info does become available, please feel free to reopen or open a new issue.