Closed sidsethupathi closed 2 months ago
Hi @sidsethupathi by chance are you running on GKE? Any particular size of data that you see this issue more with? How often are you seeing this issue when calling save?
Possibly related: https://github.com/googleapis/nodejs-storage/issues/2367
@ddelgrosso1 no, not running in GKE. The process is run in a container but on a container optimized GCE instance.
Also, no real pattern on size of data that triggers the issue. The files we upload are between 500KB-3MB and have seen this hang happen across that range. The frequency of our uploads are also pretty small, we upload 2 files every 5 seconds in the process that is running into this issue. We do typically run ~10-15 process concurrently (all on different VMs) but they write to the same output bucket. The processes do not run 24/7 but rather for a couple hours and then stop.
Looking at our logs since we've wrapped the File.save
call inside our own timeout, in the past week we've seen 6 instances of File.save
calls timing out after 20 seconds (usually the upload is extremely fast so we felt comfortable with the 20 second time). We made an estimated ~360,000 File.save
calls in that same time period. Between May 19th and May 31st, 2024, we made ~650,000 File.save
calls and saw none those requests timed out. We made no code changes during that time period.
@sidsethupathi going to add the same comments here as I gave to the rep on the internal issue.
My guess is that a network issue is causing a connection severance during the upload process. Now obviously this shouldn't result in a promise hanging. We have made several bug fixes related to hanging streams since 6.9.3
and I noted an issue above that sounds similar.
My advice would be the following (if possible):
resumable: false
. We default to using resumable uploads and there is additional overhead associated with doing such. For a small upload a single shot upload might be a better choice.Thanks @ddelgrosso1, we've done 1 and 2 and are testing that in our dev environment.
We'll take a look at 3. We have an external "retry" right now based on the Promise rejecting or the Promise hanging indefinitely, so we may keep that around until we confirm that the we aren't seeing this issue anymore.
@sidsethupathi if you still see this issue after upgrading and going to resumable false, any logs or recreation steps would be greatly appreciated to help me debug further.
@sidsethupathi I wanted to circle back and see how the testing had been going in dev? Has the issue subsided?
@ddelgrosso1 in our dev environment, we a couple instances of the timeout on June 7th and June 8th (2024) which was after we deployed our change that bumped the library up to 7.11.1
and set resumable: false
. Since June 8th, we have not seen any instances of the timeout.
In our production environment (that is still running the version of code in the original post of this issue), the last time we saw a daily occurrence of the issue was June 9th. We then saw one more occurrence on June 13th.
The frequency of the issue is so low that it's difficult to saw what impact upgrading the version and setting resumable: false
may have had, but I think the June 7th and 8th occurrences running on that new version of code indicate that it was not the ultimate fix. I also still believe there might be something environmental happening within GCP that is the root cause of the issue, and it's possible that root cause started ~May 17th, stopped ~May 19th, started again ~June 4th, and then stopped again ~June 9th.
In any case, our issue is mitigated with our external timeout logic so I'll defer to your judgement whether you want to keep this issue open. If there's anything we can instrument in our dev environment to try to gather more data for you, please let me know!
I'll leave this open in case any further debugging info becomes available in the near future. I am however going to drop the priority down to p3.
Going to close this out. If more info does become available, please feel free to reopen or open a new issue.
@google-cloud/storage
version: 6.9.3Steps to reproduce
This is a transient issue we are seeing so unsure how to reliably reproduce.
File.save
with no additional optionsOur code path that hits this issue sets no extra options (specifically we do not turn
resumable: true
and do not change thetimeout
from the default 60000). I would have expected the Promise to reject after 60 seconds based on those options, but we don't see it ever reject, so it appears whatever condition triggers the 60 second timeout is not being met.The frequency we see this issue makes it feel environment to GCP so I have already submitted a Support case (
51576880
for any Google folks). Opening this GH issue in case this is an edge case in the Promise handling, or in case other people are seeing something similar. We started seeing the issue May 17th, 2024 - May 19th, 2024, then had a quiet period, and then started seeing the issue again May 31st, 2024 to now. Between May 17th and today, we haven't made any code changes.