neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

remote_storage: Azure client doesn't tolerate server timeouts well #9836

Open jcsp opened 1 day ago

jcsp commented 1 day ago

via https://neondb.slack.com/archives/C081W75HSE7/p1732199510578199

Our requests appear to be intermittently hanging, and we're not handling it well

remote storage request timed out after 2m: list identifiers in prefix tenants/295317b97da40c627becd3a91a2e6106/timelines/ failed, will retry (attempt 0): timeout

skyzh commented 23 hours ago

Looking at the log, we got stuck for 2 minutes, and the second retry of the same operation succeeded immediately. This likely indicates that we hit some weird limit on the Azure side.

skyzh commented 23 hours ago

And if the second retry succeeded immediately, why didn't the first request go through?

skyzh commented 23 hours ago

So either this is a bug in our implementation or the blob client, or we need to handle this situation by actively retrying.
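
To illustrate the "actively retry" option, here is a minimal sketch, written in Python rather than the Rust used by remote_storage, and not taken from the Neon codebase. It assumes the hang affects a single request, so a per-attempt timeout much shorter than 2 minutes lets a fresh attempt go through sooner. `retry_with_timeout` and `make_flaky_list` are hypothetical names, and the timeout and backoff values are placeholders, not tuned recommendations.

```python
import asyncio


async def retry_with_timeout(op, *, attempts=3, per_attempt_timeout=1.0, backoff=0.0):
    """Call the coroutine factory `op` with a per-attempt timeout.

    A timed-out attempt is cancelled and a new one is started, up to
    `attempts` tries in total. Hypothetical helper for illustration only.
    """
    last_exc = None
    for attempt in range(attempts):
        try:
            # wait_for cancels the in-flight attempt when the timeout expires.
            return await asyncio.wait_for(op(), timeout=per_attempt_timeout)
        except asyncio.TimeoutError as exc:
            last_exc = exc
            if attempt + 1 < attempts:
                # Exponential backoff between attempts (zero by default).
                await asyncio.sleep(backoff * (2 ** attempt))
    raise TimeoutError(f"gave up after {attempts} attempts") from last_exc


def make_flaky_list():
    """Simulate the reported behavior: the first list call hangs, later calls return."""
    calls = {"n": 0}

    async def list_prefix():
        calls["n"] += 1
        if calls["n"] == 1:
            await asyncio.sleep(3600)  # stand-in for a hung request
        return ["timeline-a", "timeline-b"]

    return list_prefix, calls


if __name__ == "__main__":
    op, calls = make_flaky_list()
    listing = asyncio.run(retry_with_timeout(op, per_attempt_timeout=0.05))
    print(listing, "after", calls["n"], "attempts")
```

The trade-off is that a short per-attempt timeout can also cut off a listing that is slow but making progress, so the value would have to be chosen against observed latencies for large prefixes.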

Bodobolero commented 22 hours ago

It would be helpful if you could provide playbook-like instructions on how to mitigate this problem until this issue is resolved.

skyzh commented 22 hours ago

I think we just need to wait: the current timeout for the list operation is 2 minutes, and I believe the stuck-project operation is also configured at around 2 minutes. That probably explains why, by the time people tag NeonBot, the stuck projects are already gone: the operation gets retried at exactly that moment and succeeds.