neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

fix(pageserver): preempt and retry azure list operation #9840

Open skyzh opened 1 day ago

skyzh commented 1 day ago

Problem

close https://github.com/neondatabase/neon/issues/9836

Looking at the Azure SDK, the only related issue I can find is https://github.com/azure/azure-sdk-for-rust/issues/1549. The Azure SDK uses reqwest as its HTTP backend, so I assume some underlying behavior unknown to us might have caused the stuck operation in #9836.

The observation is:

This issue is hard to diagnose: something may have gone wrong on the ABS (Azure Blob Storage) side, or on our side. What we do know is that a retry usually succeeds once we give up the stuck connection.

Therefore, I propose that we preempt stuck HTTP operations and actively retry them. This mitigates the problem; in the long run, we need to keep an eye on ABS usage and see if we can fully resolve it.

The reasoning behind this timeout mechanism: we preempt with a much smaller timeout than before, but a normal listing operation that covers a lot of keys may legitimately take longer than the initial timeout. Therefore, each time we terminate the connection, we double the timeout for the next attempt, so that such requests eventually succeed.
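To make the timeout-doubling reasoning concrete, here is a minimal sketch of the idea using only the Rust standard library. It is not the code in this PR: the names `run_with_timeout` and `retry_with_doubling_timeout`, the `max_attempts` parameter, and the use of helper threads in place of an async runtime are all assumptions made for illustration.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical helper: runs `op` on a helper thread and stops waiting
/// after `timeout`. Returns `None` if the attempt did not finish in time.
/// Note: the abandoned thread keeps running in the background; an async
/// implementation would drop the future (and its connection) instead.
fn run_with_timeout<T, F>(op: F, timeout: Duration) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone if we timed out; ignore that error.
        let _ = tx.send(op());
    });
    rx.recv_timeout(timeout).ok()
}

/// Hypothetical helper: starts a fresh attempt via `make_op` each round,
/// preempts it after the current timeout, and doubles the timeout before
/// the next round. Returns the value and the attempt number that succeeded,
/// or `None` once `max_attempts` rounds have all been preempted.
fn retry_with_doubling_timeout<T, F, G>(
    mut make_op: G,
    initial_timeout: Duration,
    max_attempts: u32,
) -> Option<(T, u32)>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
    G: FnMut() -> F,
{
    let mut timeout = initial_timeout;
    for attempt in 1..=max_attempts {
        if let Some(value) = run_with_timeout(make_op(), timeout) {
            return Some((value, attempt));
        }
        // Preempted: a large listing may simply need more time, so allow
        // twice as long on the next attempt.
        timeout = timeout.saturating_mul(2);
    }
    None
}
```

With an operation that needs about 300 ms and an initial timeout of 100 ms, the attempts run with timeouts of 100 ms, 200 ms, and 400 ms, so the third attempt is the first that can complete. A short initial timeout keeps the cost of a genuinely stuck connection low, while doubling ensures that slow but healthy listings are not preempted forever.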

Summary of changes

github-actions[bot] commented 23 hours ago

5587 tests run: 5360 passed, 1 failed, 226 skipped (full report)


Failures on Postgres 17

Postgres 17

Test coverage report is not available

The comment gets automatically updated with the latest test results
4c8670e6c5be0f204fb1e8e9f7b253fd323b7579 at 2024-11-21T23:16:36.117Z :recycle: