fix(pageserver): preempt and retry azure list operation

neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.

Apache License 2.0


Problem

Looking at the Azure SDK, the only related issue I can find is https://github.com/azure/azure-sdk-for-rust/issues/1549. The SDK uses reqwest as its HTTP backend, so I assume something in the underlying stack that we don't understand might have caused the hang in #9836.

The observation is:

We didn't get an explicit out-of-resources HTTP error from Azure.

The connection simply gets stuck and times out.

But when we retry after we reach the timeout, it succeeds.

This issue is hard to pin down: something may have gone wrong on the ABS (Azure Blob Storage) side, or on our side. But we know that a retry will usually succeed once we give up the stuck connection.

Therefore, I propose that we preempt stuck HTTP operations and actively retry them. This mitigates the problem; in the long run, we need to keep an eye on ABS usage and see whether we can fully resolve it.

The reasoning behind this timeout mechanism: we use a much smaller timeout than before in order to preempt stuck requests, but a normal listing operation that returns a lot of keys may legitimately take longer than the initial timeout. Therefore, each time we terminate the connection, we double the timeout, so that such requests eventually succeed.
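As a rough illustration of the preempt-and-double idea, here is a minimal sketch using only the Rust standard library. It is not the code in this PR: the helper name `retry_with_doubling_timeout`, its parameters, and the cap `max_timeout` are all hypothetical, and it uses threads and a channel where the real pageserver code is async. One caveat of this simplification is that an abandoned thread keeps running in the background, whereas dropping an async request actually releases the stuck connection.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical helper (not from the PR): runs `op` up to `max_attempts` times.
/// An attempt that does not finish within the current timeout is abandoned,
/// and the timeout is doubled (capped at `max_timeout`) before the next try.
/// `op` receives the zero-based attempt number.
fn retry_with_doubling_timeout<T, F>(
    mut timeout: Duration,
    max_timeout: Duration,
    max_attempts: u32,
    op: F,
) -> Option<T>
where
    T: Send + 'static,
    F: Fn(u32) -> T + Clone + Send + 'static,
{
    for attempt in 0..max_attempts {
        let (tx, rx) = mpsc::channel();
        let op = op.clone();
        // Run the attempt on its own thread so the caller can stop waiting on it.
        thread::spawn(move || {
            // The send fails if the caller has already given up; ignore that.
            let _ = tx.send(op(attempt));
        });
        match rx.recv_timeout(timeout) {
            Ok(value) => return Some(value),
            Err(_) => {
                // Stuck (or the worker died): give up on this attempt and
                // allow the next one twice as long, so that a legitimately
                // slow listing can still complete eventually.
                timeout = (timeout * 2).min(max_timeout);
            }
        }
    }
    None
}
```

For example, with an initial timeout of 100 ms, an operation that hangs on its first attempt but returns promptly on the second would be abandoned once and then succeed under the doubled 200 ms budget; an operation that never finishes yields `None` after `max_attempts` tries.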

5587 tests run: 5360 passed, 1 failed, 226 skipped

Failures on Postgres 17

test_unlogged: debug-x86-64

# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_unlogged[debug-pg17]"

Flaky tests (6)

Postgres 17

test_pg_regress[None]: release-arm64
test_tenant_import[None-local_fs]: debug-x86-64
test_unlogged: debug-x86-64

Postgres 16
test_compute_pageserver_connection_stress: release-x86-64

Postgres 15
test_pull_timeline[True]: release-arm64

Postgres 14
test_pull_timeline[True]: release-x86-64

Test coverage report is not available

The comment gets automatically updated with the latest test results: 4c8670e6c5be0f204fb1e8e9f7b253fd323b7579 at 2024-11-21T23:16:36.117Z
