Open skyzh opened 1 day ago
test_unlogged: debug-x86-64
# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_unlogged[debug-pg17]"
test_pg_regress[None]: release-arm64
test_tenant_import[None-local_fs]: debug-x86-64
test_unlogged: debug-x86-64
test_compute_pageserver_connection_stress: release-x86-64
test_pull_timeline[True]: release-arm64
test_pull_timeline[True]: release-x86-64
Problem
Closes https://github.com/neondatabase/neon/issues/9836
Looking at the Azure SDK, the only related issue I can find is https://github.com/azure/azure-sdk-for-rust/issues/1549. The Azure SDK uses reqwest as its HTTP backend, so I assume some underlying behavior we don't understand might have caused the hang in #9836.
The observation is:
This issue is hard to pin down: something may have gone wrong on the ABS side, or on our side. But we know that a retry usually succeeds once we give up the stuck connection.
Therefore, I propose that we preempt stuck HTTP operations and actively retry them. This mitigates the problem; in the long run, we need to keep an eye on ABS usage and see whether we can fully resolve it.
The reasoning behind this timeout mechanism: we preempt with a much smaller timeout than before, but a normal listing operation that returns a lot of keys can legitimately take longer than the initial timeout. Therefore, each time we terminate a connection, we double the timeout for the next attempt, so that such requests eventually succeed.
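To illustrate the idea, here is a minimal std-only sketch of "preempt, retry, and double the timeout". It is not the code in this PR: `retry_with_doubling_timeout`, its parameters, and the use of a worker thread plus a channel are assumptions made for illustration. Async code would more likely wrap the request future in a timeout and drop it on expiry.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical helper (not part of this PR): run `op` with a per-attempt
/// timeout, doubling the timeout after every attempt that does not finish
/// in time. `op` receives the 0-based attempt number.
/// Returns `None` if no attempt completes within `max_attempts`.
fn retry_with_doubling_timeout<T, F>(
    op: F,
    initial_timeout: Duration,
    max_attempts: u32,
) -> Option<T>
where
    T: Send + 'static,
    F: Fn(u32) -> T + Clone + Send + 'static,
{
    let mut timeout = initial_timeout;
    for attempt in 0..max_attempts {
        let (tx, rx) = mpsc::channel();
        let op = op.clone();
        // Run the operation on a worker thread so the caller can give up on it.
        // A plain thread cannot be cancelled; an abandoned attempt keeps
        // running in the background, which is acceptable only in a sketch.
        thread::spawn(move || {
            // The send fails if the caller already gave up; ignore that.
            let _ = tx.send(op(attempt));
        });
        match rx.recv_timeout(timeout) {
            Ok(value) => return Some(value),
            // Timed out (or the worker died): retry with twice the budget,
            // so a slow but healthy request eventually fits.
            Err(_) => timeout = timeout.saturating_mul(2),
        }
    }
    None
}
```

With an initial timeout of 50 ms, an operation that always needs about 150 ms is abandoned on the first two attempts (50 ms, 100 ms) and succeeds on the third (200 ms), which is the behavior the doubling is meant to guarantee for large listings.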
Summary of changes