Open koivunej opened 1 year ago
Later @ololobus noticed that the e2e tests, which probably get more (or unlimited) time, had failed with:
```
pageserver1_1 | 2023-04-28T17:37:54.575417Z WARN remote_upload{tenant=2f9c73a0c5e9868bc4de5556612246eb timeline=266eff86b905e5c58cabb38efc6c7ece upload_task_id=1}: failed to perform remote task UploadLayer(000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000000149F078-000000000149F0F1, size=22413312), will retry (attempt 5): Failed to upload a layer from local path '/data/tenants/2f9c73a0c5e9868bc4de5556612246eb/timelines/266eff86b905e5c58cabb38efc6c7ece/000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000000149F078-000000000149F0F1'
pageserver1_1 |
pageserver1_1 | Caused by:
pageserver1_1 |     0: io error: error trying to connect: dns error: failed to lookup address information: Temporary failure in name resolution
pageserver1_1 |     1: io error: error trying to connect: dns error: failed to lookup address information: Temporary failure in name resolution
pageserver1_1 |     2: error trying to connect: dns error: failed to lookup address information: Temporary failure in name resolution
pageserver1_1 |     3: dns error: failed to lookup address information: Temporary failure in name resolution
pageserver1_1 |     4: failed to lookup address information: Temporary failure in name resolution
```
So most likely our tests just have timeouts that are too small to reveal the actual cause. One more reason to have at-drop await point information on futures.
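The `will retry (attempt 5)` messages in the log above come from a bounded-retry-with-backoff pattern. A minimal Python sketch of that pattern (a hypothetical illustration only, not the pageserver's actual Rust implementation):

```python
import time


def upload_with_retries(upload, max_attempts=5, base_delay=0.5):
    """Retry a flaky operation with exponential backoff.

    A transient error such as "Temporary failure in name resolution"
    only becomes a hard failure once every attempt is exhausted, which
    is why a too-short test timeout can hide the real cause: the test
    gives up while the uploader is still inside its retry budget.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return upload()
        except OSError as e:
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"failed to upload, will retry (attempt {attempt}): {e}")
            time.sleep(delay)
```

If the retry budget outlasts the DNS glitch, the upload eventually succeeds and the earlier attempts only show up as WARN lines like the ones above.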
From @bayandin's gist, this is clearly a slow `wait_for_upload`:
```
test_runner/regress/test_tenants_with_remote_storage.py:110: in test_tenants_many
    wait_for_upload(pageserver_http, tenant_id, timeline_id, current_lsn)
test_runner/fixtures/pageserver/utils.py:62: in wait_for_upload
    raise Exception(
E   Exception: timed out while waiting for remote_consistent_lsn to reach 0/21EFAE8, was 0/1C3FAD8
```
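For reference, a wait helper like that is just a poll-until-deadline loop; a minimal sketch of the shape (hypothetical, not the actual `wait_for_upload` fixture code):

```python
import time


def wait_until(condition, timeout=20.0, interval=0.5, desc="condition"):
    """Poll `condition` until it returns truthy or the deadline passes.

    Raises an Exception similar to the one in the traceback above
    when the deadline is exceeded.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(interval)
    raise Exception(f"timed out while waiting for {desc}")
```

Any fixed `timeout` here encodes an assumption about how fast uploads complete, which is exactly what breaks when retries (DNS failures, cross-region latency) stretch the upload beyond the budget.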
Closed:
They are not actionable right now. We've been discussing with @LizardWizzard that we could have different kinds of timeouts for different things, but no agreement has been reached.
I'll add more issues to the list.
There were several workflow attempts on my and others' PRs, for example on #4062 and #4064.
While looking for the root cause, it seemed that the manual checkpoint was stuck for 5 minutes on https://github.com/neondatabase/neon/blob/fa20e3757432a0b900f33a89441f7fee02fc06c9/pageserver/src/tenant/timeline.rs#L3334-L3340
Reasoning:
```
2023-04-25T09:51:38.729479Z INFO request{method=PUT path=/v1/tenant/X/timeline/Y/checkpoint request_id=RANDOM}:manual_checkpoint{tenant_id=X timeline_id=Y}: compact includes 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000000185CB19-0000000001934801
```
(in the allure report)

While searching for how to run the tests locally, or even enable `RUST_LOG=pageserver=debug` on the particular test, the issue resolved itself. What remains is that the usual s3 uploads were not completed within the 10s graceful shutdown.

Follow-up ideas:
- easy way to run the `real_s3` tests locally
- `us-west-2`
As far as an "easy way to run the `real_s3` tests locally", this simply needs to be enabled. I got as far as extracting temporary SSO credentials, but mistakenly thought I needed them at `us-west-2` to have access to the repo. We could most likely automate all or most of this process. @SomeoneToIgnore also suggested a patch that would make the rust side use AWS SSO credentials (slack), but I never got to use it because of all of the pythonic checks in: https://github.com/neondatabase/neon/blob/fa20e3757432a0b900f33a89441f7fee02fc06c9/test_runner/fixtures/neon_fixtures.py#L704-L721

Logging: we don't have any upload timeout, so according to the log messages seen, no single task ever failed; a failure would have been an `info!` level message.

Tests run from Europe use s3 buckets in `us-west-2`. Without this, we might be fine with the 10s graceful timeouts, so perhaps this is a good configuration for now.
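The 10s graceful-shutdown behavior discussed above amounts to bounding how long we wait for in-flight uploads before giving up on them. A minimal asyncio sketch of that idea (hypothetical illustration; the pageserver itself is Rust, and the names here are invented):

```python
import asyncio


async def graceful_shutdown(pending_uploads, timeout=10.0):
    """Wait up to `timeout` seconds for in-flight upload tasks.

    Returns (done, pending): uploads that did not finish in time stay
    pending and would be re-scheduled on the next startup, mirroring
    the "s3 uploads were not completed in 10s graceful shutdown"
    behavior described above.
    """
    done, pending = await asyncio.wait(pending_uploads, timeout=timeout)
    return done, pending
```

With cross-region latency (tests in Europe, buckets in `us-west-2`) more tasks land in the `pending` set, so the remote_consistent_lsn the test waits for never advances to the expected value.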