Open koivunej opened 1 year ago
Later @ololobus noticed that the e2e tests, which probably get more (or unlimited) time, had failed with:
```
pageserver1_1 | 2023-04-28T17:37:54.575417Z WARN remote_upload{tenant=2f9c73a0c5e9868bc4de5556612246eb timeline=266eff86b905e5c58cabb38efc6c7ece upload_task_id=1}: failed to perform remote task UploadLayer(000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000000149F078-000000000149F0F1, size=22413312), will retry (attempt 5): Failed to upload a layer from local path '/data/tenants/2f9c73a0c5e9868bc4de5556612246eb/timelines/266eff86b905e5c58cabb38efc6c7ece/000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000000149F078-000000000149F0F1'
pageserver1_1 |
pageserver1_1 | Caused by:
pageserver1_1 |     0: io error: error trying to connect: dns error: failed to lookup address information: Temporary failure in name resolution
pageserver1_1 |     1: io error: error trying to connect: dns error: failed to lookup address information: Temporary failure in name resolution
pageserver1_1 |     2: error trying to connect: dns error: failed to lookup address information: Temporary failure in name resolution
pageserver1_1 |     3: dns error: failed to lookup address information: Temporary failure in name resolution
pageserver1_1 |     4: failed to lookup address information: Temporary failure in name resolution
```
So most likely our tests just have timeouts that are too small to reveal the actual cause. One more reason to have at-drop await point information on futures.
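The `will retry (attempt 5)` messages in the log above come from a bounded-retry-with-backoff pattern. A minimal Python sketch of that pattern (a hypothetical illustration only, not the pageserver's actual Rust implementation):

```python
import time


def upload_with_retries(upload, max_attempts=5, base_delay=0.5):
    """Retry a flaky operation with exponential backoff.

    A transient error such as "Temporary failure in name resolution"
    only becomes a hard failure once every attempt is exhausted, which
    is why a too-short test timeout can hide the real cause: the test
    gives up while the uploader is still inside its retry budget.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return upload()
        except OSError as e:
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"failed to upload, will retry (attempt {attempt}): {e}")
            time.sleep(delay)
```

If the retry budget outlasts the DNS glitch, the upload eventually succeeds and the earlier attempts only show up as WARN lines like the ones above.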
From @bayandin's gist, this is clearly a slow `wait_for_upload`:
```
test_runner/regress/test_tenants_with_remote_storage.py:110: in test_tenants_many
    wait_for_upload(pageserver_http, tenant_id, timeline_id, current_lsn)
test_runner/fixtures/pageserver/utils.py:62: in wait_for_upload
    raise Exception(
E   Exception: timed out while waiting for remote_consistent_lsn to reach 0/21EFAE8, was 0/1C3FAD8
```
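For reference, a wait helper like that is just a poll-until-deadline loop; a minimal sketch of the shape (hypothetical, not the actual `wait_for_upload` fixture code):

```python
import time


def wait_until(condition, timeout=20.0, interval=0.5, desc="condition"):
    """Poll `condition` until it returns truthy or the deadline passes.

    Raises an Exception similar to the one in the traceback above
    when the deadline is exceeded.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(interval)
    raise Exception(f"timed out while waiting for {desc}")
```

Any fixed `timeout` here encodes an assumption about how fast uploads complete, which is exactly what breaks when retries (DNS failures, cross-region latency) stretch the upload beyond the budget.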
Closed:
They are not actionable right now. We've been discussing with @LizardWizzard that we could have different kinds of timeouts for different things, but no agreement has been reached.
I'll add more issues to the list.
There were several workflow attempts on my and others' PRs, for example on #4062 and #4064.
While looking for the root cause, it seemed that the manual checkpoint was stuck for 5 minutes on https://github.com/neondatabase/neon/blob/fa20e3757432a0b900f33a89441f7fee02fc06c9/pageserver/src/tenant/timeline.rs#L3334-L3340
Reasoning:
```
2023-04-25T09:51:38.729479Z INFO request{method=PUT path=/v1/tenant/X/timeline/Y/checkpoint request_id=RANDOM}:manual_checkpoint{tenant_id=X timeline_id=Y}: compact includes 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000000185CB19-0000000001934801
```
(in the allure report)

While searching for how to run the tests locally, or even enable `RUST_LOG=pageserver=debug` on the particular test, the issue resolved itself. What remains is that the usual s3 uploads were not completed within the 10s graceful shutdown.

Follow-up ideas:
- easy way to run the `real_s3` tests locally
- `us-west-2`
As far as an "easy way to run the `real_s3` tests locally", this simply needs to be enabled. I got as far as extracting temporary SSO credentials, but mistakenly thought I needed them at `us-west-2` to have access to the repo. We could most likely automate all or most of this process. @SomeoneToIgnore also suggested a patch that would make the rust side use AWS SSO credentials (slack), but I never got to use it because of all of the pythonic checks in: https://github.com/neondatabase/neon/blob/fa20e3757432a0b900f33a89441f7fee02fc06c9/test_runner/fixtures/neon_fixtures.py#L704-L721

Logging: we don't have any upload timeout, so according to the log messages seen, no single task ever failed; a failure would have been an `info!` level message.

Tests run from Europe use s3 buckets in `us-west-2`. Without this, we might be fine with the 10s graceful timeouts, so perhaps this is a good configuration for now.
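The 10s graceful-shutdown behavior discussed above amounts to bounding how long we wait for in-flight uploads before giving up on them. A minimal asyncio sketch of that idea (hypothetical illustration; the pageserver itself is Rust, and the names here are invented):

```python
import asyncio


async def graceful_shutdown(pending_uploads, timeout=10.0):
    """Wait up to `timeout` seconds for in-flight upload tasks.

    Returns (done, pending): uploads that did not finish in time stay
    pending and would be re-scheduled on the next startup, mirroring
    the "s3 uploads were not completed in 10s graceful shutdown"
    behavior described above.
    """
    done, pending = await asyncio.wait(pending_uploads, timeout=timeout)
    return done, pending
```

With cross-region latency (tests in Europe, buckets in `us-west-2`) more tasks land in the `pending` set, so the remote_consistent_lsn the test waits for never advances to the expected value.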