neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

Test test_threshold_based_eviction is flaky (unexpectedly on-demand downloading remote layer remote) #4154

Status: Open. Opened by bayandin 1 year ago

bayandin commented 1 year ago

test_threshold_based_eviction#test_threshold_based_eviction:

Error message:

AssertionError: assert not ['2023-05-04T14:42:57.586772Z  WARN eviction_task{tenant_id=a4f5620478d84e0feff340fb651f8bfb timeline_id=f0d71e20c1b6f...AC0000400C00FFFFFFFF-000000067F000032AC000040140000000008__0000000001696070-0000000003DC76E9 for task kind Eviction\n']

Pageserver warning itself (I broke the long line into several smaller ones to make it more readable):

2023-05-04T14:42:57.586772Z  
WARN eviction_task{tenant_id=a4f5620478d84e0feff340fb651f8bfb timeline_id=f0d71e20c1b6fcbc94a3d391259a7fc1}:
eviction_iteration{policy_kind="LayerAccessThreshold"}: 
unexpectedly on-demand downloading remote layer remote f0d71e20c1b6fcbc94a3d391259a7fc1/000000067F000032AC0000400C00FFFFFFFF-000000067F000032AC000040140000000008__0000000001696070-0000000003DC76E9 for task kind Eviction

See https://neon-github-public-dev.s3.amazonaws.com/reports/pr-4092/debug/4883913165/index.html#suites/3fc871d9ee8127d8501d607e03205abb/2cff07ef9a017a6
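For triage, the warning quoted above can be split into its fields. The following is a minimal sketch, not code from the Neon repository: the regular expression is derived only from the single log line in this report, and the names `WARNING_RE` and `parse_unexpected_download` are hypothetical.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern, inferred from the one warning quoted in this issue.
# It assumes 32-hex-digit tenant/timeline IDs and the exact message wording.
WARNING_RE = re.compile(
    r"tenant_id=(?P<tenant_id>[0-9a-f]{32}) "
    r"timeline_id=(?P<timeline_id>[0-9a-f]{32})\}"
    r".*unexpectedly on-demand downloading remote layer remote "
    r"(?P<layer>\S+) for task kind (?P<task_kind>\w+)",
    re.DOTALL,  # tolerate a log line that was wrapped for readability
)


def parse_unexpected_download(line: str) -> Optional[Dict[str, str]]:
    """Return the captured fields, or None if the line is a different message."""
    match = WARNING_RE.search(line)
    return match.groupdict() if match else None


sample = (
    "2023-05-04T14:42:57.586772Z  WARN eviction_task{"
    "tenant_id=a4f5620478d84e0feff340fb651f8bfb "
    "timeline_id=f0d71e20c1b6fcbc94a3d391259a7fc1}:"
    'eviction_iteration{policy_kind="LayerAccessThreshold"}: '
    "unexpectedly on-demand downloading remote layer remote "
    "f0d71e20c1b6fcbc94a3d391259a7fc1/"
    "000000067F000032AC0000400C00FFFFFFFF-000000067F000032AC000040140000000008"
    "__0000000001696070-0000000003DC76E9 for task kind Eviction"
)

fields = parse_unexpected_download(sample)
if fields is not None:
    for name, value in fields.items():
        print(f"{name}: {value}")
```

Grouping such records by `tenant_id` and `task_kind` across several failed runs may help show whether the same layers are involved each time.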

problame commented 1 year ago

We also have a small number of unexpected on-demand downloads in Prod.

The unexpected on-demand downloads happen shortly after CPU usage spikes. (I'd guess it's irrelevant that these spikes are ca. 1 day apart?)

image
problame commented 1 year ago

There is a strong correlation with an increase in the time spent on logical size calculation.

image
problame commented 1 year ago

Zooming into one of them, just to demonstrate the correlation

image
koivunej commented 1 year ago

From @bayandin's gist:

/github/home/.cache/pypoetry/virtualenvs/neon-_pxWMzVK-py3.9/lib/python3.9/site-packages/allure_commons/_allure.py:221: in __call__
    return self._fixture_function(*args, **kwargs)
test_runner/fixtures/neon_fixtures.py:1122: in neon_env_builder
    yield builder
test_runner/fixtures/neon_fixtures.py:850: in __exit__
    self.env.pageserver.assert_no_errors()
test_runner/fixtures/neon_fixtures.py:1691: in assert_no_errors
    assert not errors
E   AssertionError: assert not ['2023-05-16T13:41:03.876833Z  WARN eviction_task{tenant_id=92484835d2e280c5a66c3264c35b1d0f timeline_id=e9239d7e2cdba...AC0000400C0200000000-000000067F000032AC000040140000000022__00000000036FE179-0000000003DCB789 for task kind Eviction\n']
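The traceback shows the test failing at teardown because the pageserver log contains a warning, not because the test body itself failed. The sketch below illustrates that kind of check in general terms. It is an assumption about the mechanism, not the actual code in `test_runner/fixtures/neon_fixtures.py`; the function name `find_unexpected_log_lines` and the allowlist parameter are hypothetical.

```python
import re
from typing import Iterable, List


def find_unexpected_log_lines(
    lines: Iterable[str], allowed_patterns: Iterable[str]
) -> List[str]:
    """Return WARN/ERROR lines that match none of the allowed regex patterns."""
    allowed = [re.compile(pattern) for pattern in allowed_patterns]
    flagged = []
    for line in lines:
        # Only warnings and errors are candidates for failing the test.
        if " WARN " not in line and " ERROR " not in line:
            continue
        # Skip lines that the test has explicitly declared as expected.
        if any(pattern.search(line) for pattern in allowed):
            continue
        flagged.append(line)
    return flagged


log = [
    "2023-05-16T13:41:03.100000Z  INFO eviction_task: iteration complete\n",
    "2023-05-16T13:41:03.876833Z  WARN eviction_task: unexpectedly "
    "on-demand downloading remote layer remote for task kind Eviction\n",
]

# With an empty allowlist, the warning is flagged and the assertion would fail.
errors = find_unexpected_log_lines(log, allowed_patterns=[])
print(len(errors))

# Allowlisting the message silences the check, but only hides the flakiness.
quiet = find_unexpected_log_lines(
    log, allowed_patterns=[r"unexpectedly on-demand downloading remote layer"]
)
print(len(quiet))
```

Because the check runs over the whole log, a single stray warning from a background task is enough to fail an otherwise passing test, which matches the intermittent failures reported here.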
problame commented 1 year ago

We decided to put this on hold until https://github.com/orgs/neondatabase/projects/38 (async get_value_reconstruct_data) is merged, and then see whether that fixes the issue.

Here's a dump of my internal notes on this issue:


arssher commented 1 year ago

It happened one more time: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-4731/5585100850/index.html#categories/c1b51b30b43e587922d3feecc3c7f502/c11fb9d88c3ae9b/