Open bayandin opened 1 year ago
We also have a small number of unexpected on-demand downloads in Prod.
The unexpected on-demand downloads happen shortly after we have CPU usage spikes. (I'd guess it's irrelevant that these usage spikes are ca 1 day apart?)
There is a strong correlation with an increase in the time spent on logical size calculation.
Zooming into one of them, just to demonstrate the correlation
From @bayandin's gist:
/github/home/.cache/pypoetry/virtualenvs/neon-_pxWMzVK-py3.9/lib/python3.9/site-packages/allure_commons/_allure.py:221: in __call__
return self._fixture_function(*args, **kwargs)
test_runner/fixtures/neon_fixtures.py:1122: in neon_env_builder
yield builder
test_runner/fixtures/neon_fixtures.py:850: in __exit__
self.env.pageserver.assert_no_errors()
test_runner/fixtures/neon_fixtures.py:1691: in assert_no_errors
assert not errors
E AssertionError: assert not ['2023-05-16T13:41:03.876833Z WARN eviction_task{tenant_id=92484835d2e280c5a66c3264c35b1d0f timeline_id=e9239d7e2cdba...AC0000400C0200000000-000000067F000032AC000040140000000022__00000000036FE179-0000000003DCB789 for task kind Eviction\n']
We decided to put this on hold until https://github.com/orgs/neondatabase/projects/38 (async get_value_reconstruct_data) is merged, and then see whether that fixes the issue.
Here's a dump of my internal notes on this issue:
Hypothesis: the logical size computation does IOs, and it does them inside the tokio executor threads ⇒ we only do 8 IOs at a time, and we block the executor threads while doing it ⇒ other tasks get delayed ⇒ …???… ⇒ the imitated accesses come too late ⇒ evictions
We should try to validate the hypothesis first, but how? Can we increase the number of tokio threads?
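To make the first step of the hypothesis concrete, here is a toy model of "blocking IO on an executor thread delays other tasks". It is only a sketch: it uses Python's asyncio event loop as a stand-in for a tokio executor thread, and every name in it (ticker, blocking_io, offloaded_io, measure) is hypothetical and not taken from the pageserver code. The ticker plays the role of a task that must run on time, such as the access imitation.

```python
import asyncio
import time


async def ticker(period: float, ticks: int) -> float:
    """Wake up every `period` seconds; return the worst observed lateness."""
    loop = asyncio.get_running_loop()
    worst = 0.0
    deadline = loop.time()
    for _ in range(ticks):
        deadline += period
        await asyncio.sleep(max(0.0, deadline - loop.time()))
        worst = max(worst, loop.time() - deadline)
    return worst


async def blocking_io(duration: float) -> None:
    # Hypothetical stand-in for synchronous IO done directly on the
    # executor thread: nothing else can run until it returns.
    time.sleep(duration)


async def offloaded_io(duration: float) -> None:
    # Same work, moved to a worker thread so the event loop stays free.
    await asyncio.to_thread(time.sleep, duration)


async def measure(io_task) -> float:
    """Run the ticker next to one IO task; return the ticker's worst lateness."""
    tick = asyncio.create_task(ticker(period=0.01, ticks=20))
    await asyncio.sleep(0)  # yield once so the ticker arms its first deadline
    await io_task(0.2)
    return await tick


if __name__ == "__main__":
    print("worst lateness, blocking IO: ", asyncio.run(measure(blocking_io)))
    print("worst lateness, offloaded IO:", asyncio.run(measure(offloaded_io)))
```

In this model the ticker misses its deadlines by roughly the duration of the blocking call, while the offloaded variant keeps it close to on time. It says nothing about whether this is what actually happens in the pageserver; it only shows that the mechanism is plausible.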
Also, isn’t there a semaphore for logical size computation that prevents it from happening concurrently? So, are we actually only blocking 1 tokio thread?
Ah hm, actually only the synthetic size calculation imitation rate-limits, but the last_record_lsn size calculation doesn't. So, 1000 timelines doing it at the same time will contend for the $ncpus executor threads.
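The difference between a rate-limited and an unlimited calculation can be sketched like this. Again, this is an asyncio toy model under stated assumptions, not the pageserver implementation: peak_concurrency, calculate_size and one_timeline are hypothetical names, and the short sleep stands in for the real size calculation.

```python
import asyncio
from typing import Optional


async def peak_concurrency(n_timelines: int, permits: Optional[int]) -> int:
    """Run n_timelines fake size calculations; return the peak number in flight."""
    sem = asyncio.Semaphore(permits) if permits is not None else None
    in_flight = 0
    peak = 0

    async def calculate_size() -> None:
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.001)  # hypothetical stand-in for the real work
        in_flight -= 1

    async def one_timeline() -> None:
        if sem is None:
            # No limit: every timeline starts its calculation immediately.
            await calculate_size()
        else:
            # Limited: at most `permits` calculations run at once.
            async with sem:
                await calculate_size()

    await asyncio.gather(*(one_timeline() for _ in range(n_timelines)))
    return peak


if __name__ == "__main__":
    print("peak without semaphore:", asyncio.run(peak_concurrency(50, None)))
    print("peak with 1 permit:    ", asyncio.run(peak_concurrency(50, 1)))
```

Without a semaphore, all calculations are in flight at once and compete for whatever executor capacity exists; with one permit, only one runs at a time and the rest wait their turn.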
Idea for the ??? in the hypothesis above:
now
before we imitate the accesses
test_threshold_based_eviction#test_threshold_based_eviction
Error message:
Pageserver warning itself (I broke the long line into several smaller ones to make it more readable):
See https://neon-github-public-dev.s3.amazonaws.com/reports/pr-4092/debug/4883913165/index.html#suites/3fc871d9ee8127d8501d607e03205abb/2cff07ef9a017a6