neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

disk usage based eviction: consider relative LRU order #5304

Closed koivunej closed 8 months ago

koivunej commented 1 year ago

Currently our disk usage based eviction evicts layers in the absolute order of their most recent access. This means that, for example, a single new timeline will first see all other timelines' layers get evicted before one of its own layers is evicted.

What if, instead of using the absolute timestamps mentioned below, we computed a 0..1 f32 based on how relatively recently the layer has been accessed?

"Relatively recently": for layer x, its relative_recency is 1.0 - (x.last_activity_ts.as_secs_f32() / oldest_layer_access.as_secs_f32()), where both values are measured as elapsed time since the access, so that 1.0 corresponds to the most recently accessed layer and 0.0 to the oldest access.

This relative measure would put all timelines on equal footing. A fast-growing timeline would, at worst, see some slower performance because of thrashed layers, in exchange for the overall health of the system.
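The formula above can be sketched as follows. This is a hypothetical sketch, not the pageserver implementation: the `Layer` struct is invented for illustration, and `last_activity` / `oldest_access` are interpreted as elapsed durations since the access, which is the reading that makes 1.0 land on the most recent access and 0.0 on the oldest.

```rust
use std::time::Duration;

/// Hypothetical layer record: `last_activity` is the elapsed time
/// since the layer was last accessed (not an absolute timestamp).
struct Layer {
    last_activity: Duration,
}

/// Relative recency in 0.0..=1.0: 1.0 for the most recently accessed
/// layer, 0.0 for the layer with the oldest access in the tenant.
fn relative_recency(layer: &Layer, oldest_access: Duration) -> f32 {
    if oldest_access.is_zero() {
        // Every layer was accessed "now"; treat them all as maximally recent.
        return 1.0;
    }
    1.0 - (layer.last_activity.as_secs_f32() / oldest_access.as_secs_f32())
}

fn main() {
    let oldest = Duration::from_secs(100);
    let newest = Layer { last_activity: Duration::from_secs(0) };
    let coldest = Layer { last_activity: Duration::from_secs(100) };
    let middle = Layer { last_activity: Duration::from_secs(50) };
    println!("{}", relative_recency(&newest, oldest));  // 1.0
    println!("{}", relative_recency(&coldest, oldest)); // 0.0
    println!("{}", relative_recency(&middle, oldest));  // 0.5
}
```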

Relevant parts:

koivunej commented 1 year ago

Perhaps no floats are needed; the relative_recency could just be an index into the per-tenant list.

EDIT: except that such an index is not invariant to tenants having different numbers of layers, so we would still need to normalize it to an f32.
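The index-based variant with the normalization mentioned in the edit could be sketched like this (a hypothetical helper, assuming each tenant's layers are kept in a list ordered from oldest to most recent access):

```rust
/// Given a layer's position in its tenant's access-ordered list
/// (0 = oldest access, tenant_layer_count - 1 = most recent),
/// normalize the rank to 0.0..=1.0 so that tenants with different
/// layer counts become comparable.
fn relative_recency_from_rank(rank: usize, tenant_layer_count: usize) -> f32 {
    if tenant_layer_count <= 1 {
        return 1.0; // a lone layer is trivially the most recent
    }
    rank as f32 / (tenant_layer_count - 1) as f32
}

fn main() {
    // A tenant with 3 layers and one with 300 both span the same range.
    println!("{}", relative_recency_from_rank(0, 3));     // 0 (coldest)
    println!("{}", relative_recency_from_rank(2, 3));     // 1 (hottest)
    println!("{}", relative_recency_from_rank(299, 300)); // 1 (hottest)
}
```

Without the division, a rank of 2 would mean "hottest layer" for a 3-layer tenant but "nearly coldest" for a 300-layer tenant, which is the invariance problem noted above.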

koivunej commented 1 year ago

Discussed this with Heikki. He suggested reviewing what kind of algorithms there are in literature.

koivunej commented 9 months ago

Next steps:

koivunej commented 9 months ago

Discussion from the planning meeting:

koivunej commented 9 months ago

This has been tested a bit on staging. For the most part the results, via the bayadin/1tb benchmark and ballast files, are encouraging:

It was interesting that in some cases the absolute ordering did a better job by only evicting from a single fast-growing tenant and not from any idle ones. I suspect this is because of imitation and the per-timeline eviction task: in our staging there is very little activity, and most tenants on ps-7.us-east-2 are at their imitated resident sizes. Imitation runs quite often compared to the duration of downloading the layers needed for these large tenants, so the absolute access order correctly saw the used-once layers of a large tenant here:

```
2024-01-22T09:38:59.102599Z  INFO disk_usage_eviction_task:iteration{iteration_no=4646}: absolute accessed selection summary: selected 384 layers of 48.1GiB up to (Some(SystemTime { tv_sec: 1705914774, tv_nsec: 815487540 }), Some(0.16)):
- 384 layers: <bench1> (48.1GiB)
2024-01-22T09:38:59.104929Z  INFO disk_usage_eviction_task:iteration{iteration_no=4646}: relative accessed selection summary: selected 532 layers of 48.2GiB up to (Some(SystemTime { tv_sec: 1705915051, tv_nsec: 836883713 }), Some(0.08)):
- 192 layers: <bench1> (24.0GiB)
- 165 layers: <bench2> (20.7GiB)
- 8 layers: <unrelated1> (56.1MiB)
- 7 layers: <unrelated2> (484.9MiB), <unrelated3> (258.1MiB)
- 6 layers: <unrelated4> (1.4MiB)
- 5 layers: 5 tenants 429.2MiB in total 25 layers
- 4 layers: <unrelated5> (114.7MiB), <unrelated6> (1.3MiB)
- 3 layers: 7 tenants 579.7MiB in total 21 layers
- 2 layers: 22 tenants 1.3GiB in total 44 layers
- 1 layers: 49 tenants 330.7MiB in total 49 layers
```

Full logs can be found via this search.
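One way to picture the difference between the two selection summaries above is the sort key used for the global candidate list. The sketch below uses an invented `Candidate` struct and toy data, not the pageserver's actual types: absolute ordering sorts every candidate by time since last access, which drains a globally idle tenant completely before touching a freshly accessed one, while relative ordering sorts by the per-tenant normalized recency, so each tenant's coldest layers go first and the tenants interleave.

```rust
/// Hypothetical eviction candidate for this sketch: `abs_age_secs` is
/// seconds since the layer's last access, `relative_recency` is the
/// per-tenant 0.0..=1.0 measure discussed in this issue.
#[derive(Debug, Clone)]
struct Candidate {
    tenant: &'static str,
    abs_age_secs: u64,
    relative_recency: f32,
}

/// Toy data: an idle tenant whose accesses are all old, next to a
/// freshly created tenant whose accesses are all recent.
fn sample() -> Vec<Candidate> {
    vec![
        Candidate { tenant: "idle", abs_age_secs: 5000, relative_recency: 1.0 },
        Candidate { tenant: "new", abs_age_secs: 10, relative_recency: 0.0 },
        Candidate { tenant: "idle", abs_age_secs: 9000, relative_recency: 0.0 },
        Candidate { tenant: "new", abs_age_secs: 1, relative_recency: 1.0 },
    ]
}

/// Absolute ordering: the oldest absolute access is evicted first.
fn absolute_order(mut cs: Vec<Candidate>) -> Vec<&'static str> {
    cs.sort_by_key(|c| std::cmp::Reverse(c.abs_age_secs));
    cs.iter().map(|c| c.tenant).collect()
}

/// Relative ordering: the lowest per-tenant recency is evicted first.
fn relative_order(mut cs: Vec<Candidate>) -> Vec<&'static str> {
    cs.sort_by(|a, b| a.relative_recency.total_cmp(&b.relative_recency));
    cs.iter().map(|c| c.tenant).collect()
}

fn main() {
    // Absolute order drains the idle tenant before touching the new one.
    println!("{:?}", absolute_order(sample())); // ["idle", "idle", "new", "new"]
    // Relative order interleaves the two tenants.
    println!("{:?}", relative_order(sample())); // ["new", "idle", "idle", "new"]
}
```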

koivunej commented 8 months ago

The biggest problem was the 10min "layer collection", which happened together with layer deletions hanging. So far only #6634 exists as a partial solution. I am using quotes around "layer collection" because the time may have been spent in both the "layer collection" and the absolute-order reporting. However, for this case the reporting was easy since there was just one tenant (the log message above).

The hanging layer deletions can be explained by the spawn_blocking pool having a long queue, but 10min does seem very long even for that. The whole node was CPU-exhausted, hovering at around 90% CPU utilization, with user+system totaling 50% and the rest irq and iowait. I think that points to "200 downloads are too many".

koivunej commented 8 months ago

Consideration complete; I didn't find anything else here needing to be updated.