neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0
14.78k stars, 430 forks

lots of time wasted on `count_deltas()` #6861

Closed problame closed 6 months ago

problame commented 8 months ago

Problem

original thread: https://neondb.slack.com/archives/C033RQ5SPDH/p1708513450565049

Now that the flamegraphs are fixed, I took one on ps-2 ap-southeast-1 to investigate the elevated CPU usage after enabling tokio-epoll-uring there. That investigation isn't the subject of this thread, though; the subject is the general finding of where that PS is spending its time. LayerMap::count_deltas inside time_for_new_image_layer completely dominates the CPU usage there. AFAICT it is called for every tenant, even if the layer map hasn't changed.

This is wasteful.

[Flamegraph image: ps-2 ap-southeast-1 tokio-epoll-uring 60s 2]

Solution

If the layer map and partitioning are the same as in an earlier call, early-exit in time_for_new_image_layer to avoid the call to count_deltas().
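To make the proposed early-exit concrete, here is a minimal sketch in Rust. It is not the pageserver's actual code: the `generation` counter, the `partitioning_id` field, the `CheckKey` struct, the `threshold` field, and the simplified argument-free `count_deltas()` are all assumptions made up for illustration. The idea is only that the inputs of the last "no image layer needed" decision are remembered, and the expensive scan is skipped while those inputs are unchanged.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for the layer map; only what the sketch needs.
struct LayerMap {
    /// Assumed change counter, bumped on every mutation.
    generation: u64,
    /// Number of delta layers (real code would hold the layers themselves).
    deltas: usize,
    /// Counts how often the expensive scan ran (demonstration only).
    scans: Cell<usize>,
}

impl LayerMap {
    fn new() -> Self {
        LayerMap { generation: 0, deltas: 0, scans: Cell::new(0) }
    }

    fn add_delta(&mut self) {
        self.deltas += 1;
        self.generation += 1;
    }

    /// Simplified stand-in for the expensive `count_deltas()` call.
    fn count_deltas(&self) -> usize {
        self.scans.set(self.scans.get() + 1);
        self.deltas
    }
}

/// Inputs of the last check that decided "no image layer needed".
#[derive(Clone, Copy, PartialEq, Eq)]
struct CheckKey {
    layer_map_generation: u64,
    partitioning_id: u64,
}

/// Hypothetical, heavily reduced timeline.
struct Timeline {
    layers: LayerMap,
    /// Assumed identifier that changes whenever the partitioning changes.
    partitioning_id: u64,
    /// Assumed delta-count threshold for creating an image layer.
    threshold: usize,
    last_negative_check: Option<CheckKey>,
}

impl Timeline {
    fn new(threshold: usize) -> Self {
        Timeline {
            layers: LayerMap::new(),
            partitioning_id: 0,
            threshold,
            last_negative_check: None,
        }
    }

    fn time_for_new_image_layer(&mut self) -> bool {
        let key = CheckKey {
            layer_map_generation: self.layers.generation,
            partitioning_id: self.partitioning_id,
        };
        // Early exit: nothing changed since the last negative answer.
        if self.last_negative_check == Some(key) {
            return false;
        }
        let needed = self.layers.count_deltas() >= self.threshold;
        // Only cache negative answers; a positive one leads to a new
        // image layer, which changes the layer map anyway.
        self.last_negative_check = if needed { None } else { Some(key) };
        needed
    }
}
```

In this sketch, calling `time_for_new_image_layer()` repeatedly on an unchanged timeline runs the scan once; any `add_delta()` or a change of `partitioning_id` invalidates the cached answer and triggers a fresh scan.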

Tasks

- [ ] https://github.com/neondatabase/neon/pull/6863
- [ ] https://github.com/neondatabase/neon/pull/6862
- [ ] https://github.com/neondatabase/neon/pull/6868
- [ ] https://github.com/neondatabase/neon/pull/7230
- [ ] find staging example that shows similar pattern

hlinnaka commented 8 months ago

The new compaction code in https://github.com/neondatabase/neon/pull/6830/ no longer calls count_deltas. (It needs testing to see if it introduces other problems, of course.)

problame commented 8 months ago

Yeah, aware, @arpad-m is going to work on compaction, but, it'll be many more weeks until it lands, I think.

problame commented 6 months ago

@VladLazar just in case you didn't see it, my PR to avoid count_deltas() is here: https://github.com/neondatabase/neon/pull/6868

Feel free to take it over

VladLazar commented 6 months ago

Update:

VladLazar commented 6 months ago

Looks like https://github.com/neondatabase/neon/pull/7230 helped here. Generated another flamegraph this morning and it's not exhibiting the original issue:

(ask me if you want the svg - can't add it here for some reason)

[Flamegraph image: 2024-04-08-ps-2-ap-southeast-1-perf-check-count-delta]