neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

Tracing for getpage@lsn when it takes a long time #5448

Open koivunej opened 11 months ago

koivunej commented 11 months ago

If a getpage@lsn request runs long, meaning it takes more than X seconds, we should log a warning describing why it took so long (where the time was spent). The breakdown could be high level, as total time spent for:

Individual durations should also be exposed via global histograms of getpage execution, if they don't exist already.

The purpose of this logging is to let us understand right away why something was slow.
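A minimal sketch of the idea, using only the Rust standard library. Everything here is an assumption for illustration: `PhaseBreakdown`, its methods, and the phase labels are hypothetical names, not existing pageserver code, and the phases from the list above are not filled in. The caller passes its own labels, and the warning text is returned as a `String` so the caller can hand it to whatever logging is in use.

```rust
use std::time::Duration;

/// Hypothetical accumulator: total time per named phase of one request.
#[derive(Debug, Default)]
pub struct PhaseBreakdown {
    // (phase label, accumulated time); a short linear list is enough
    // when there are only a handful of phases.
    phases: Vec<(&'static str, Duration)>,
}

impl PhaseBreakdown {
    /// Add `elapsed` to the running total of `phase`.
    pub fn record(&mut self, phase: &'static str, elapsed: Duration) {
        match self.phases.iter().position(|(name, _)| *name == phase) {
            Some(i) => self.phases[i].1 += elapsed,
            None => self.phases.push((phase, elapsed)),
        }
    }

    /// Sum over all recorded phases.
    pub fn total(&self) -> Duration {
        self.phases.iter().map(|(_, d)| *d).sum()
    }

    /// Returns a warning message only if the total exceeds `threshold`
    /// (the "X seconds" above); otherwise returns `None`.
    pub fn slow_request_warning(&self, threshold: Duration) -> Option<String> {
        let total = self.total();
        if total <= threshold {
            return None;
        }
        let parts: Vec<String> = self
            .phases
            .iter()
            .map(|(name, d)| format!("{}={}ms", name, d.as_millis()))
            .collect();
        Some(format!(
            "slow getpage@lsn: total={}ms ({})",
            total.as_millis(),
            parts.join(", ")
        ))
    }
}
```

The request path would call `record` around each phase and check `slow_request_warning` once at the end, so the fast path only pays for a few additions.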

Original slack thread: https://neondb.slack.com/archives/C05NXJFNRPA/p1696261625702299?thread_ts=1696250393.840899&cid=C05NXJFNRPA

koivunej commented 10 months ago

If one does not go for a generic solution (think of some Java hierarchical stopwatch), I think this could be implemented rather efficiently and still be mergeable, for example across all of the getpage requests a basebackup needs to do, while still capturing, for example, outliers.

For example, there is most likely a reasonable figure for the average number of layers accessed => those are kept on the "stack" (part of the future on the heap), and if we get to really detrimental cases (>10s) then we don't need to mind spilling to a new heap allocation. Regardless, it is important that we catch enough information about such cases to fix them.
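The inline-then-spill idea could be sketched like this, again with the standard library only. The type `LayerTimings` and the capacity `INLINE_LAYERS = 8` are made up for illustration; the real average number of layers accessed would have to be measured.

```rust
use std::time::Duration;

/// Hypothetical inline capacity; a real value should come from the
/// measured average number of layers accessed per request.
pub const INLINE_LAYERS: usize = 8;

/// Per-layer timings stored inline in the struct (and therefore inside
/// the future that owns it), spilling to a Vec only for outliers.
#[derive(Debug)]
pub struct LayerTimings {
    inline: [Duration; INLINE_LAYERS],
    len: usize,             // number of used inline slots
    spilled: Vec<Duration>, // an empty Vec does not allocate
}

impl LayerTimings {
    pub fn new() -> Self {
        Self {
            inline: [Duration::ZERO; INLINE_LAYERS],
            len: 0,
            spilled: Vec::new(),
        }
    }

    /// Record one layer access; allocates only after the inline
    /// slots are full.
    pub fn push(&mut self, elapsed: Duration) {
        if self.len < INLINE_LAYERS {
            self.inline[self.len] = elapsed;
            self.len += 1;
        } else {
            self.spilled.push(elapsed);
        }
    }

    /// True once at least one timing went to the heap.
    pub fn has_spilled(&self) -> bool {
        !self.spilled.is_empty()
    }

    /// All timings in insertion order.
    pub fn iter(&self) -> impl Iterator<Item = Duration> + '_ {
        self.inline[..self.len]
            .iter()
            .chain(self.spilled.iter())
            .copied()
    }

    pub fn total(&self) -> Duration {
        self.iter().sum()
    }
}
```

Typical requests never touch the allocator here; only a request that visits more than `INLINE_LAYERS` layers pays for the spill, which is negligible next to a request that is already taking seconds.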