@ololobus, could you share what you know? :)
Here is an example of startup log with main step timeouts from staging: https://github.com/neondatabase/neon/pull/1514#pullrequestreview-943833781
And here is from my local machine:
2022-04-25 17:22:18.097 UTC [main] INFO: starting cluster #frosty-cherry-131215, operation #9248c436-4e41-4dab-911c-c1a8e073b99c
2022-04-25 17:22:18.097 UTC [main] INFO: starting safekeepers syncing
...
2022-04-25 17:22:18.197 UTC [main] INFO: getting basebackup@0/16C4450 from pageserver host=172.30.42.11 port=6400
...
2022-04-25 17:22:18.250 GMT [20] LOG: starting PostgreSQL 14.2 on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
...
2022-04-25 17:22:18.367 UTC [main] INFO: finished configuration of cluster #frosty-cherry-131215
...
2022-04-25 17:22:19.311 UTC [tokio-runtime-worker] INFO: serving /ready GET request
Main steps are faster, but notice the ~1 second gap before the first /ready request from the console! I think this is because we go to k8s for a list of nodes / entry points available to reach our compute, and k8s metadata requests can be really slow. I'll add more logs for now, but proper tracing would help a lot.
If anyone wants to get compute logs, you can use Grafana, but it's hard to pick out only the startup log there. I use kubectl (use Teleport to get access) for this:
kubectl logs deployment/compute-node-frosty-cherry-131215 compute-node | less
just replace frosty-cherry-131215 with the id of your running cluster. It works the same way locally, on staging, and in prod.
On second thought, it could be that the k8s Service is just getting ready, so the pod is already running but we cannot connect to it from the outside. I'm going to add better duration logging to the console: https://github.com/neondatabase/cloud/pull/1104
We'll try to do this in the second week of May.
This is an old Epic with a number of tasks closed. I suggest closing it and, if there are plans for improving things on the postgres/compute side, logging them separately. WDYT, @vadim2404?
Let me discuss it with @hlinnaka today.
Here's a fresh analysis. I captured a trace of a cold start, and annotated the important parts of it to give a high-level breakdown of where the time is spent.
The trace was captured from my local dev environment, so it's not fully representative of the real cloud setup. In particular, network roundtrips are much faster locally than in the cloud. I have compared this with partial traces from staging to ensure that this is nevertheless in the right ballpark. In the comments, I note a few places where I'm seeing a notable difference between local dev environment and staging.
This also illustrates how thorny the "long tail" of cold startups can be. If we do all the tricks listed here, we could probably get down to < 500 ms startup latency. But if any of the steps in the critical path take longer than usual, you miss the target. If you rely on caching to make any of the steps fast, but you have a cache miss, you miss the target.
The trace is from this invocation, when the compute was not yet running:
$ time psql "postgres://hlinnaka@ep-royal-bar-912840.localtest.me:4433/neondb" -c "select 1"
?column?
----------
1
(1 row)
real 0m5.305s
user 0m0.073s
sys 0m0.026s
TLS is negotiated using the Postgres protocol, and we read the project name from the SNI info in the handshake.
In this trace from my local environment, this only took about 6 ms, but in staging it typically takes about 50-100 ms. This does not include the TCP handshake between the proxy and the client; the timer is started only after the TCP connection has been established. This step is very sensitive to the network latency between the proxy and the client.
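For a rough client-side reproduction of just this step, a recent OpenSSL that supports -starttls postgres can drive the Postgres SSLRequest handshake, and we can time it (using the endpoint and port from the psql invocation above; note that, unlike the proxy-side timer, this measurement also includes the TCP handshake):

time openssl s_client -connect ep-royal-bar-912840.localtest.me:4433 -starttls postgres -servername ep-royal-bar-912840.localtest.me </dev/null

This is only a sketch for eyeballing handshake latency from different client locations, not what the proxy measures internally.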
What we can do:
The proxy makes an RPC call to the control plane to request the authentication information for the endpoint.
In this trace this was pretty quick, only about 4 ms, but in staging I've seen it take over 100 ms. This involves an RPC call from the proxy to the control plane, and the control plane runs two SQL queries in the console database.
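To see where such an RPC spends its time, curl's timing variables can split a single HTTP round trip into DNS, connect, and time-to-first-byte. The URL here is only a placeholder, not the real control plane API path:

curl -s -o /dev/null -w 'dns=%{time_namelookup} connect=%{time_connect} ttfb=%{time_starttransfer} total=%{time_total}\n' https://console.example.internal/get-auth-info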
What can we do:
This is the time it takes for the pod to start up and for compute_ctl to start executing, after the Kubernetes objects have been created.
This can be addressed by pre-created computes.
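As a rough way to measure this window for a concrete compute, the pod's creation timestamp can be compared with the container's startedAt (pod name is a placeholder; this span also includes scheduling and image pull, so it is only an approximation):

kubectl get pod <compute-pod-name> -o jsonpath='{.metadata.creationTimestamp} {.status.containerStatuses[0].state.running.startedAt}{"\n"}'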
The compute contacts all the safekeepers, copies any last missing part of the WAL between them, and determines the current end of the branch.
This could be eliminated if we ensured that the safekeepers are synchronized on graceful shutdown.
The compute requests a "basebackup" tarball from the pageserver.
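The compute log already brackets this step: the "getting basebackup" INFO line is followed by Postgres's own "starting PostgreSQL" line once the tarball has been fetched and unpacked, so pulling those two timestamps gives a rough upper bound on the fetch (deployment name is the example one from earlier):

kubectl logs deployment/compute-node-frosty-cherry-131215 compute-node | grep -E 'getting basebackup|starting PostgreSQL'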
Things we can do to speed this up:
This is the time it takes for the Postgres process to start up, and to start accepting connections.
This took over 700 ms in this trace for some reason, which is a lot more than what I'm typically seeing on staging. A typical value in staging is around 100-200 ms.
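For a baseline to compare that 700 ms against, a plain local Postgres of the same major version can be started and timed with pg_ctl (data directory and log file are placeholders; the compute image has a different binary and extension set, so this is only a sanity check, not an apples-to-apples comparison):

time pg_ctl -D /tmp/pg14-data -l /tmp/pg14.log -w start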
Things we can do:
In this period, compute_ctl opens a connection to Postgres so that it can run the handle_* steps.
This took about 200 ms in this trace.
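compute_ctl's actual retry logic is its own code; purely to illustrate what "wait until Postgres accepts connections" amounts to, a shell equivalent would be a pg_isready loop (host and port are placeholders):

until pg_isready -h 127.0.0.1 -p 55432 -t 1; do sleep 0.05; done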
Things we can do:
This step is consistently much slower than the other handle_* stages, which is a bit surprising.
Things we can do:
During this time, Postgres is fully started up, but the Kubernetes Endpoint object has not yet been created for the Service, so the Service cannot be reached by other pods. The control plane polls the compute_ctl's HTTP port until it can access it. That's what all the failing "HTTP computenode GET" requests below are.
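The polling itself is conceptually just a loop; here is a minimal sketch of what the control plane is effectively doing, with the pod address and compute_ctl HTTP port as placeholders and /ready as the endpoint seen in the startup log above:

until curl -sf http://<compute-pod-ip>:<compute-ctl-http-port>/ready; do sleep 0.1; done

Until the Endpoint object exists, these requests fail because the Service cannot be reached, which is what the failing "HTTP computenode GET" entries reflect.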
Things we can do:
Make Kubernetes faster?
Pre-create computes
Now the compute is fully up and running. The proxy can now establish the libpq connection from proxy to the compute pod.
Steps (3) and (9) above will be eliminated by pre-created computes. The rest needs other work; I've listed some ideas under each step.
There are some interesting challenges though:
(3) Pre-created computes
Probably the way to go for now. We have varying CPU/memory requirements for the projects, so a good scheduling decision has to be made that takes the currently available resources into account: we naturally can't schedule two projects with a gigabyte memory requirement each onto a node that has only a gigabyte of memory available. At the same time, computes on existing nodes can scale up, so I guess some sort of synchronization needs to happen, which then starts to take time. Otherwise it's difficult to achieve high resource utilization.
We now have working OpenTelemetry in staging again, so here's a capture of a cold start from staging:
Closing this one in favor of neondatabase/neon#4695.
This starts with the
and will continue with the optimisations that will come out of this, draft list: