@ololobus, could you share what you know? :)
Here is an example of startup log with main step timeouts from staging: https://github.com/neondatabase/neon/pull/1514#pullrequestreview-943833781
And here is from my local machine:
2022-04-25 17:22:18.097 UTC [main] INFO: starting cluster #frosty-cherry-131215, operation #9248c436-4e41-4dab-911c-c1a8e073b99c
2022-04-25 17:22:18.097 UTC [main] INFO: starting safekeepers syncing
...
2022-04-25 17:22:18.197 UTC [main] INFO: getting basebackup@0/16C4450 from pageserver host=172.30.42.11 port=6400
...
2022-04-25 17:22:18.250 GMT [20] LOG: starting PostgreSQL 14.2 on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
...
2022-04-25 17:22:18.367 UTC [main] INFO: finished configuration of cluster #frosty-cherry-131215
...
2022-04-25 17:22:19.311 UTC [tokio-runtime-worker] INFO: serving /ready GET request
Main steps are faster, but notice the ~1 second gap before the first /ready request from the console! I think this is because we go to k8s for a list of nodes / entry points available to reach our compute, and k8s metadata requests can be really slow. I'll add more logs for now, but proper tracing would help a lot.
If anyone wants to get compute logs, you can use Grafana, but it's hard to pick out only the startup log there. I use kubectl (use Teleport to get access) for this:
kubectl logs deployment/compute-node-frosty-cherry-131215 compute-node | less
just replace frosty-cherry-131215 with the id of your running cluster. It works the same way locally, on staging, and in prod.
On second thought, it could be that the k8s Service is just getting ready, so the pod is already running but we cannot connect to it from the outside. I'm going to add better duration logging to the console: https://github.com/neondatabase/cloud/pull/1104
We'll try to do this in the second week of May.
This is an old Epic with a number of tasks closed. I suggest closing it and, if there are plans for improving things on the postgres/compute side, logging them separately. WDYT, @vadim2404?
Let me discuss it with @hlinnaka today.
Here's a fresh analysis. I captured a trace of a cold start, and annotated the important parts of it to give a high-level breakdown of where the time is spent.
The trace was captured from my local dev environment, so it's not fully representative of the real cloud setup. In particular, network roundtrips are much faster locally than in the cloud. I have compared this with partial traces from staging to ensure that this is nevertheless in the right ballpark. In the comments, I note a few places where I'm seeing a notable difference between local dev environment and staging.
This also illustrates how thorny the "long tail" of cold startups can be. If we do all the tricks listed here, we could probably get down to < 500 ms startup latency. But if any of the steps in the critical path take longer than usual, you miss the target. If you rely on caching to make any of the steps fast, but you have a cache miss, you miss the target.
The trace is from this invocation, when the compute was not yet running:
$ time psql "postgres://hlinnaka@ep-royal-bar-912840.localtest.me:4433/neondb" -c "select 1"
?column?
----------
1
(1 row)
real 0m5.305s
user 0m0.073s
sys 0m0.026s
TLS is negotiated using the Postgres protocol, and we read the project name from the SNI info in the handshake.
In this trace from my local environment, this only took about 6 ms, but in staging it typically takes about 50-100 ms. This does not include the TCP handshake between the proxy and the client; the timer is started only after the TCP connection has been established. This step is very sensitive to the network latency between the proxy and the client.
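For a rough client-side reproduction of just this step, a recent OpenSSL that supports -starttls postgres can drive the Postgres SSLRequest handshake, and we can time it (using the endpoint and port from the psql invocation above; note that, unlike the proxy-side timer, this measurement also includes the TCP handshake):

time openssl s_client -connect ep-royal-bar-912840.localtest.me:4433 -starttls postgres -servername ep-royal-bar-912840.localtest.me </dev/null

This is only a sketch for eyeballing handshake latency from different client locations, not what the proxy measures internally.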
What we can do:
The proxy makes an RPC call to the control plane to request the authentication information for the endpoint.
In this trace this was pretty quick, only about 4 ms, but in staging I've seen it take over 100 ms. This involves an RPC call from the proxy to the control plane, and the control plane runs two SQL queries in the console database.
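To see where such an RPC spends its time, curl's timing variables can split a single HTTP round trip into DNS, connect, and time-to-first-byte. The URL here is only a placeholder, not the real control plane API path:

curl -s -o /dev/null -w 'dns=%{time_namelookup} connect=%{time_connect} ttfb=%{time_starttransfer} total=%{time_total}\n' https://console.example.internal/get-auth-info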
What can we do:
This is the time it takes for the pod to start up and for compute_ctl to start executing, after the Kubernetes objects have been created.
This can be addressed by pre-created computes.
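As a rough way to measure this window for a concrete compute, the pod's creation timestamp can be compared with the container's startedAt (pod name is a placeholder; this span also includes scheduling and image pull, so it is only an approximation):

kubectl get pod <compute-pod-name> -o jsonpath='{.metadata.creationTimestamp} {.status.containerStatuses[0].state.running.startedAt}{"\n"}'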
The compute contacts all the safekeepers, copies any last missing part of the WAL between them, and determines the current end of the branch.
This could be eliminated if we ensured that the safekeepers are synchronized on graceful shutdown.
The compute requests a "basebackup" tarball from the pageserver.
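The compute log already brackets this step: the "getting basebackup" INFO line is followed by Postgres's own "starting PostgreSQL" line once the tarball has been fetched and unpacked, so pulling those two timestamps gives a rough upper bound on the fetch (deployment name is the example one from earlier):

kubectl logs deployment/compute-node-frosty-cherry-131215 compute-node | grep -E 'getting basebackup|starting PostgreSQL'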
Things we can do to speed this up:
This is the time it takes for the Postgres process to start up, and to start accepting connections.
This took over 700 ms in this trace for some reason, which is a lot more than what I'm typically seeing on staging. A typical value in staging is around 100-200 ms.
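For a baseline to compare that 700 ms against, a plain local Postgres of the same major version can be started and timed with pg_ctl (data directory and log file are placeholders; the compute image has a different binary and extension set, so this is only a sanity check, not an apples-to-apples comparison):

time pg_ctl -D /tmp/pg14-data -l /tmp/pg14.log -w start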
Things we can do:
In this period, compute_ctl opens a connection to Postgres so that it can run the handle_* steps.
This took about 200 ms in this trace.
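compute_ctl's actual retry logic is its own code; purely to illustrate what "wait until Postgres accepts connections" amounts to, a shell equivalent would be a pg_isready loop (host and port are placeholders):

until pg_isready -h 127.0.0.1 -p 55432 -t 1; do sleep 0.05; done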
Things we can do:
This step is consistently much slower than the other handle_* stages, which is a bit surprising.
Things we can do:
During this time, Postgres is fully started up, but the Kubernetes Endpoint object has not yet been created for the Service, so the Service cannot be reached by other pods. The control plane polls the compute_ctl's HTTP port until it can access it. That's what all the failing "HTTP computenode GET" requests below are.
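The polling itself is conceptually just a loop; here is a minimal sketch of what the control plane is effectively doing, with the pod address and compute_ctl HTTP port as placeholders and /ready as the endpoint seen in the startup log above:

until curl -sf http://<compute-pod-ip>:<compute-ctl-http-port>/ready; do sleep 0.1; done

Until the Endpoint object exists, these requests fail because the Service cannot be reached, which is what the failing "HTTP computenode GET" entries reflect.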
Things we can do:
Make Kubernetes faster?
Pre-create computes
Now the compute is fully up and running. The proxy can now establish the libpq connection from proxy to the compute pod.
Steps (3) and (9) above will be eliminated by pre-created computes. The rest needs other work; I've listed some ideas under each step.
There are some interesting challenges though:
(3) Pre-created computes
Probably the way to go for now. We have varying CPU/memory requirements for the projects, so a good scheduling decision has to be made that takes the currently available resources into account: we naturally can't schedule two projects with a gigabyte memory requirement each onto a node that has only a gigabyte of memory available. At the same time, computes on existing nodes can scale up, so I guess some sort of synchronization needs to happen, which then starts to take time. Otherwise it's difficult to achieve high resource utilization.
We now have working OpenTelemetry in staging again, so here's a capture of a cold start from staging:
Closing this one in favor of neondatabase/neon#4695.
This starts with the
and will continue with the optimisations that will come out of this, draft list: