neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

Epic: storage controller (née sharding service) #6342

Open jcsp opened 9 months ago

jcsp commented 9 months ago

Motivation

Enable deploying pageserver sharding into production.

Develop the code from https://github.com/neondatabase/neon/pull/6251 into a service we can deploy.

DoD

Implementation ideas

### Tasks to be able to deploy + use in staging
- [x] https://github.com/neondatabase/neon/pull/6468
- [ ] https://github.com/neondatabase/neon/pull/6471
- [ ] https://github.com/neondatabase/neon/pull/6394
- [ ] https://github.com/neondatabase/cloud/issues/9718
### Tasks to be production ready
- [x] Embed migrations in binary for ease of deployment
- [x] DB Connection pooling in persistence.rs
- [x] Clean up logs (spans etc)
- [x] Make scheduler more scalable (don't re-construct its state for every request that uses it)
- [ ] https://github.com/neondatabase/neon/issues/6847
- [ ] https://github.com/neondatabase/neon/issues/6876
- [x] Implement shard splitting (via https://github.com/neondatabase/neon/issues/6278)
- [x] Background schedule/reconcile to retry anything that has previously failed
- [x] Retry policy for HTTP client (e.g. handle 503s from /location_config)
- [ ] https://github.com/neondatabase/neon/issues/6844
- [ ] https://github.com/neondatabase/neon/issues/6878
- [ ] https://github.com/neondatabase/neon/issues/6875
- [x] Add observability API for tenants sufficient to implement "describe" CLI that shows most recent status/error for a tenant shard.
- [ ] https://github.com/neondatabase/neon/issues/7103
- [x] https://github.com/neondatabase/neon/pull/7114
- [ ] https://github.com/neondatabase/cloud/issues/10625
- [x] https://github.com/neondatabase/neon/pull/7088
- [x] Ensure helm chart isn't using rolling upgrades, to reduce risk of split brain
- [ ] https://github.com/neondatabase/neon/issues/7388
- [ ] https://github.com/neondatabase/neon/issues/7463
- [ ] https://github.com/neondatabase/neon/issues/6877
- [ ] https://github.com/neondatabase/neon/issues/6824
- [ ] Stress testing (integration test). Similar to location_conf_churn but for this service.
- [ ] Chaos self-testing mode (for enabling in staging). Background task that does arbitrary migrations, node drains, node failures, etc.
- [ ] Timeline creation/deletion vs. Reconciler in flight: must not send a request to an old node if a new node attach is in flight
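
To illustrate the HTTP retry-policy item in the list above (handling 503s from /location_config), here is a minimal Rust sketch of bounded retries with capped exponential backoff. It is not the storage controller's actual code: `call_with_retries`, `backoff_delay`, `is_retryable`, and the choice to retry only 429 and 503 are assumptions made for this example, and the HTTP request is stubbed as a closure that returns a status code on failure.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical result of one HTTP call: Ok(body) or Err(HTTP status code).
type CallResult = Result<String, u16>;

/// Assumption for this sketch: only 429 and 503 are treated as transient.
fn is_retryable(status: u16) -> bool {
    matches!(status, 429 | 503)
}

/// Delay before the retry that follows failed attempt `attempt` (0-based):
/// base * 2^attempt, capped at `max`.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(max)
}

/// Calls `call` until it succeeds, fails with a non-retryable status, or
/// `max_attempts` calls have been made. At least one call is always made.
fn call_with_retries<F>(
    mut call: F,
    max_attempts: u32,
    base: Duration,
    max: Duration,
) -> CallResult
where
    F: FnMut() -> CallResult,
{
    let mut attempt: u32 = 0;
    loop {
        match call() {
            Ok(body) => return Ok(body),
            // Transient failure with attempts remaining: sleep, then retry.
            Err(status) if is_retryable(status) && attempt + 1 < max_attempts => {
                thread::sleep(backoff_delay(attempt, base, max));
                attempt += 1;
            }
            // Non-retryable status, or attempts exhausted: surface the error.
            Err(status) => return Err(status),
        }
    }
}
```

A caller would wrap the real request in the closure, e.g. `call_with_retries(|| send_location_config(), 5, Duration::from_millis(100), Duration::from_secs(2))`, where `send_location_config` is a hypothetical helper. A production version would likely be async and add jitter; this sketch omits both to stay on the standard library.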
### Miscellaneous/tech debt backlog
- [x] Add a "prod mode" that will refuse to run if auth isn't enabled (https://github.com/neondatabase/neon/pull/6585#discussion_r1476116622) (https://github.com/neondatabase/neon/pull/7105)
- [x] ~Put LocalEnv-using stuff behind a cfg(testing) macro~ We can't -- neon_local would break for anyone not using --testing
- [x] Ensure that when updating tenant conf via location config API, we don't spuriously bump generations
- [ ] https://github.com/neondatabase/neon/issues/7107
- [ ] https://github.com/neondatabase/neon/issues/7108
- [ ] https://github.com/neondatabase/neon/issues/6896
- [ ] Ensure that tenant config Duration/String fields are formatted consistently, to avoid spurious reconciliations (https://github.com/neondatabase/neon/pull/6329#discussion_r1450566336).
- [ ] Revisit delete API behavior: control plane retries delete until 404 (goapp/internal/client/psclient/httppageserver/httppageserver.go), so we can do away with the wrapping of retries in the storage controller if we like
- [ ] Once we have embedded migrations, make the helm chart work with a default values.yaml and remove `--excluded-charts` (see thread on https://github.com/neondatabase/helm-charts/pull/61)
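
For the delete API item in the list above, the caller-side pattern ("retry delete until 404") can be sketched as follows. This is an illustration only, not the control plane's or the storage controller's code: `delete_until_gone`, `DeleteOutcome`, and the rule that any status other than 404 means "try again" are assumptions made for the example, and the DELETE request is stubbed as a closure returning a status code.

```rust
/// Outcome of the caller-side delete loop (hypothetical type).
#[derive(Debug, PartialEq)]
enum DeleteOutcome {
    /// A 404 was observed on request number `requests`: the resource is gone.
    Gone { requests: u32 },
    /// The request budget ran out before a 404 was seen.
    /// `last_status` is 0 if no request was sent at all.
    GaveUp { last_status: u16 },
}

/// Sends DELETE repeatedly until the server answers 404 or `max_requests`
/// requests have been sent. Assumption: every non-404 status is retried.
fn delete_until_gone<F>(mut send_delete: F, max_requests: u32) -> DeleteOutcome
where
    F: FnMut() -> u16,
{
    let mut last_status = 0;
    for requests in 1..=max_requests {
        last_status = send_delete();
        if last_status == 404 {
            return DeleteOutcome::Gone { requests };
        }
        // A real caller would sleep or back off here before retrying.
    }
    DeleteOutcome::GaveUp { last_status }
}
```

Because the caller already loops until it sees a 404, a service in the middle can forward each delete once and return the downstream status, which is the simplification the item proposes.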

Other related tasks and Epics

jcsp commented 8 months ago

Status:

kelvich commented 7 months ago

Storage controller is deployed on prod us-east-1. Teleport RDS connection is there, but manual.