neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0
14.49k stars 421 forks source link

Epic: Improve python test fixtures to share storage env between tests #9193

Open hlinnaka opened 22 hours ago

hlinnaka commented 22 hours ago

Motivation

Our python tests take a long time.

Implementation ideas

Currently, each python test spins up a new pageserver and safekeeper, just for the one test. That's relatively expensive, the test startup and teardown steps take 3-4 s for each test. That adds up when we have a lot of small tests. Many tests don't do any special configuration or such to the pageserver or safekeepers, however, and could share them.

Sharing the storage env requires some refactoring of the python fixtures and maybe neon_local.

This is not yet a full roadmap of how to get to the end goal of sharing the storage environment, but I've identified a bunch of smaller refactoring tasks that would move us in that direction:

[ ] Clean up NeonCli so that it's more honestly just a typed wrapper around the underlying neon_local commands. Currently, it fetches some values and defaults from the NeonEnv it's attached to. Move it to a separate source file, because neon_fixtures.py is huge.

[ ] Reduce the direct calls to neon_cli from individual tests. Add functions like 'create_tenant', 'create_branch' directly to NeonEnv or other python objects. Only tests that are specifically testing neon_local functionality should call NeonCli. Others don't really care how branches are created for example, and that should be reflected in the calls they make.

[ ] Previous refactoring makes it possible to replace the neon_local calls with direct pageserver or storage_controller API calls. That allows separating the storage environment from the compute environment

[ ] Split NeonStorageEnv off from NeonEnv. NeonStorageEnv would contain only the storage-related parts, and would e.g. not carry a "pg_version" at all, because the same storage can be used with different pg versions.

[ ] Currently, most tests start with env.init_start(), which creates the environment and one initial tenant. Split it into separate calls, to start the environment, and to create the initial tenant. Tests can share the storage environment can use a fixture where the environment is already started, and go straight to creating the initial tenant

hlinnaka commented 22 hours ago

cc @MMeent, we talked about this at the offsite

MMeent commented 5 hours ago

FYI, in my own optimization project in https://github.com/neondatabase/neon/tree/faster-ci, I've focused mostly on making Tenant and Timeline abstractions, and make tests operate on those concepts using APIs that don't modify the shared state of the underlying SK/PS environment.

The main reasoning behind that is that as long as the test doesn't modify the shared state of SKs/PSs, one test's misbehaviour shoudn't impact another test's PS/SK operations (and if it does, that's a good reason for failure).

Also note how my branch seems to increase test ergonomics quite significantly for tests that don't modify shared state, as you can operate on tenants and branches directly: Rather than having to keep track of the branch and tenant name and passing that through to API calls, the API makes it obvious which branch you're working on, and names are only metadata used to give things names in e.g. logging and file names.

hlinnaka commented 4 hours ago

FYI, in my own optimization project in https://github.com/neondatabase/neon/tree/faster-ci, I've focused mostly on making Tenant and Timeline abstractions, and make tests operate on those concepts using APIs that don't modify the shared state of the underlying SK/PS environment.

The main reasoning behind that is that as long as the test doesn't modify the shared state of SKs/PSs, one test's misbehaviour shoudn't impact another test's PS/SK operations (and if it does, that's a good reason for failure).

Yep, makes sense

Also note how my branch seems to increase test ergonomics quite significantly for tests that don't modify shared state, as you can operate on tenants and branches directly: Rather than having to keep track of the branch and tenant name and passing that through to API calls, the API makes it obvious which branch you're working on, and names are only metadata used to give things names in e.g. logging and file names.

+1