Epic: Periodic service data integrity checks

SomeoneToIgnore commented 1 year ago

Motivation

For every timeline, we have multiple places where the data is stored:

pageserver local FS

/storage/pageserver/data/tenants/045343e817fec0fcfdbce86147309dd9/
├── config
├── timelines
│   └── 5843927c665e7452a8b5c530eb1aec57
│       ├── 000000000000000000000000000000000000-030000000000000000000000000000000002__0000000001696070-000000000174E389
│       ├── 000000000000000000000000000000000000-030000000000000000000000000000000002__000000000179C460
│       ├── 000000000000000000000000000000000000-030000000000000000000000000000000002__00000000017FEE08
│       ├── 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001801F41-0000000001803161
│       ├── 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001803161-0000000001805421
....
│       └── metadata
├── wal-redo-datadir/...
└── wal-redo-datadir.___temp/...

delta (000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001803161-0000000001805421) and image (000000000000000000000000000000000000-030000000000000000000000000000000002__000000000179C460) layer files
metadata file https://github.com/neondatabase/neon/blob/e52029309059ef2197aaff1a725deb4d42976fd2/pageserver/src/tenant/metadata.rs#L39-L50
ephemeral files https://github.com/neondatabase/neon/blob/e52029309059ef2197aaff1a725deb4d42976fd2/pageserver/src/tenant/ephemeral_file.rs#L1-L2
*.___temp files (almost every file and directory in pageserver was created as a tmp one and then moved into a regular one + temp walredo, initdb directories with intermediate data) https://github.com/neondatabase/neon/blob/e52029309059ef2197aaff1a725deb4d42976fd2/pageserver/src/lib.rs#L152

pageserver remote storage

000000000000000000000000000000000000-030000000000000000000000000000000002__0000000001696070-000000000174E389
000000000000000000000000000000000000-030000000000000000000000000000000002__000000000179C460
000000000000000000000000000000000000-030000000000000000000000000000000002__00000000017FEE08
000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001801F41-0000000001803161
000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001803161-0000000001805421
......
index_part.json

delta and image layer files (eventually consistent with local FS)
index_part.json that contains metadata file from local FS and more data about the layers. In particular, it contains a list of all layers, available for reading, to replace S3 ls command that might be more dangerous to query https://github.com/neondatabase/neon/blob/e52029309059ef2197aaff1a725deb4d42976fd2/pageserver/src/storage_sync/index.rs#L267-L278

safekeeper local FS

/storage/safekeeper/data/045343e817fec0fcfdbce86147309dd9/5843927c665e7452a8b5c530eb1aec57/
├── 000000010000000000000001.partial
└── safekeeper.control

local segments that contain WAL for streaming, may be incomplete

safekeepere remote storage

000000010000000000000001
000000010000000000000002
000000010000000000000003
000000010000000000000004
000000010000000000000005
...
0000000100000001000000AF
...

complete WAL segments, ready to be downloaded by safekeepers if they don't have the segments locally

Pageserver and safekeeper should be able to restore its state from their remote storages, additionally, safekeeper should be able to share its WAL with pageserver, so entire data is able to be restored just from safekeeper remote storage segments.

We need to ensure that always happens, for every project, once per some amount of time.

DoD

Define, what data we want to store and what is ok to loose. Pageserver has more mechanisms to export its data (fullbackup), there may be alternatives
Define data integrity criteria
- pageserver would need to match its index.json file paths with what's in the bucket
  - layer sizes (https://github.com/neondatabase/neon/pull/2582) and checksums would follow later
- safekeeper needs to ensure S3 has all layers (or all after a certain period)
- ??? we can compare local and remote files, if needed, e.g. local temp files should not stay for long time
We need to verify that we're able to restore from what we call a backup: https://github.com/neondatabase/neon/issues/2592 , manual errors on staging and other calamities might break our backups and we won't notice before we try them

Implementation ideas

It seems reasonable for this be a part of control plane or console or whatever else that manages the entire cloud unit, consisting of storage nodes, remote storages, etc.

### Tasks

Other related tasks and Epics

https://github.com/neondatabase/neon/issues/2605

SomeoneToIgnore commented 1 year ago

cc @petuhovskiy @kelvich @hlinnaka @lubennikovaav

kelvich commented 1 year ago

Should that be only about WAL (I'm talking about the header)? I was thinking more about general S3 + pageserver data integrity check.

SomeoneToIgnore commented 1 year ago

Changed the header, indeed it's about the data integrity (even though it mainly consists of WAL for our case)

SomeoneToIgnore commented 1 year ago

In addition to the checks mentioned above, we need to also ensure that the timeline being checked is not deleted: for that, a console API access is needed. Deleted timelines should not be attached on any pageservers or safekeepers (ergo, no in-memory and local FS data present), S3 data should be removed fully, or moved away into a cold storage for further, delayed deletion later.

Based on that, here's more detailed proposal on the approach.

New component

A new component has to be created, that would do the scraping. It has to be deployed in k8s similar to how broker is done: it should be able to access console, pageserver, safekeepers and their S3 buckets for the data scraping. The component has to access cloud API (worst case: cloud db (RO)), its own DB to store the scraping state (could be Neon).

The scraping part of the component at its initial phase is supposed to use HTTP(S) for management API queries, AWS S3 credentials to access the corresponding buckets, and libpq connections for data integrity checks. All storage files are checked by name and by contents to be ready for instant

full basebackup or any other user access such as get_page@Lsn
attach on another node
manual intervention if timeline data is not healthy
automatic predefined fixes applied, if available

Initial manual intervention should be needed to establish the first automatic fix rules, since operating on a potentially corrupt data files in automated way seems quite dangerous.

Due to the fact the component has to be aware of all PS and SK groups, their relations and related data, makes the component to be a good base for the control plane component in the future, but this is out of scope. Presumably, the component could benefit from getting pageservers' and safekeepers' shared code (such as TenantId, timeline data in bucket path resolution, etc.) so could be written in Rust, but that's not a hard requirement: if needed, as a proof of concept, only basic S3 and local file checks could be implemented in a form of a script, periodically run on a set of timelines.

Initial functionality

The only initial tasks of the component is to run periodic checks for every timeline (project branch) possible. They don't have to be frequent, but periodic; any request has the possibility to fail, hence the retries should be considered. The result should be displayed per timeline via HTTP query, returning JSON or HTML page.

First, the service should be able to build and periodically update the list of nodes and timelines: we should know their relations, related S3 bucket and prefixes inside (might require storage service adjustments), which timelines are removed.

Then, for every timeline, we should do a set of checks:

in-memory knowledge

Deleted in the console timeline should not be in memory of the storage components.

local files on every service: every service could be adjusted to remove this data straight from the FS Files should have names matching with the in-memory ones. Tombstone files should match the tenant/timeline status, which could be retrieved from the management API.

Deleted timeline should have no local files.

S3 files on every service's bucket Files should have expected name patterns and match the corresponding local files. Pageserver index files should be downloaded for future analysis, also its contents needs to be matched against the local files and S3 files.

It's possible to miss some of the files in the subsets, due to uploads/downloads happening, then the corresponding timeline status could be checked, or the timeline data files check itself retried soon.

Deleted timeline should have no S3 files in the "regular" place, they should be either removed or moved away into the cold storage with the postponed deletion.

data file contents: postgres amcheck could be used to verify data integrity of a compute started at particular Lsn, backed by the pageserver timeline layers that we check.

This operation is slow, (linear from the database data) and requires every used layer to be downloaded and read, so has to be run in the background on a separate pageserver, created for tests.

SomeoneToIgnore commented 1 year ago

Extra note: if the checks from the above, including postgres amcheck, are good and scale well enough, we could implement pageserver remote storage GC based on this operation: instead of removing any S3 layers on compaction or local GC, we could leave them all as is and delete (or, move into the cold storage for later eviction) the layers after backup integrity pageserver runs amcheck and downloads all needed layers locally. All not needed layers are left remote only and could be evicted, due to the way current on demand download works.

SomeoneToIgnore commented 1 year ago

More details on the checks planned to be done on the timeline:

Checks preparation

For every timeline, we need to know which tenant it relates to, and on which pageserver and safekepeer nodes its located. Every node's S3 bucket name should be also derived based on its metadata.

Console API (or database, if given) is able to provide a hierarchy of nodes and every non-deleted project (tenant id) on them, which non-deleted branches (timeline id) these projects have.

Timeline existence checks

Proof of concept, to see how applicable the whole idea is. Searches for the timelines that should no present in the storage and cleans them out.

Pageserver and safekeeper nodes have HTTP management APIs that should be used to list the timelines the nodes operate.

Tenants and timelines that are not present in the console data, should be checked against the console API for presence: current console HTTP API queries that allow resolving tenant/timeline ids into project/branch information (or its absence) need more optimisations.

Entities that are deleted or missing in the console, are considered for deletion.

The deletion should take care of S3 data first, to prevent other pageserver nodes from attaching the tenant that's deleted (same if the timeline is deleted). index_part.json could be renamed to forbid further attach attempts, and later clean the local data from every node.

Pageservers have detach/delete commands to remove tenant/timeline both from memory and local fs.

The check should consider https://github.com/neondatabase/neon/issues/3560 if that gets implemented before.

Quick timeline data inspection

For every non-deleted timeline, we are able to verify its data that we store and ensure it matches the cluster-wide expectations.

Currently, no similar checks are interesting for tenant scale: given that the previous check ensured the timelines in the tenant are active in the console, there's not much more data it has on safekeeper (none, iirc) and pageserver (there's a state enum, showing if the tenant is enabled or not, observed via the prometheus metrics; and a config file that's not yet fully designed). For the sake of brevity, tenants are not considered in this check.

On pageserver, the timeline has

a set of layers + metadata file in the FS
similar set of layers on the S3 + index_part.json file that also contains some metadata file contents
in-memory data about all layers that the timeline has in pageserver (LayerMap's historic layers)

Pageserver observes timeline layers in S3 via index_part.json file contents, and S3 list command might reveal more files.

The check should ensure the in-memory layer set and its metadata (layer file size currently, layer file checksum potentially in the future) match the one represented on local FS, and in S3: filenames and metadata should match, no extra or missing files exist in S3 and inside the index_part.json file. Local FS should contain all non-evicted layer files, metadata files and potentially some temporary files that are used for e.g. new layer downloads

Similarly, safekeepers have segment files, we should check which of these are located on S3 and which of them are locally. The local ones should be consequent and always be a subset of the ones on the S3, except for a temporary state where new segments are being uploaded to S3.

It seems reasonable to place such checks inside the nodes (at least, for the first version) since they have the code for parsing the layer/segment names and verifying them. The checks could be placed into HTTP API, yet they have to return layer/segment names and potentially other metadata for cluster-wide comparisons. For example, there are 3 safekeeper nodes per project currently, and neither of the nodes alone can compare its segment files with other nodes (there's a broker metadata exchange, yet it's limited it its message size and would not scale well as the segment file grow). Similarly, if any tenant gets attached to multiple pageserver nodes, they might overlap thei S3 writes and the checks should detect such cases.

Slow timeline data inspection

This part gets implemented after first two, since some external requirements are needed to be met.

There are two main checks to run here:

postgres amcheck

Shows data inconsistencies in the project, access entire layer set, needed for the current database state, forcing pageserver to verify all these layers and serve requests from it.

Needs Postgres RO nodes in console to spin external ones, can be run on real pageserver and safekeeper nodes (same what current console periodic checks do), but pageserver should be aware of the "check" nature of such requests.

ensure that safekeeper WAL is recoverable per timeline

Requires a separate pageserver to run on, needs S3 write sync for pageservers (not yet made).

Replays all safekepeer WAL from the beginning, does the consistency check.

LizardWizzard commented 1 year ago

Tenants and timelines that are not present in the console data, should be checked against the console API for presence:

Not sure I follow

The deletion should take care of S3 data first, to prevent other pageserver nodes from attaching the tenant that's deleted (same if the timeline is deleted).

It will need to stop local tenant first so no new files appear on s3? Also I'm hesitant to add destructive actions right away. Makes sense to make them warnings, alerts etc. So the broken state can be observed first before going forward with deletion.

NIT: It may be a good idea to change formatting of the comment to add more structure. Have a title with number for each check, rationale, steps (get X, assert A from node N equals A from node M, etc). Currently its not easy to jump through the text

SomeoneToIgnore commented 1 year ago

Not sure I follow

That's a bad formulation, but what I've wanted is to ensure that the timelines we want to delete are actually absent. The way I see this check could happen, we could select "non-deleted" projects & branches from console database, and consider every extra timeline on PS and SK nodes as "potentially deleted". Then, to ensure that it's actually not present in the DB or there, but deleted, we'd better query console (its DB) again and ensure, before deleting, that these potentially deleted timelines are not in the console for sure.

It will need to stop local tenant first so no new files appear on s3?

Good point, so we have to delete these timelines on the nodes first, and only then on S3. I agree, that some alerting should happen first, but the deletion itself is worth automating, to help with 500k stale projects that are out there now, waiting to be deleted.

LizardWizzard commented 1 year ago

is to ensure that the timelines we want to delete are actually absent

Ok, that makes sense.

to help with 500k stale projects that are out there now, waiting to be deleted.

Yeah, agree. I hope that this is a one time thing

SomeoneToIgnore commented 1 year ago

I've written some code to get a better understanding of the tasks we might want to split this into, and got up with the list of tasks that has more description inside.

So far, console seems to be a good place for initial checks at least: direct db RO access and autogenerated HTTP clients for pageserver and safekeeper nodes provide a good head start. IIRC there's no S3 manipulation library used in the project, so it might be yet another good time to discuss the S3 deletion & clean-up approach.

I'm not sure now, after writing the issues, if Quick timeline data inspection section from above is really needed, so omitted it for now.

Tasks

(needs console RO PG nodes first)

[ ] New console timeline periodic check, using postgres amcheck

LizardWizzard commented 1 year ago

(needs console RO PG nodes first)

There is one, or you need some specific setup?

SomeoneToIgnore commented 1 year ago

I might be lagging behind the news: either way, I'd start with the simplest checks and their validation: if they start to make sense, amcheck can be considered in a separate issue: one way or another, it needs another background check type with its infra.

LizardWizzard commented 1 year ago

I think I'm wrong here. You probably meant support for neon RO nodes in console, but I thought about RO access to console database. The second one is indeed available, but first is not

SomeoneToIgnore commented 1 year ago

Most of the logical pageserver checks are done in https://github.com/SomeoneToIgnore/s3-deleter and cleaned up

Current code disallows stray pageserver names and soon with https://github.com/neondatabase/neon/issues/3889 hopefully only future bugs and manual file creation can cause discrepancies.

What's left to revisit later are:

readonly pg node-based postgres amcheck Needs readonly nodes availability and maybe extra work on new timeline task kind scheduling
safekeeper WAL restoration for a timeline, which requires an extra, "fake" pageserver replica node for prolonged time, with S3 writes disabled. Needs more work on pageserver ergonomics and infra solutions for the other pageserver node: now pageserver is not running in k8s or vm autoscaler, so not that many convenient options available.

Neither of options is actionable right now, switched back to other issues. Possibly future S3 cleanup(s) would be needed, but we have some automation ready for that.

neondatabase / neon