
Epic: Layer checksums #2683

Open koivunej opened 1 year ago

koivunej commented 1 year ago

Follow-up to #2456, which started collecting index file metadata (just the size). Related to #987, but this issue is about file verification with hashing, not page checksums.

  • file checksums (at least, as a placeholder to do the format changes once) to later fill with the values, compatible with what we have in AWS S3

Originally posted by @SomeoneToIgnore in https://github.com/neondatabase/neon/issues/2456#issuecomment-1248384538

My ideas so far:

S3 supports SHA2-256, which we already have as a dependency through the sha2 crate.
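
A minimal sketch of producing such a file hash with the sha2 crate; the path is a placeholder, and note that S3 expects the digest base64-encoded when it is passed alongside an upload:

```rust
// Stream a layer file through Sha256 to get the kind of "file hash"
// discussed here. A sketch, not pageserver code.
use sha2::{Digest, Sha256};
use std::{fs::File, io};

fn sha256_of_file(path: &str) -> io::Result<[u8; 32]> {
    let mut file = File::open(path)?;
    let mut hasher = Sha256::new();
    // Sha256 implements io::Write, so the file streams through the hasher
    // without being loaded into memory all at once.
    io::copy(&mut file, &mut hasher)?;
    Ok(hasher.finalize().into())
}
```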

andresrsanchez commented 1 year ago

Hey @koivunej, are you working on this task, or can I try to tackle it?

Thanks!

koivunej commented 1 year ago

@andresrsanchez feel free to work on this. If there are any questions, feel free to ping me!

shanyp commented 1 year ago

@koivunej is this still relevant?

koivunej commented 1 year ago

This is very much relevant, but not easy.

LizardWizzard commented 1 year ago

Worth checking whether the S3 SDK now supports the new generation of the checksum-passing API, which allows checksums to be verified cheaply without reading the file twice. See the "Using trailing checksums" section of https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html.

If that is ready, we can avoid storing checksums in the index part and rely on the SDK/S3 interaction. There is a concern, though, that this is not supported by other S3-like storages; storing checksums in the index file is more portable.

LizardWizzard commented 1 year ago

Some more context: https://github.com/awslabs/aws-sdk-rust/discussions/539#discussioncomment-2784356

shanyp commented 1 year ago

@LizardWizzard once this is implemented by the SDK, could we just toggle checksum validation on?

koivunej commented 1 year ago

https://docs.rs/aws-sdk-s3/latest/aws_sdk_s3/operation/put_object/builders/struct.PutObjectFluentBuilder.html#method.checksum_sha256 already exists in the most recent release (which we may or may not be using yet).

Looking at the docs for get_object, it's not entirely clear to me how to request the hashes.
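
For what it's worth, a hedged sketch of both directions with aws-sdk-s3; bucket and key are placeholders, module paths vary between SDK versions, and ChecksumMode::Enabled on get_object appears to be the way to ask for the stored hash:

```rust
// Sketch only: upload with a precomputed SHA-256 and request the stored
// checksum on download. Bucket/key are placeholders.
use aws_sdk_s3::{primitives::ByteStream, types::ChecksumMode, Client};

async fn put_with_sha256(
    client: &Client,
    body: ByteStream,
    digest_b64: String, // base64-encoded SHA-256 of the body
) -> Result<(), aws_sdk_s3::Error> {
    client
        .put_object()
        .bucket("some-bucket")
        .key("tenants/.../layer-file")
        .body(body)
        // S3 verifies the received body against this value.
        .checksum_sha256(digest_b64)
        .send()
        .await?;
    Ok(())
}

async fn get_stored_sha256(client: &Client) -> Result<Option<String>, aws_sdk_s3::Error> {
    let resp = client
        .get_object()
        .bucket("some-bucket")
        .key("tenants/.../layer-file")
        // Ask S3 to include the stored checksum in the response metadata.
        .checksum_mode(ChecksumMode::Enabled)
        .send()
        .await?;
    Ok(resp.checksum_sha256().map(str::to_owned))
}
```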

What we'd need in pageserver is then:

  1. record the sha2-256 during creation of DeltaLayer and ImageLayer
    • this cannot be done completely the way I'd prefer with the current format, which has a summary at the beginning, so the file isn't written in a single pass
    • the presence of ephemeral files gives us a way to have integrity errors before we rewrite the L0 files
    • unsure if single-pass writing plus ephemeral-file integrity is trivially ensurable; it might need non-file hash constructions like merkle trees
  2. offer this "file hash" when uploading
  3. ask for this "file hash" when downloading a layer
  4. one-shot generation of missing file hashes, done so that S3 wins: we wouldn't refuse a local file that has been changed or modified on local disk, but would prefer the S3 version (see the sketch after this list)
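
A hypothetical sketch of the "S3 wins" rule in step 4; every name here is invented for illustration, not taken from the pageserver:

```rust
// Invented types for illustration only.
enum Reconcile {
    KeepLocal,
    PreferS3, // re-download rather than refuse the layer
}

fn reconcile(local_hash: [u8; 32], remote_hash: Option<[u8; 32]>) -> Reconcile {
    match remote_hash {
        // One-shot backfill: no hash recorded remotely yet, adopt the local one.
        None => Reconcile::KeepLocal,
        Some(remote) if remote == local_hash => Reconcile::KeepLocal,
        // Local file changed on disk: S3 wins instead of erroring out.
        Some(_) => Reconcile::PreferS3,
    }
}
```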

sha2-256 is the best option among what S3 supports. Not much has changed since my PR description.

Continued work from step 3 onwards should probably wait for #4745.

LizardWizzard commented 1 year ago

  1. record the sha2-256 during creation of DeltaLayer and ImageLayer
  2. offer this "file hash" when uploading
  3. ask for this "file hash" when downloading a layer

The idea here is slightly different: the SDK will calculate the checksum during upload and pass it in an HTTP trailer, so this is transparent from the calling code's point of view.

The next question is whether we should validate checksums when we read files, regardless of the S3 side, and how to combine the two if needed.
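
A sketch of what that transparent path could look like, assuming the SDK's checksum_algorithm option on put_object (bucket and key are placeholders):

```rust
// Sketch: let the SDK compute the checksum and send it as an HTTP trailer;
// the caller never precomputes a digest.
use aws_sdk_s3::{primitives::ByteStream, types::ChecksumAlgorithm, Client};

async fn put_sdk_checksummed(
    client: &Client,
    body: ByteStream,
) -> Result<(), aws_sdk_s3::Error> {
    client
        .put_object()
        .bucket("some-bucket") // placeholder
        .key("tenants/.../layer-file") // placeholder
        .body(body)
        // Naming only the algorithm makes the SDK hash the body while
        // streaming it and append the value as a trailing checksum, which
        // S3 checks against what it actually received.
        .checksum_algorithm(ChecksumAlgorithm::Sha256)
        .send()
        .await?;
    Ok(())
}
```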

LizardWizzard commented 1 year ago

once this is implemented by the SDK, could we just toggle checksum validation on?

As I see it, the answer is yes.

koivunej commented 8 months ago

The most recent idea for working with our current way of writing out layer files (write the contents first, then seek back and fill in the summary): crc32, which is also supported by S3.
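
As a minimal sketch of why crc32 fits this two-pass pattern, assuming the crc32fast crate (its Hasher::combine plays the role of the crc32_combine operation discussed below):

```rust
// Hash the body (written first) and the summary (back-filled afterwards) as
// separate streams, then combine them into the CRC of the final file layout.
fn two_pass_crc(summary: &[u8], body: &[u8]) -> u32 {
    let mut body_crc = crc32fast::Hasher::new();
    body_crc.update(body); // pass 1: contents

    let mut file_crc = crc32fast::Hasher::new();
    file_crc.update(summary); // pass 2: summary filled in at the start

    // Appends the body's CRC after the summary's: CRC(summary ++ body).
    file_crc.combine(&body_crc);
    file_crc.finalize()
}

fn main() {
    let (summary, body) = (b"SUMMARY".as_slice(), b"...layer contents...".as_slice());

    // Single-pass reference over the final on-disk order.
    let mut whole = crc32fast::Hasher::new();
    whole.update(summary);
    whole.update(body);

    assert_eq!(two_pass_crc(summary, body), whole.finalize());
}
```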

I've been protesting the addition of crc32, but aside from some custom merkle-tree-ish (blake3-style) contraptions, we cannot get a "file hash" (think "sha256sum file") while we write the file in two steps. crc32 would support this via crc32_combine, as in the sketch above. crc32 could work as "a stepping stone": it lets us verify what we wrote now, and eventually leads to the sha2-256 hash. The "stepping stone" could be either: