Open koivunej opened 1 year ago
Hey @koivunej, are you working on this task, or can I try to tackle it?
Thanks!
@andresrsanchez feel free to work on this. If there are any questions, feel free to ping me!
@koivunej still relevant?
This is very much relevant, but not easy.
Worth checking whether the S3 SDK now supports the new generation of checksum-passing APIs, which allows cheaply verifying checksums without reading the file twice. See the "Using trailing checksums" paragraph in https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html.
If this is ready, we can avoid storing checksums in the index part and rely on the SDK/S3 interaction. Though there is a concern that this is not supported by other S3-like storages; storing checksums in the index file is more portable.
@LizardWizzard once this is implemented by the sdk, could we just toggle the check for checksum validation?
https://docs.rs/aws-sdk-s3/latest/aws_sdk_s3/operation/put_object/builders/struct.PutObjectFluentBuilder.html#method.checksum_sha256 already exists in the most recent version (which we may or may not be using yet).
Looking at the docs for get_object, it's not entirely clear to me how to request the hashes.
What we'd need in the pageserver is then:

- record the sha2-256 during creation of DeltaLayer and ImageLayer
- offer this "file hash" when uploading
- ask for this "file hash" when downloading a layer

sha2-256 is the best we can do compared to what S3 supports. Not much has changed from my PR description.
Continued work from step 3 should probably wait for #4745.
The idea here is slightly different: the SDK will calculate the checksum during upload and pass it in an HTTP trailer, so this is transparent from the calling code's point of view.
The next question is whether we should validate checksums when we read files, regardless of the S3 stuff, and how to combine the two if needed.
> once this is implemented by the sdk, could we just toggle the check for checksum validation?
As I see it, the answer is yes.
Most recent idea that works with our current way of writing out layer files (write contents first, then seek back and fill in the summary): crc32, which is also supported by S3.
I've been protesting the addition of crc32, but aside from some custom merkle-tree-ish (blake3-style) contraptions, we cannot get a "file hash" (think "sha256sum file") while we write the file in two steps. crc32 supports this via crc32_combine, so it could work as "a stepping stone" to verify what we wrote until we eventually get the sha2-256 hash. The "stepping stone" could be either:
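The crc32_combine property is the crux: given crc(A), crc(B), and len(B), we can compute crc(A || B) without rereading A, which is what makes a multi-pass write hashable. A self-contained sketch following zlib's crc32_combine algorithm (bitwise CRC-32 with no lookup tables, for brevity; constants and structure are zlib's, not this codebase's):

```rust
/// Bitwise CRC-32 (IEEE, reflected polynomial), table-free for brevity.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xffff_ffffu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0xedb8_8320 } else { crc >> 1 };
        }
    }
    !crc
}

/// Multiply a 32x32 matrix over GF(2) by a 32-bit vector.
fn gf2_matrix_times(mat: &[u32; 32], mut vec: u32) -> u32 {
    let mut sum = 0;
    let mut i = 0;
    while vec != 0 {
        if vec & 1 != 0 {
            sum ^= mat[i];
        }
        vec >>= 1;
        i += 1;
    }
    sum
}

/// Square a GF(2) matrix.
fn gf2_matrix_square(square: &mut [u32; 32], mat: &[u32; 32]) {
    for n in 0..32 {
        square[n] = gf2_matrix_times(mat, mat[n]);
    }
}

/// Port of zlib's crc32_combine: crc(A || B) from crc(A), crc(B), len(B).
fn crc32_combine(mut crc1: u32, crc2: u32, mut len2: u64) -> u32 {
    if len2 == 0 {
        return crc1;
    }
    let mut even = [0u32; 32];
    let mut odd = [0u32; 32];
    // Operator for appending one zero bit to the CRC register.
    odd[0] = 0xedb8_8320;
    let mut row = 1u32;
    for n in 1..32 {
        odd[n] = row;
        row <<= 1;
    }
    gf2_matrix_square(&mut even, &odd); // two zero bits
    gf2_matrix_square(&mut odd, &even); // four zero bits
    // Append len2 zero bytes to crc1, squaring the operator each round.
    loop {
        gf2_matrix_square(&mut even, &odd);
        if len2 & 1 != 0 {
            crc1 = gf2_matrix_times(&even, crc1);
        }
        len2 >>= 1;
        if len2 == 0 {
            break;
        }
        gf2_matrix_square(&mut odd, &even);
        if len2 & 1 != 0 {
            crc1 = gf2_matrix_times(&odd, crc1);
        }
        len2 >>= 1;
        if len2 == 0 {
            break;
        }
    }
    crc1 ^ crc2
}

fn main() {
    // Standard CRC-32 check value for "123456789".
    assert_eq!(crc32(b"123456789"), 0xcbf4_3926);
    // Combining piecewise CRCs matches CRC-ing the concatenation:
    // each write pass can be CRC'd on its own and merged afterwards.
    let combined = crc32_combine(crc32(b"1234"), crc32(b"56789"), 5);
    assert_eq!(combined, crc32(b"123456789"));
    println!("ok");
}
```

In production we would of course use a crc32 crate rather than this bitwise loop; the point is only that the combine step exists and is cheap (logarithmic in len(B)).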
Follow-up to #2456, which started collecting index file metadata (just the size). Related to #987 but this isn't about page checksums but file verification with hashing.
Originally posted by @SomeoneToIgnore in https://github.com/neondatabase/neon/issues/2456#issuecomment-1248384538
My ideas so far:

- store the hash in IndexFileMetadata (extending the size-only metadata)
- S3 supports SHA2-256, which we already have as a dependency through sha2.