gilbertchen / duplicacy

A new generation cloud backup tool
https://duplicacy.com

Enhance the Wasabi backend to perform MD5 verification at the time of backup #455

Open skidvd opened 6 years ago

skidvd commented 6 years ago

This enhancement request is ultimately for the Wasabi backend, but it is understood that most of that implementation is currently shared with the S3 backend (since Wasabi is reported to be 100% S3-compatible).

https://aws.amazon.com/premiumsupport/knowledge-center/data-integrity-s3/ describes the MD5-related options/capabilities. Excerpt: "To ensure that S3 verifies the integrity of the object and stores the MD5 checksum in a custom HTTP header you must use the --content-md5 and --metadata arguments with the appropriate parameters."

This request is to provide the capability to (perhaps optionally) use these MD5 hash features directly, as an integral part of the Duplicacy backup operation itself (not a subsequent follow-on operation such as restore or check -v). The backup would verify the integrity of each stored file/chunk by ensuring that the Wasabi store receives and confirms the same MD5 hash that was computed from the local repository file/chunk before transmission. The goal is to catch any transmission, storage, or other external errors immediately during the backup operation, fully verifying that each local file/chunk was transmitted and stored successfully. This would increase confidence in the backup and surface errors at backup time, rather than leaving potential corruption to be discovered during a later restore or check -v.
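To illustrate the kind of behavior being requested, here is a minimal sketch (not Duplicacy's actual S3 backend code) of server-side MD5 verification at upload time using the AWS Go SDK against a Wasabi endpoint. The bucket, key, endpoint, and chunk contents are placeholders; an S3-compatible store that receives a Content-MD5 header rejects the PUT (BadDigest) if the payload it received does not match.

```go
package main

import (
	"bytes"
	"crypto/md5"
	"encoding/base64"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// uploadWithMD5 uploads a chunk and asks the server to verify its MD5.
// If the payload is corrupted in transit, the store rejects the PUT
// instead of silently storing bad data.
func uploadWithMD5(svc *s3.S3, bucket, key string, chunk []byte) error {
	sum := md5.Sum(chunk)
	contentMD5 := base64.StdEncoding.EncodeToString(sum[:]) // Content-MD5 is base64, not hex

	_, err := svc.PutObject(&s3.PutObjectInput{
		Bucket:     aws.String(bucket),
		Key:        aws.String(key),
		Body:       bytes.NewReader(chunk),
		ContentMD5: aws.String(contentMD5),
	})
	return err
}

func main() {
	// Placeholder endpoint/region; a real configuration would also supply credentials.
	sess := session.Must(session.NewSession(&aws.Config{
		Region:   aws.String("us-east-1"),
		Endpoint: aws.String("https://s3.wasabisys.com"),
	}))
	svc := s3.New(sess)

	if err := uploadWithMD5(svc, "my-bucket", "chunks/abc123", []byte("chunk data")); err != nil {
		log.Fatal(err)
	}
	fmt.Println("chunk stored; server confirmed the MD5 digest")
}
```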

markfeit commented 6 years ago

The implementation of the Wasabi backend is a straight pass-through of all methods other than MoveFile to their counterparts in the S3 backend. A quick skim of the AWS library underpinning the S3 backend shows that it adds Content-MD5 and X-Amz-Content-Sha256 headers to every request unless explicitly disabled. This could be verified by sniffing the exchange between Duplicacy and Wasabi and looking for those headers, but I think the bottom line is that no action is required.
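If anyone wants to confirm this without packet sniffing, one option (a sketch, not something Duplicacy does today) is to wrap the HTTP transport handed to the AWS client and log the integrity headers on each outgoing request:

```go
package main

import (
	"log"
	"net/http"
)

// headerLogger is a RoundTripper that prints the integrity-related headers
// on every outgoing request, as an in-process alternative to packet sniffing.
type headerLogger struct {
	next http.RoundTripper
}

func (h headerLogger) RoundTrip(req *http.Request) (*http.Response, error) {
	log.Printf("%s %s Content-MD5=%q X-Amz-Content-Sha256=%q",
		req.Method, req.URL.Path,
		req.Header.Get("Content-MD5"),
		req.Header.Get("X-Amz-Content-Sha256"))
	return h.next.RoundTrip(req)
}

// newLoggingClient builds an *http.Client that can be passed to the AWS
// session (e.g. aws.Config{HTTPClient: client}) so every request is logged.
func newLoggingClient() *http.Client {
	return &http.Client{Transport: headerLogger{next: http.DefaultTransport}}
}

func main() {
	_ = newLoggingClient() // wire into the backend's HTTP client to inspect headers
}
```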

It's worth pointing out that even without this, your data is in good hands the whole way:

Message integrity checks (MICs) are already being done as checksums in the Ethernet frames on your local network and in each TCP packet on the way to the HTTP server at the far end. Corruption is still a possibility, but it is vanishingly rare. Content-MD5 is an application-layer MIC for the data in transit and only guarantees that the payload is checked upon receipt. It's a fairly safe assumption that Wasabi's servers use error-correcting memory to hold the data they receive, as a way to mitigate bit flips induced by alpha particles. The SHA-256 is a secondary, more stringent check done at the application layer to verify that the payload arrived intact.
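For reference, this small sketch shows what those two headers actually carry for a given request body (the payload here is just a placeholder):

```go
package main

import (
	"crypto/md5"
	"crypto/sha256"
	"encoding/base64"
	"encoding/hex"
	"fmt"
)

func main() {
	payload := []byte("example chunk payload")

	// Content-MD5: base64-encoded 128-bit MD5 of the request body (RFC 1864).
	md5Sum := md5.Sum(payload)
	fmt.Println("Content-MD5:          ", base64.StdEncoding.EncodeToString(md5Sum[:]))

	// X-Amz-Content-Sha256: lowercase hex SHA-256 of the body, used by
	// Signature Version 4 so the payload is covered by the request signature.
	shaSum := sha256.Sum256(payload)
	fmt.Println("X-Amz-Content-Sha256: ", hex.EncodeToString(shaSum[:]))
}
```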

I have some professional experience developing storage systems, and erasure coding stored data for durability is standard practice. Reed-Solomon is the usual method; Backblaze recently open-sourced its Reed-Solomon implementation and has a good video explaining how it works. Again, it's a very safe bet that this is in use at Wasabi. They're also doing integrity checks of every object every 90 days and rebuilding those with problems. I haven't done the math to figure out if their claim of eleven nines' worth of durability holds water, but given what they're likely doing internally, it does pass the straight-face test.