paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.com/

PVF: consider adding a checksum for artifacts #5441

Closed: sandreim closed this issue 2 months ago

sandreim commented 2 months ago

... related to https://github.com/paritytech/polkadot-sdk/issues/5413#issuecomment-2304141788

The checksum should only be stored after successful validation of candidates. It should then be checked before the PVF artifact is used to validate a candidate. If it differs, we recompile the artifact and then validate the candidate. If the validation still fails after recompilation, we emit an error and stop validating with that artifact.
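
A rough sketch of that flow (all helper names here are illustrative placeholders, not the actual polkadot-sdk API):

```rust
use std::{fs, path::Path};

// Placeholder for whatever checksum ends up being chosen (CRC32, blake3, ...).
fn checksum(bytes: &[u8]) -> u64 {
    bytes.iter().fold(0u64, |acc, b| acc.wrapping_mul(31).wrapping_add(*b as u64))
}

fn validate_with_artifact(path: &Path, stored_checksum: u64) -> Result<(), String> {
    let mut bytes = fs::read(path).map_err(|e| e.to_string())?;
    if checksum(&bytes) != stored_checksum {
        // Mismatch: the artifact was corrupted on disk, so recompile it and retry once.
        recompile_artifact(path)?;
        bytes = fs::read(path).map_err(|e| e.to_string())?;
    }
    // If validation still fails after recompilation, the caller stops using this artifact.
    execute_candidate(&bytes)
}

// Stubs so the sketch compiles.
fn recompile_artifact(_path: &Path) -> Result<(), String> { Ok(()) }
fn execute_candidate(_bytes: &[u8]) -> Result<(), String> { Ok(()) }
```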

Any thoughts about this @s0me0ne-unkn0wn @alexggh ?

alexggh commented 2 months ago

Any thoughts about this @s0me0ne-unkn0wn @alexggh ?

My first thought is that checksumming the entire PVF every time could prove expensive; however, I don't see any reason why we can't do it periodically and clean up the corrupted artifact. That way the validator recovers quickly if we hit such a condition, and we don't pay the price of checksumming all the time.

sandreim commented 2 months ago

Agreed there is overhead, but let's measure it. Assuming nodes do at most 10-12 validations on average per RCB, it shouldn't be much overhead IMO.

alexggh commented 2 months ago

Agreed there is overhead, but let's measure it. Assuming nodes do at most 10-12 validations on average per RCB, it shouldn't be much overhead IMO.

The largest Kusama PVF is around 50 MiB (the smallest is 20 MiB). SHA-1 over it on reference hardware seems to take around 50 ms; given most PVF executions on Kusama are below 500 ms, that could be around 10% overhead. For 10 validations per block that's an extra 500 ms.
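
For reference, a measurement along those lines could be reproduced with a small harness like the one below (assuming the RustCrypto `sha1` crate, e.g. `sha1 = "0.10"`; the artifact path is passed as the first argument):

```rust
use sha1::{Digest, Sha1};

fn main() -> std::io::Result<()> {
    let path = std::env::args().nth(1).expect("path to a cached PVF artifact");
    let data = std::fs::read(path)?;
    let start = std::time::Instant::now();
    let digest = Sha1::digest(&data);
    // Print the first digest bytes and the elapsed time.
    println!("sha1 over {} bytes: {:02x?}... in {:?}", data.len(), &digest[..4], start.elapsed());
    Ok(())
}
```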

I wouldn't want to pay this price all the time to fix this edge case; maybe we could check it only for PVFs that fail validation, as a way to recover the node as fast as possible.

sandreim commented 2 months ago

SHA-1 is quite expensive; wouldn't a good old CRC32 fit our use case? It is great at detecting accidental bit flips in network or storage devices. It won't protect against intentional changes, but we don't care about that here. I like the trade-off.

eskimor commented 2 months ago

I think we already had an issue for this, and the idea for not paying the overhead on the happy path was:

  1. Just run it - if it fails, only raise a dispute after checking the checksum.
  2. If it failed and the checksum was wrong: well, clean up that mess and issue a big fat warning in the logs.
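
A minimal sketch of that lazy check (the types and the checksum function are placeholders, not existing code in this repo):

```rust
enum Action { RecompileAndRetry, RaiseDispute }

// Placeholder for recomputing the checksum stored at preparation time.
fn artifact_checksum(_bytes: &[u8]) -> u32 { 0 }

fn on_execution_failure(artifact: &[u8], stored_checksum: u32) -> Action {
    if artifact_checksum(artifact) != stored_checksum {
        // Disk corruption: clean up and recompile instead of disputing, with a loud warning.
        eprintln!("WARNING: PVF artifact corrupted on disk, recompiling");
        Action::RecompileAndRetry
    } else {
        // The artifact is intact, so the execution failure is genuine and worth a dispute.
        Action::RaiseDispute
    }
}
```
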
alexggh commented 2 months ago

SHA-1 is quite expensive; wouldn't a good old CRC32 fit our use case? It is great at detecting accidental bit flips in network or storage devices. It won't protect against intentional changes, but we don't care about that here. I like the trade-off.

I checked the performance of https://docs.rs/crc-catalog/latest/crc_catalog/algorithm/constant.CRC_32_BZIP2.html and https://docs.rs/crc-catalog/latest/crc_catalog/algorithm/constant.CRC_32_CKSUM.html

I'm a bit surprised, but on this 50 MiB file it actually performs worse than SHA-1: it takes around 100 ms.
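
For reference, a comparison like this could be run with the `crc` crate (e.g. `crc = "3"`), which re-exports the linked crc-catalog algorithms:

```rust
use crc::{Crc, CRC_32_BZIP2, CRC_32_CKSUM};

fn main() -> std::io::Result<()> {
    let data = std::fs::read(std::env::args().nth(1).expect("path to a cached PVF artifact"))?;
    for (name, alg) in [("CRC_32_BZIP2", &CRC_32_BZIP2), ("CRC_32_CKSUM", &CRC_32_CKSUM)] {
        let crc = Crc::<u32>::new(alg);
        let start = std::time::Instant::now();
        let sum = crc.checksum(&data);
        println!("{name}: {sum:#010x} in {:?}", start.elapsed());
    }
    Ok(())
}
```

One possible explanation for CRC32 losing here is that the crate's default implementation is a byte-at-a-time lookup table rather than a SIMD- or hardware-accelerated one, while optimized SHA-1 implementations process much more data per cycle.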

sandreim commented 2 months ago

I think we already had an issue for this, and the idea for not paying the overhead on the happy path was:

  1. Just run it - if it fails, only raise a dispute after checking the checksum.
  2. If it failed and the checksum was wrong: well, clean up that mess and issue a big fat warning in the logs.

Yeah, this is more efficient. However, I am surprised by the CRC32 results.

burdges commented 2 months ago

I am surprised by the CRC32 results.

I've noticed remarks that CRC32 winds up slow in practice.

Just run it - if it fails, only raise a dispute after checking the checksum.

Yes, this makes sense.

We're likely happy to lower latency here even if all CPU cores work hard on it, given we're only running the check once validation fails, right?

I'd think Blake3 ticks the boxes well enough: it's extremely fast thanks to being a Merkle tree, at the cost of using all available CPU cores. We do not need a cryptographic hash for disk corruption, but who knows, maybe something stranger becomes possible with compiler toolchains.
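
A sketch of that (assuming the `blake3` crate with its optional `rayon` feature enabled, e.g. `blake3 = { version = "1", features = ["rayon"] }`):

```rust
fn main() -> std::io::Result<()> {
    let data = std::fs::read(std::env::args().nth(1).expect("path to a cached PVF artifact"))?;
    let start = std::time::Instant::now();
    let mut hasher = blake3::Hasher::new();
    // update_rayon splits the input across all available cores, exploiting the Merkle-tree structure.
    hasher.update_rayon(&data);
    let hash = hasher.finalize();
    println!("blake3: {} in {:?}", hash.to_hex(), start.elapsed());
    Ok(())
}
```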

s0me0ne-unkn0wn commented 2 months ago

There was a closely related discussion in #3139. I remember Jan saying that the blake3 hasher's throughput should be more than enough for any practical purpose in our case. However, the "execute, and if it fails, check the checksum" approach makes perfect sense to me.

sandreim commented 2 months ago

closing in favor of https://github.com/paritytech/polkadot-sdk/issues/677