The checksum should only be stored after successful validation of candidates. It should then be checked before the PVF artifact is used to validate a candidate. If it differs, we recompile the artifact and then validate the candidate. If validation still fails after recompilation, we emit an error and stop validating with that artifact.

Any thoughts about this @s0me0ne-unkn0wn @alexggh ?
My first thought is that checksumming the entire PVF every time could prove expensive. However, I don't see any reason why we can't do it periodically and clean up the corrupted artifact; that way the validator recovers quickly if we hit such a condition and we don't pay the price of checksumming all the time.
Agreed there is overhead, but let’s measure it. Assuming nodes do at most 10-12 validations on average per RCB it shouldn’t be much overhead IMO.
> Agreed there is overhead, but let’s measure it. Assuming nodes do at most 10-12 validations on average per RCB it shouldn’t be much overhead IMO.
The largest Kusama PVF is around 50MiB (the smallest is 20MiB); SHA-1 over it takes around 50ms on reference hardware.
Given that most PVF executions on Kusama are below 500ms, that could be around 10% overhead.
For 10 validations per block that's an extra 500ms.
I wouldn't want to pay this price all the time just to fix this edge case; maybe we could check the checksum only for PVFs that fail validation, as a way to recover the node as fast as possible.
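For reference, a minimal sketch of how such a measurement could look, assuming the `sha1` crate (the artifact path is hypothetical):

```rust
// Rough timing of SHA-1 over a compiled PVF artifact (illustrative sketch only).
use sha1::{Digest, Sha1};
use std::time::Instant;

fn main() -> std::io::Result<()> {
    // ~50 MiB, roughly the size of the largest Kusama PVF mentioned above.
    let data = std::fs::read("/path/to/pvf_artifact.bin")?;

    let start = Instant::now();
    let mut hasher = Sha1::new();
    hasher.update(&data);
    let digest = hasher.finalize();
    println!("sha1: {:x?} in {:?}", digest, start.elapsed());
    Ok(())
}
```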
SHA-1 is quite expensive, wouldn't a good old CRC32 fit our use case? It is great at detecting accidental bit flips in network or storage devices. It won't protect against intentional changes, but we don't care about that. I like the trade-off.
> SHA-1 is quite expensive, wouldn't a good old CRC32 fit our use case? It is great at detecting accidental bit flips in network or storage devices. It won't protect against intentional changes, but we don't care about that. I like the trade-off.
Checked the performance of https://docs.rs/crc-catalog/latest/crc_catalog/algorithm/constant.CRC_32_BZIP2.html & https://docs.rs/crc-catalog/latest/crc_catalog/algorithm/constant.CRC_32_CKSUM.html
I'm a bit surprised, but on this 50MiB file it actually performs worse than SHA-1: it takes around 100ms.
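The same kind of measurement for CRC32 could look roughly like this, assuming the `crc` crate (which re-exports the `crc-catalog` algorithm definitions); the artifact path is again hypothetical:

```rust
// Illustrative timing of the two CRC32 variants mentioned above.
use crc::{Crc, CRC_32_BZIP2, CRC_32_CKSUM};
use std::time::Instant;

fn main() -> std::io::Result<()> {
    let data = std::fs::read("/path/to/pvf_artifact.bin")?; // hypothetical path

    let bzip2 = Crc::<u32>::new(&CRC_32_BZIP2);
    let start = Instant::now();
    println!("CRC_32_BZIP2: {:#010x} in {:?}", bzip2.checksum(&data), start.elapsed());

    let cksum = Crc::<u32>::new(&CRC_32_CKSUM);
    let start = Instant::now();
    println!("CRC_32_CKSUM: {:#010x} in {:?}", cksum.checksum(&data), start.elapsed());

    Ok(())
}
```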
I think we had an issue for this already and the idea to not pay the overhead on the happy path was:
- Just run it; if it fails, only raise a dispute after checking the checksum.
- If it failed and the checksum was wrong: well, clean up that mess and issue a big fat warning in the logs.
Yeah, this is more efficient. However, I am surprised by the CRC32 results.
> I am surprised by the CRC32 results.

I've noticed remarks that CRC32 winds up slow in practice.
> Just run it; if it fails, only raise a dispute after checking the checksum.
Yes, this makes sense.
We're likely happy to lower latency here at the cost of having all CPU cores work hard on it, given we're only running the check once validation fails, right?
I'd think Blake3 checks the boxes well enough: it's extremely fast thanks to being a Merkle tree, at the cost of using all available CPU cores. We do not need a cryptographic hash for disk corruption, but who knows, maybe something stranger becomes possible with compiler toolchains.
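For illustration, multi-core hashing with the `blake3` crate could look roughly like this (assuming the crate's `rayon` feature is enabled; the path is hypothetical):

```rust
// Multi-threaded BLAKE3 over the artifact; illustrative only, not the node's actual code path.
use std::time::Instant;

fn main() -> std::io::Result<()> {
    let data = std::fs::read("/path/to/pvf_artifact.bin")?; // hypothetical path

    let start = Instant::now();
    let mut hasher = blake3::Hasher::new();
    // The Merkle-tree structure lets this split the input across all available cores.
    hasher.update_rayon(&data);
    let hash = hasher.finalize();
    println!("blake3: {} in {:?}", hash.to_hex(), start.elapsed());
    Ok(())
}
```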
There was a closely related discussion in #3139. I remember Jan saying that the blake3 hasher throughput should be more than enough for any practical purpose in our case.
However, the "execute, and if it fails, check the checksum" approach makes perfect sense to me.
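To make the approach concrete, here is a rough sketch of the control flow being discussed; every name in it (`execute_pvf`, `hash_artifact`, `Outcome`, etc.) is a hypothetical stand-in, not an actual polkadot-sdk API:

```rust
// Hedged sketch of "only check the artifact when execution fails".
// All helpers and types here are hypothetical stand-ins for illustration.

enum Outcome {
    Valid,
    // Genuine validation failure: safe to raise a dispute.
    Invalid,
    // Checksum mismatch: clean up / recompile and warn loudly instead of disputing.
    CorruptedArtifact,
}

fn validate_with_integrity_check(artifact: &[u8], expected: [u8; 32]) -> Outcome {
    if execute_pvf(artifact).is_ok() {
        // Happy path: no checksum overhead at all.
        return Outcome::Valid;
    }
    // Execution failed: only now pay for hashing the artifact.
    if hash_artifact(artifact) != expected {
        return Outcome::CorruptedArtifact;
    }
    Outcome::Invalid
}

// Hypothetical helpers, included only to keep the sketch self-contained.
fn execute_pvf(_artifact: &[u8]) -> Result<(), ()> {
    Ok(())
}

fn hash_artifact(artifact: &[u8]) -> [u8; 32] {
    *blake3::hash(artifact).as_bytes()
}

fn main() {
    // Example call with dummy data; in reality `expected` would come from the artifact store.
    let artifact = vec![0u8; 1024];
    let expected = *blake3::hash(&artifact).as_bytes();
    match validate_with_integrity_check(&artifact, expected) {
        Outcome::Valid => println!("candidate valid"),
        Outcome::Invalid => println!("raise a dispute"),
        Outcome::CorruptedArtifact => println!("recompile the artifact, do not dispute"),
    }
}
```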
closing in favor of https://github.com/paritytech/polkadot-sdk/issues/677
... related to https://github.com/paritytech/polkadot-sdk/issues/5413#issuecomment-2304141788