oscar-project / ungoliant

:spider: The pipeline for the OSCAR corpus
https://oscar-corpus.com
Apache License 2.0
162 stars 14 forks source link

[Feature request] Secure against length extension attacks #108

Open chris-ha458 opened 1 year ago

chris-ha458 commented 1 year ago

Is your feature request related to a problem? Please describe. Per folder SHA256 hashes can be potentially vulnerable to length extension attacks.

Describe the solution you'd like

  1. Change hash function to one that is resistant to length extension attacks SHA-384 and SHA-512/256 exist. However the latter is difficult to canonicalize due to slash in the name (it is part of its official name). If speed or size is a concern, BLAKE3 is extremely fast and secure as well.
  2. Add filesize in bytes when writing the hash and filename This can work, and can be achieved with minimal code change.
  3. Both of the above. It would be nice to have file sizes with the checksum for verification purposes. Also, even if the new hash has a length extension attack found, it will still be secure.

Describe alternatives you've considered DO NOT

  1. use xxhash from the zstd file. I thought about this possibility since it is already present, but xxhash64 as used in zstd is a very fast hash function with minimal security guarantees. If we assume that somebody is manipulating the json.zst file and doing so with sufficient compute to actually launch a length extension attack this will not provide any further security.

Additional context I consider OSCAR as an important part of the Data pipelines supply chain so I this the bar i hope OSCAR can clear. I will be willing to further investigate potential hash functions, implementations and provide PRs if necessary.

Reference : https://github.com/oscar-project/documentation/issues/13

chris-ha458 commented 1 year ago

Some more info that I've found

  1. sha384, sha512, blake2b are all available through gnu coreutils [sha2 utilities] (https://www.gnu.org/software/coreutils/manual/html_node/sha2-utilities.html) and b2sum GNU coreutils are widely available in most if not all linux versions that are generally used, so it would be useful to use this program.
  2. sha-512/256 nor Blake3 is not available here.
  3. adding filesizes into the produced checksum file will make it incompatible with verification with the following command sha256sum -c or sha384sum -c as such, adding it would necessitate further consideration into downstream uses.

From these findings, I will prepare a miminum viable PR that would drop-in change sha256 with sha384 and all relevant documentation. Previous datasets would not be harmed. End users can continue to use sha256 to verify their datasets. Future users can use sha384 to verify their datasets.

An option to select hashes and a robust method to support it would be optimal, but beyond the scope of a "simple" drop in patch.

chris-ha458 commented 1 year ago

So a prepared a PR to change SHA256 into SHA384.

I have been wrong on some points though. Unless SHA256 is used as part of an HMAC construction, which it doesn't seem to be, the context of length extension attack vulnerability is invalid.

There is still value in changing into SHA384 since it could be the baseline for when HMAC might be implemented. Also, SHA384 is much faster than SHA256 due to the underlying SHA512 hash is more amenable to 64bit implementations.

1MiB input
Timing sha256sum... 10_000 iterations
real    0m23.134s
user    0m22.315s
sys     0m1.060s

Timing sha384sum...10_000 iterations
real    0m16.110s
user    0m15.372s
sys     0m1.031s
Uinelj commented 1 year ago

Thanks a lot for the PR!

We are (slowly) removing code that is not immediately linked to the pipeline from ungoliant, putting it in oscar-tools, but we still have duplicates here and there. The PR should be made on oscar-tools rather than on the main project, but IIRC the code is very similar so it shouldn't be hard to do :)

Code is here: https://github.com/oscar-project/oscar-tools/blob/main/src/ops/checksum.rs

chris-ha458 commented 1 year ago

It's very simple code and the main discussion point is whether to integrate it at all, rather than how. Since it seems like there is interest in integrating such code, I will prepare a new PR under oscar-tools