oconnor663 / blake2_simd

high-performance implementations of BLAKE2b/s/bp/sp in pure Rust with dynamic SIMD
MIT License

very slow using same way of reading file (blake seconds) vs ( sha256 ms) #22

Closed nbari closed 4 years ago

nbari commented 4 years ago

Hi, this is the code I am using for testing: https://github.com/s3m/sandbox/blob/master/rust/blake2/src/main.rs. I don't know exactly what I could be doing wrong, but for some reason blake is running very slow:

//  let mut hasher = blake2s_simd::State::new();
6deef7c846545544274e423b7d5bdfbf653cc4ad478a1489df056ee8c84dac47
Elapsed: 17.00s

// let mut context = Context::new(&SHA256);
dca3b9746da896f05072bdec6b788513029b26ab453b82e2e9d4365e56e2c913
Elapsed: 260.58ms

The file I am using https://github.com/s3m/sandbox/blob/master/dataset/wine.json (<80MB)

Any idea what I could be doing wrong?

oconnor663 commented 4 years ago

Are you sure you're benchmarking release mode? As in cargo run --release? The default cargo run is in debug mode, which will tank performance for Rust crates like this one. The ring crate has a lot of non-Rust code in it, which could explain why it's not affected to the same extent.

As an aside, any particular reason you're calling fill_buf and consume explicitly, instead of just using BufReader::read? Or for that matter, instead of std::io::copy?

nbari commented 4 years ago

Hi @oconnor663, cargo run --release makes a huge difference (double the speed) 💯:

Blake
6deef7c846545544274e423b7d5bdfbf653cc4ad478a1489df056ee8c84dac47
Elapsed: 158.61ms

Sha256
dca3b9746da896f05072bdec6b788513029b26ab453b82e2e9d4365e56e2c913
Elapsed: 255.38ms

I am using fill_buf and consume (https://doc.rust-lang.org/std/io/trait.BufRead.html#tymethod.fill_buf) with the intention of reading the file in chunks instead of loading it all into memory; from my understanding that is one of the best ways to avoid consuming too many resources. I also tested this with tokio:

use futures::stream::TryStreamExt;
use ring::digest::{Context, SHA256};
use std::error::Error;
use std::fmt::Write;
use std::time::Instant;
use tokio::fs::File;
use tokio_util::codec::{BytesCodec, FramedRead};

#[tokio::main]
async fn main() {
    let now = Instant::now();
    let checksum = blake("/tmp/wine.json").await.unwrap();
    println!("blake: {}", checksum);
    let elapsed = now.elapsed();
    println!("Elapsed: {:.2?}", elapsed);

    let now = Instant::now();
    let checksum = sha256_digest("/tmp/wine.json").await.unwrap();
    println!("sha256: {}", checksum);
    let elapsed = now.elapsed();
    println!("Elapsed: {:.2?}", elapsed);
}

async fn blake(file_path: &str) -> Result<String, Box<dyn Error>> {
    let file = File::open(file_path).await?;
    let mut stream = FramedRead::new(file, BytesCodec::new());
    let mut hasher = blake2s_simd::State::new();
    while let Some(bytes) = stream.try_next().await? {
        hasher.update(&bytes);
    }
    Ok(hasher.finalize().to_hex().to_string())
}

async fn sha256_digest(file_path: &str) -> Result<String, Box<dyn Error>> {
    let file = File::open(file_path).await?;
    let mut stream = FramedRead::new(file, BytesCodec::new());
    let mut context = Context::new(&SHA256);
    while let Some(bytes) = stream.try_next().await? {
        context.update(&bytes);
    }
    let digest = context.finish();
    Ok(write_hex_bytes(digest.as_ref()))
}

pub fn write_hex_bytes(bytes: &[u8]) -> String {
    let mut s = String::new();
    for byte in bytes {
        write!(&mut s, "{:02x}", byte).expect("Unable to write");
    }
    s
}
Cargo.toml dependencies

[dependencies]
tokio = { version = "0.2", features = ["full"] }
tokio-util = { version = "0.3", features = ["codec"] }
blake2s_simd = "0.5.10"
ring = "0.16.15"

Any advice on what I could optimize to speed up reading the file, or on which blake2* library/method is best for getting faster results? My goal, for now, is to get a hash of a file as fast as possible, so that I can use it as a reference in subsequent tasks.

Thanks in advance

oconnor663 commented 4 years ago

Using a BufReader to avoid reading the entire file into memory is a good idea, yes. But if you look at the docs for fill_buf, you'll see it mention that "this function is a lower-level call." In practice, only the implementer of the BufRead trait needs to concern themselves with fill_buf and consume. The caller can just call read. Trait implementations can be notoriously hard to track down in the docs, but if you look at the implementation of Read for BufReader, you'll see that it calls fill_buf and consume for you automatically. It also includes a nice optimization to skip the buffer when the read destination is very large.
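
A minimal sketch of that simpler approach, calling read on the BufReader directly and letting it handle fill_buf and consume internally (the helper name, path handling, and buffer size here are just illustrative):

use std::error::Error;
use std::fs::File;
use std::io::{BufReader, Read};

fn blake2s_file(file_path: &str) -> Result<String, Box<dyn Error>> {
    let file = File::open(file_path)?;
    let mut reader = BufReader::new(file);
    let mut hasher = blake2s_simd::State::new();
    let mut buf = [0u8; 65536]; // any reasonable chunk size works
    loop {
        // read() delegates to fill_buf/consume, or bypasses the buffer for large reads
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break; // EOF
        }
        hasher.update(&buf[..n]);
    }
    Ok(hasher.finalize().to_hex().to_string())
}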

my goal, for now, is to get as fast as possible a hash from a file so that I could use it as a reference in subsequent tasks

If you want the fastest hash possible, you should use BLAKE3 :)

But maybe you could help me understand what you mean by a reference. One of the tricky points about using optimized hash functions (especially BLAKE3) as a performance yardstick, is that they do a lot of interesting things with SIMD that lead to variable performance. Throughput will vary substantially across different machines depending on what SIMD instruction set extensions the machines support (SSE4.1, AVX2, AVX-512). Kind of related to that, the throughput can also vary a lot depending on the length of the input. At the risk of overwhelming you with information, take a look at figure 3 on page 9 of the BLAKE3 spec. There you can see that BLAKE2s and BLAKE2b are reasonably flat for anything longer than 1 KiB, but the curve for BLAKE3 doesn't really settle down until you're to the right of 64 KiB.
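
If you want to see that length effect on your own machine, a rough sketch (not a rigorous benchmark) is to hash buffers of a few different sizes and print the approximate throughput:

use std::time::Instant;

fn main() {
    for &len in &[1usize << 10, 1 << 16, 1 << 20, 1 << 24] {
        let input = vec![0u8; len];
        let iters = 32;
        let start = Instant::now();
        for _ in 0..iters {
            // swap in blake2s_simd::blake2s(&input) or a SHA-256 call to compare
            blake3::hash(&input);
        }
        let secs = start.elapsed().as_secs_f64();
        let mib_per_s = (len * iters) as f64 / (1 << 20) as f64 / secs;
        println!("{:>9} bytes: {:.0} MiB/s", len, mib_per_s);
    }
}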

On the SHA-256 side of things, you'll also see massive performance variations now that the SHA extensions are finally hitting the consumer market, mainly in recent AMD chips and also in the very latest Intel stuff.

So anyway, this is all to say that if you want a hash function to be a stable performance yardstick for you across different machines, you might need to be very careful about what it is you're measuring. Without knowing your exact use case, it's hard for me to say more.

nbari commented 4 years ago

hi @oconnor663 many thanks. By "reference" I mean that I only need to know the hash (string), nothing else. My use case is uploading multiple files (backups), and I would like to know the hash of each file, where file sizes can vary up to a maximum of 5 TB.

oconnor663 commented 4 years ago

Cool, in that case benchmark it with BLAKE3 and see what happens :)

(Note that BLAKE3 is less than a year old, though, extremely recent by hash function standards. Production applications usually want to be more conservative than that with their crypto choices.)

nbari commented 4 years ago

hi @oconnor663 I just tested it following your advice and it is ~3x faster 🥇

blake2
6deef7c846545544274e423b7d5bdfbf653cc4ad478a1489df056ee8c84dac47
Elapsed: 137.20ms

sha256
dca3b9746da896f05072bdec6b788513029b26ab453b82e2e9d4365e56e2c913
Elapsed: 231.32ms

blake3
9f15a44727fcce9f1a36dbdd222d8db80ad41030ef677d7ecf3cc8f3d30b9a1c
Elapsed: 44.39ms

I tested with:

use std::error::Error;
use std::fs;
use std::io::{BufReader, Read};

pub fn blake3(file_path: &str) -> Result<String, Box<dyn Error>> {
    let file = fs::File::open(file_path)?;
    let mut reader = BufReader::new(file);
    let mut hasher = blake3::Hasher::new();
    let mut buf: [u8; 8192] = [0; 8192]; // chunk size (8K, 65536, etc)

    loop {
        // propagate read errors instead of silently stopping early
        let size = reader.read(&mut buf[..])?;
        if size == 0 {
            break;
        }
        hasher.update(&buf[0..size]);
    }
    Ok(hasher.finalize().to_hex().to_string())
}

Many thanks for the feedback and time on this, great stuff!

oconnor663 commented 4 years ago

If you want to go nuts, and you have a long enough file (anything > 1 MiB is good), you can also try the multithreaded implementation of BLAKE3.
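
Something along these lines, assuming a blake3 release that exposes Hasher::update_rayon behind the rayon Cargo feature (older versions spelled this differently), with a hypothetical helper name:

// Cargo.toml: blake3 = { version = "1", features = ["rayon"] }
use std::error::Error;
use std::fs;

fn blake3_multithreaded(file_path: &str) -> Result<String, Box<dyn Error>> {
    // read the whole file up front; multithreading only pays off on large contiguous input
    let contents = fs::read(file_path)?;
    let mut hasher = blake3::Hasher::new();
    hasher.update_rayon(&contents);
    Ok(hasher.finalize().to_hex().to_string())
}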

oconnor663 commented 4 years ago

The b3sum utility uses multithreading by default, so if you notice that it's a lot faster than your own benchmarks, that's probably why. Multithreading requires a very large buffer size to be effective, at which point it makes more sense to memory map the entire file than to use a read buffer. (The time you would spend waiting on the reader thread to fill the buffer would be a bottleneck.)
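
A rough sketch of that memory-mapped approach using the memmap2 crate (an assumption for illustration; b3sum's internals may differ), again with the rayon feature of blake3:

// Cargo.toml: blake3 = { version = "1", features = ["rayon"] }, memmap2 = "0.9"
use std::error::Error;
use std::fs::File;

fn blake3_mmap_rayon(file_path: &str) -> Result<String, Box<dyn Error>> {
    let file = File::open(file_path)?;
    // Safety: the mapping is only read, and we assume no other process
    // modifies the file while we hash it.
    let mmap = unsafe { memmap2::Mmap::map(&file)? };
    let mut hasher = blake3::Hasher::new();
    hasher.update_rayon(&mmap);
    Ok(hasher.finalize().to_hex().to_string())
}

Newer blake3 releases also provide a Hasher::update_mmap_rayon convenience (behind the mmap feature) that does roughly this in one call.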

nbari commented 4 years ago

hi @oconnor663 I tested b3sum v0.3.6 (cargo install b3sum) on macOS Catalina 10.15.6, but while trying to get the checksum of a ~4 GB ISO it took about 7 minutes. I am using it like this:

$ b3sum ~/Downloads/FreeBSD-12.1-RELEASE-amd64-dvd1.iso
f675a656a7f0cb0d709723021fb5046e7800675bfa2fb57d3c2ba4f1f301b73c 

It seems like it reads the whole file into memory; it used a little more than 3 GB of RAM:

Screenshot 2020-08-18 at 10 08 51
oconnor663 commented 4 years ago

Any chance the file is on a spinning disk? We have a known performance issue with large files in that case: https://github.com/BLAKE3-team/BLAKE3/issues/31. If something like b3sum --num-threads=1 or cat $file | b3sum performs better, the issue is probably disk thrashing.