sharksforarms / deku

Declarative binary reading and writing: bit-level, symmetric, serialization/deserialization
Apache License 2.0

Performance of `read_all` and `count` #439

Open · duskmoon314 opened this issue 1 month ago

duskmoon314 commented 1 month ago

I just found that `read_all` seems to perform worse than `count`, so I'm opening this issue to see whether there is room for improvement.

The Rust code I used to test is below. I defined two simple structs that each wrap a `Vec<u8>` and used Deku to read a `[u8; 1500]` into them.

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use deku::prelude::*;

pub fn read_all_vs_count(c: &mut Criterion) {
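    // AllWrapper: `read_all` keeps reading elements until the reader
    // reports end of input.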
    #[derive(DekuRead, DekuWrite)]
    pub struct AllWrapper {
        #[deku(read_all)]
        pub data: Vec<u8>,
    }

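    // CountWrapper: `count` reads exactly `len` elements, with `len`
    // passed in as context.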
    #[derive(DekuRead, DekuWrite)]
    #[deku(ctx = "len: usize")]
    pub struct CountWrapper {
        #[deku(count = "len")]
        pub data: Vec<u8>,
    }

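    // from_bytes is the convenience entry point; it constructs a reader internally.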
    c.bench_function("read_all_bytes", |b| {
        b.iter(|| AllWrapper::from_bytes(black_box((&[1; 1500], 0))))
    });

    c.bench_function("read_all", |b| {
        b.iter(|| {
            let mut cursor = [1u8; 1500].as_ref();
            let mut reader = Reader::new(&mut cursor);
            AllWrapper::from_reader_with_ctx(black_box(&mut reader), ())
        })
    });

    c.bench_function("count", |b| {
        b.iter(|| {
            let mut cursor = [1u8; 1500].as_ref();
            let mut reader = Reader::new(&mut cursor);
            CountWrapper::from_reader_with_ctx(black_box(&mut reader), 1500)
        })
    });
}

criterion_group!(benches, read_all_vs_count);
criterion_main!(benches);

The output of criterion:

read_all_bytes          time:   [55.840 µs 55.882 µs 55.924 µs]
                        change: [-0.3273% -0.2149% -0.1034%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild

read_all                time:   [57.742 µs 57.794 µs 57.856 µs]
                        change: [+0.6835% +0.8126% +0.9457%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild

count                   time:   [4.2984 µs 4.2997 µs 4.3013 µs]
                        change: [+0.2270% +0.2768% +0.3244%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

The results show that `read_all` takes nearly 14 times as long as `count` (57.794 µs vs. 4.2997 µs, roughly 13.4x) on my machine (x86-64, 2.10 GHz, 80 threads).

Digging into the code:

// read_all
let mut res = capacity.map_or_else(Vec::new, Vec::with_capacity);
loop {
    if reader.end() {
        break;
    }
    let val = <T>::from_reader_with_ctx(reader, ctx)?;
    res.push(val);
}
Ok(res)

// count
let mut res = capacity.map_or_else(Vec::new, Vec::with_capacity);
let start_read = reader.bits_read;
loop {
    let val = <T>::from_reader_with_ctx(reader, ctx)?;
    res.push(val);
    // I replaced the original code here
    count -= 1;
    if count == 0 {
        break;
    }
}
Ok(res)

The two loops are nearly identical; the only difference is the termination check, `count == 0` in one and `reader.end()` in the other. I wonder whether `reader.end()` can be improved to perform better.
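For intuition, here is a rough model of why an end-of-input check can cost more than a counter (a sketch of the general technique over a plain `io::Read`, not deku's actual `Reader::end` implementation):

use std::io::Read;

// Sketch: detecting EOF on an io::Read means attempting a speculative read
// on every probe and stashing the probed byte so the next element read can
// consume it. A counted loop replaces all of this with `count -= 1`.
fn is_end<R: Read>(reader: &mut R, lookahead: &mut Option<u8>) -> std::io::Result<bool> {
    if lookahead.is_some() {
        return Ok(false); // a probed byte is still pending
    }
    let mut byte = [0u8; 1];
    if reader.read(&mut byte)? == 0 {
        Ok(true) // read returned 0 bytes: end of input
    } else {
        *lookahead = Some(byte[0]); // keep the probed byte for the next read
        Ok(false)
    }
}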

I also wonder whether it is possible to read the exact number of bytes in one call instead of looping element by element for `Vec<u8>`, as in the sketch below. But I haven't figured out how to specialize the trait implementation for `Vec<u8>` over the generic `Vec<T>`.
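For illustration, this is the kind of bulk read I have in mind; a standalone sketch over `std::io::Read`, not wired into deku's traits, and `read_exact_bytes` is a hypothetical helper name:

use std::io::Read;

// Hypothetical helper: fill a pre-sized buffer with a single read_exact
// call instead of pushing `len` elements one at a time.
fn read_exact_bytes<R: Read>(reader: &mut R, len: usize) -> std::io::Result<Vec<u8>> {
    let mut buf = vec![0u8; len];
    reader.read_exact(&mut buf)?; // one call instead of `len` iterations
    Ok(buf)
}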

wcampbell0x2a commented 1 month ago

Quickly wrote up a possible solution here: https://github.com/sharksforarms/deku/pull/441

EDIT: Eh, I need to fix the test failures.

wcampbell0x2a commented 1 month ago

> I also wonder whether it is possible to read an exact number of bytes instead of the loop for `Vec<u8>`. But I haven't figured out how to specialize the trait implementation for `Vec<u8>` over the generic `Vec<T>`.

I think the solution here might just be to use a `BufReader`, which will read larger amounts at a time instead of one byte. Syscall-wise, at least.
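For example, a minimal sketch of that idea, assuming the input comes from a file (`data.bin` is a placeholder path):

use deku::prelude::*;
use std::io::BufReader;

#[derive(DekuRead, DekuWrite)]
pub struct AllWrapper {
    #[deku(read_all)]
    pub data: Vec<u8>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Buffer the underlying source so per-element reads hit memory
    // instead of issuing a syscall each time.
    let file = std::fs::File::open("data.bin")?;
    let mut buffered = BufReader::new(file);
    let mut reader = Reader::new(&mut buffered);
    let wrapper = AllWrapper::from_reader_with_ctx(&mut reader, ())?;
    println!("read {} bytes", wrapper.data.len());
    Ok(())
}

Note this wouldn't change the in-memory benchmark above much, since reading from a `&[u8]` never issues syscalls; the buffering pays off when the reader wraps a `File` or a socket.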