tuffy / bitstream-io

Library for reading/writing un-aligned values from/to streams in big-endian and little-endian formats
Apache License 2.0
52 stars 20 forks source link

[WIP] Fix addition of zero values during pushes and logic error. #2

Closed tobz closed 7 years ago

tobz commented 7 years ago

This was otherwise a bug where reading large values, with constituent bits spanning multiple bytes, could lead to a situation where partial reads resulted in a zero value for that read alone, which otherwise would properly shift the accumulated value to generate the correct result, but were getting silently dumped on the floor.

For example:

Bitstream, little-endian ordering: [(00000001)] [000000(00)]

Each set of values in brackets is a single byte, and the target value we want to extract is in the parentheses, spanning 10 bits across two bytes. With the previous code, if we read the remaining two bits, we'd get a zero value, giving us an accumulated value of one, when the real value is actually four.

As well, the push method for LittleEndian was improperly shifting the value, instead of the accumulator, with the accumulated bits instead of the bits for the given value.

The biggest issue here is that the unit tests are all out of whack after this change. I've yet to tackle them because I wanted to run this by you first. For me, this was a legitimate bug, but maybe for your use case, it's not?

Any deep thoughts here? I've gone back and forth on whether or not I should just copypasta the bits I need into my project and be done with it, but figured I'd see if you thought it made sense to forge ahead getting these things upstream.

tuffy commented 7 years ago

For all the little-endian files I've seen, bytes are packed with the most-significant bits last. For instance, take the first 4 bytes of an Intel ordered TIFF:

use std::io::Cursor;
use bitstream_io::{LE, BitReader};

let tiff: [u8;4] = [0x49, 0x49, 0x2A, 0x00];
let mut c = Cursor::new(&tiff);
let mut r = BitReader::<LE>::new(&mut c);
let mut buf = [0,0];
assert!(r.read_bytes(&mut buf).is_ok());
assert_eq!(&buf, b"II");
assert_eq!(r.read::<u16>(16).unwrap(), 42);

WavPack, TrueAudio and other little-endian formats I've seen work similarly. Do you have an example format I could poke at?

tobz commented 7 years ago

Yeah, I'll see if I can whip up a small reproduction case to illustrate my point.

tobz commented 7 years ago

This should do it:

extern crate bitstream_io;

use std::io::Cursor;
use bitstream_io::{BitReader, LittleEndian};

fn main() {
    let rv_dflt: [u8;10] = [0x3D, 0x59, 0xF4, 0xC6, 0x01, 0x00, 0x44, 0x66, 0x6C, 0x74];

    let mut c = Cursor::new(&rv_dflt);
    let mut r = BitReader::<LittleEndian>::new(&mut c);

    // Read the random value.
    let rv = r.read::<u32>(32).unwrap();
    assert_eq!(rv, 1029305542);

    // Read our length value for the game cache string.
    let buf_len = r.read::<u32>(10).unwrap();
    assert_eq!(buf_len, 4);

    // Line things up.
    r.byte_align();

    // Read the game cache value.
    let mut buf = vec![0u8; buf_len as usize];
    r.read_bytes(buf.as_mut_slice()).unwrap();
    assert_eq!(buf.as_slice(), b"Dflt");
}

It immediately will blow up on master trying to compare the "random value", but that's because of the value <<= *bits_acc line which is busted. If you fix that line, it blows up on getting buf_len as it reads 8 bits (0b00000001) and then goes to read the next 2 bits (0b00). If those two bits don't get pushed onto the accumulator (value_acc at 0b00000001 shifted two bytes to the left == 0b0000000100) then the value stays at 1.

This data is from a non-open standard for Starcraft 2/Heroes of the Storm replay files -- these are popular online games.

Now, here's the really wild thing: the way that both my reference code (that I'm using as a reference for writing a Rust-based version) and your code work is that for a little-endian 32-bit value... it will actually read it in big endian!

If you look at the byte order above, for the first four bytes, for those to equal the value I assert it to be -- which is also what my reference code says the integer value of those those bytes represents -- you have to interpret them in big-endian order.

The way that BitQueue is being used to hold the output value of a single read, it's only ever going to read the bits in big-endian order.

tobz commented 7 years ago

So, I'm a little inebriated so my thoughts might not be fully cogent here, but:

my reference code refers to itself as reading little-endian values from a given stream... but this is clearly not the case. It is reading it, as your code does, in big-endian format. However, when I use big-endianness with BitReader, things do not work for me as they do when I'm using little-endianness... so now I'm just really confused. :)

Am I somehow just wildly misinterpreting endianness? Have I been staring at this so long that I have it backwards? Or does the code for both endiannesses in bitstream_io have it backwards?

tuffy commented 7 years ago

Simply switching LittleEndian to BigEndian in your code makes all the assertions pass, so your data must be big-endian. Those initial "Random Value" bytes are a dead giveaway. In a big-endian stream with the most-significant bits first, the value is:

0x3D, 0x59, 0xF4, 0xC6 == 0x3D59F4C6 == 1029305542

If it were a little-endian stream with the least-significant bits first, that value would be:

0x3D, 0x59, 0xF4, 0xC6 == 0xC6F4593D == 3337902397

Though Rust's byteorder crate only works on byte-aligned values, it should give the same results for a 32-bit value in each endianness.

tobz commented 7 years ago

Ugh, the annoying thing is that you're right that this is clearly big-endian behavior but my code only works against this PR while using little endianness. Now that I've spent more time looking, though, I understand what Endianness is doing in the push method for LE and that it's not a bug.

Here's where it's still weird, though:

extern crate bitstream_io;

use std::io::Cursor;
use bitstream_io::{BitReader, BigEndian};

fn main() {
    let init_start: [u8;20] = [0x10, 0x0a, 0x65, 0x72, 0x65, 0x76, 0x6c, 0x79, 0x64, 0x65, 0x75, 0x78, 0x01, 0x00, 0x02, 0xfc, 0xff, 0xff, 0xff, 0x07];

    let mut c = Cursor::new(&init_start);
    let mut r = BitReader::<BigEndian>::new(&mut c);

    let pl = r.read::<u32>(5).unwrap();
    assert_eq!(pl, 16);

    // Read our length value for the game cache string.
    let name_len = r.read::<u32>(8).unwrap();
    assert_eq!(name_len, 10);

    // Line things up.
    r.byte_align();

    // read the player name
    let mut buf = vec![0u8; name_len as usize];
    r.read_bytes(buf.as_mut_slice());
    assert_eq!(buf.as_slice(), b"erevlydeux");
}

So, it's quirky in that things are assembled in big-endian order but they're intended to be read LSB first.

Seems like at this point I just need to write my own bitstream reader, because clearly this format is not following the big endian/MSB unwritten rule.

Appreciate you doing the back-and-forth since it helped me grok this weirdness better. :D