spider-gazelle / bindata

BinData - Parsing Binary Data in Crystal Lang
MIT License

Determining how many bytes have been consumed #8

Closed HotPocketRemix closed 4 years ago

HotPocketRemix commented 4 years ago

Is there a way to check the pos of the IO stream as part of a format specification? I didn't see it in the examples, but I'm relatively new to Crystal so I'm not sure if I just missed it. Consuming the remaining bytes isn't appropriate because this is in the middle of the format, not at the end.

More concretely, a few formats that I'd like to parse require alignment after some of the data has been read (for example, after a data chunk has been read, there are padding bytes until the length of the stream has reached a multiple of 16). Normally I'd read a dummy array of the appropriate size and verify that all the entries are 0, but to compute the size of that array I'd need to know the current position of the stream. I could compute it manually, but if there's a lot of data before that point, especially nested data, it would be very difficult to keep track.

(To be clear, I can't just round up the size of the data chunk to the next multiple of 16, because the format pads against the length of the entire file read so far, not just the data chunk.)

stakach commented 4 years ago

Padding based on the entire file read so far is a bit annoying, as you can't use an array, I assume. You're probably best off breaking up the parsing.


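# Parse the fixed-layout pieces as separate BinData classes.
class Header < BinData
  endian big
  uint8 :header_data
end

class Chunk < BinData
  endian big
  uint8 :struct_entries_etc
end

header = io.read_bytes(Header)
chunks = [] of Chunk
loop do
  # io.closed? does not turn true at end of stream, so compare the position
  # with the size instead (assumes an IO with pos and size, e.g. File or IO::Memory).
  break if io.pos >= io.size
  chunks << io.read_bytes(Chunk)
  # Skip the padding, which depends on the absolute stream position.
  io.skip calculate_padding(io.pos)
end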
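
calculate_padding is not defined anywhere in this thread. A minimal sketch, assuming the format pads up to the next multiple of 16 of the absolute stream position (the alignment parameter and its default are illustrative, not part of BinData):

```crystal
# Hypothetical helper: number of padding bytes needed to advance `pos`
# to the next multiple of `alignment`. Returns 0 when already aligned.
def calculate_padding(pos, alignment = 16)
  (alignment - pos % alignment) % alignment
end

# calculate_padding(0)  # => 0  (already aligned)
# calculate_padding(21) # => 11 (21 + 11 = 32)
# calculate_padding(32) # => 0
```

The outer modulo is what keeps an already aligned position from being padded by a full 16 bytes.
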

stakach commented 4 years ago

You can still have a parent class that provides a nice interface.

# continuing from the example above
file = io.read_bytes(StreamingFile)
file.header
file.chunks

Example where I break up the parsing: https://github.com/spider-gazelle/crystal-bacnet/blob/master/src/bacnet/secure_message.cr#L18

HotPocketRemix commented 4 years ago

Unfortunately, the data is contained inside several other nested structures, so splitting the parsing in the middle would also be very difficult. Not the friendliest format I've had to deal with!

Basically, it's something like a RIFF structure, where chunks must be of even length (though in my case, divisible by 16 instead). Chunks can also have subchunks, and those can have subchunks of their own, and so on, so the total number of chunks may not even be known ahead of time, yet each chunk is still padded. RIFF at least just pads the length of each chunk to be even, rather than padding against the whole stream so far.

I'll see if I can make some assumptions to help split up the data so I don't have to do so much computation.

stakach commented 4 years ago

Would it be useful if the IO#pos were passed to the length callbacks? For example:

bytes :padding_bytes, length: ->(io_pos : Int32) { calculate_padding(io_pos) }

That might be possible without being a breaking change.

stakach commented 4 years ago

Actually, by pure coincidence, it looks like the io is available within the callbacks, so you can do:

bytes :padding_bytes, length: -> { calculate_padding(io.pos) }

HotPocketRemix commented 4 years ago

Oh, I didn't even think to try that! That should work perfectly, thanks!