kaitai-io / kaitai_struct_formats

Kaitai Struct: library of binary file formats (.ksy)
http://formats.kaitai.io

JPEG format file consumes too much image data #639

Closed chalford closed 1 year ago

chalford commented 1 year ago

Hi! I've just discovered Kaitai and I'm loving it!

I'm using the JPEG .ksy from the format library, and I've noticed that in the SOS segment, the image_data attribute is set to size-eos: true. However, the SOS segment header records only the length of the header, not the length of the whole segment, so this type cannot be assigned a sized substream. As a result, the image_data attribute consumes data all the way up to the end of the file, including other markers / segments that should be parsed in their own right.

My understanding is that JPEG requires decoders to scan through the stream of image data for the start of the next marker (a 0xFF byte). However, that is made more complex by "restart markers" and "byte stuffing", which also begin with 0xFF but must be ignored when determining the end of the segment.

Maybe we'd need an exclusion list for a terminator, so that the new image_data attribute looks something like this:

      - id: image_data
        terminator: 0xFF
        terminator-lookahead-exclude:
          - 0x00
          - 0xD0
          - 0xD1
          - ...
        consume: false
        if: marker == marker_enum::sos

In that scheme, we'd need to somehow read ahead a byte when a 0xFF terminator was encountered, to see if it was in the exclude list. If it was, ignore the terminator and continue to read the image_data. If it wasn't, complete the image_data read and move on to the next attribute/type in the sequence.
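
To make the lookahead idea concrete, here is a minimal Python sketch of the scan described above. It is not Kaitai syntax and not how the format library works; the function name `find_image_data_end` and the constant `NON_TERMINATING` are made up for illustration. It assumes the exclude list is 0x00 (byte stuffing) plus the restart markers RST0..RST7 (0xD0..0xD7), and it does not treat 0xFF fill bytes specially.

```python
# Bytes that may follow 0xFF inside the image data without ending it.
# Assumption for this sketch: 0x00 is byte stuffing, 0xD0..0xD7 are the
# restart markers RST0..RST7.
NON_TERMINATING = frozenset([0x00]) | frozenset(range(0xD0, 0xD8))


def find_image_data_end(data: bytes, start: int = 0) -> int:
    """Return the offset of the 0xFF that begins the next real marker.

    Returns len(data) if no terminating marker is found.
    """
    pos = start
    while True:
        pos = data.find(b"\xff", pos)
        # No more 0xFF bytes, or a lone 0xFF at the very end with nothing
        # to look ahead at: treat the rest of the buffer as image data.
        if pos == -1 or pos + 1 >= len(data):
            return len(data)
        # Look ahead one byte: anything outside the exclude list terminates.
        if data[pos + 1] not in NON_TERMINATING:
            return pos
        # Stuffed byte or restart marker: skip both bytes and keep scanning.
        pos += 2


# Usage: stuffed byte, then RST0, then EOI (0xFF 0xD9) ends the image data.
sample = bytes([0x12, 0xFF, 0x00, 0x34, 0xFF, 0xD0, 0x56, 0xFF, 0xD9])
end = find_image_data_end(sample)
print(end)           # 7
print(sample[:end])  # the image data, without consuming the EOI marker
```

Returning the offset of the 0xFF itself, rather than the byte after it, mirrors the consume: false in the proposal: the marker is left in the stream for the next segment to parse.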

However, as I'm brand new to Kaitai, it's very possible that there's already a way to do this with the existing syntax. Is there?

chalford commented 1 year ago

Realised I've raised this in the wrong repo...