kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby
https://kaitai.io
3.97k stars 194 forks source link

Combination of repetition and "if" condition causes an error #281

Open deefha opened 6 years ago

deefha commented 6 years ago

I have binary file with following structure:

Any FAT offset can contains zero. Such offset is not valid and have no corresponding content in data section, thus reading data section can be skipped for that offset. I hope I explained it correctly, although it is not so important again.

So I have following Kaitai Struct (excerpt):

...

seq:
  - id: header
    type: t_header
  - id: fat
    type: t_fat

instances:
  fonts:
    pos: fat.offsets[_index]
    type: t_font(fat.offsets[_index])
    if: fat.offsets[_index] != 0
    repeat: expr
    repeat-expr: 63

types:
  t_header:
    seq:
      ...

  t_fat:
    seq:
      - id: count
        type: u4
      - id: offsets
        type: u4
        repeat: expr
        repeat-expr: 63

  t_font:
    params:
      - id: offset
        type: u4
    seq:
      ...

As you can see, in "fonts" instance I'm using simple repetition and absolute positioning by previously gathered FAT offsets. There is also "if" condition for skipping zero offsets. I'm using parametric type "t_font", because why not (my ksc version is 0.8-SNAPSHOT).

The problem is in generated source. For "fonts" instance mentioned above I get this Python code (excerpt):

if self.fat.offsets[i] != 0:
    _pos = self._io.pos()
    self._io.seek(self.fat.offsets[i])
    self._m_fonts = [None] * (63)
    for i in range(63):
        self._m_fonts[i] = self._root.TFont(self.fat.offsets[i], self._io, self, self._root)

    self._io.seek(_pos)

This is obviously incorrect and causes an error (use of variable "i" before exists). Correct code should be slightly rearranged (condition inside repetition):

_pos = self._io.pos()
self._m_fonts = [None] * (63)
for i in range(63):
    if self.fat.offsets[i] != 0:
        self._io.seek(self.fat.offsets[i])
        self._m_fonts[i] = self._root.TFont(self.fat.offsets[i], self._io, self, self._root)

self._io.seek(_pos)

Same situation occurs e.g. in generated PHP source (excerpt), so I think this problem is global indeed:

if ($this->fat()->offsets()[$i] != 0) {
    $_pos = $this->_io->pos();
    $this->_io->seek($this->fat()->offsets()[$i]);
    $this->_m_fonts = [];
    $n = 63;
    for ($i = 0; $i < $n; $i++) {
        $this->_m_fonts[] = new \KlanFont\TFont($this->fat()->offsets()[$i], $this->_io, $this, $this->_root);
    }
    $this->_io->seek($_pos);

I know that my examples are not general, but I cannot describe problem better. I just expected different generated source code than I got. I'd like to help with further exploration.

GreyCat commented 6 years ago

Well, first of all, I don't think that you to use _index and parametric types here at all. This would probably do:

seq:
  - id: header
    type: t_header
  - id: count
    type: u4
  - id: fonts
    type: font_offset
    repeat: expr
    repeat-expr: 63
types:
  font_offset:
    seq:
      - id: offset
        type: u4
    instances:
      body:
        pos: offset
        type: font_body
        if: offset != 0
  font_body:
    seq:
      # real parsing of a font body happens here
GreyCat commented 6 years ago

Anyway, what you're talking about here is somewhat important stuff. There is actually a problem, but probably not the one you've had in mind.

Given all that stuff, it's unlikely that we'll change if + repeat behavior.

instances:
  pos: 0x1000
  type: foo
  repeat: expr
  repeat-expr: 0x10

Right now it reads 0x10 items of foo, starting at offset 0x1000, but if we'll make pos: work inside a loop, then we'll be getting 0x10 items of same data, every item starting at offset 0x1000. I understand that it's tempting to use pos: something[_index], but it would be pretty awkward to make it incompatible with previous pos function and/or to introduce some sort of autodetection on where to put that position setting — inside or outside of the loop.

To summarize that, _index is still somewhat experimental feature. As mentioned in #147, it doesn't do lots of checks it should do. Namely, we should add checks, as outlined in this comment. I'm not 100% sure about pos + _index, but I'm currently inclined to ban it too, and recommend to use it via extra type layer.

deefha commented 6 years ago

OK, thank you for the explanation, finally I understand.

I have no problem with proposed solution (i.e. refactoring my struct), it was actually the very first version of struct I created. For some reasons, I wanted to reflect logical organization of file (header, FAT, data) in generated source, to be able to access these parts independently. But it is not absolutely necessary, so I can use simpler (and as a matter of fact cleaner and much more elegant) approach.

Let's say I just wanted to test the new "_index" feature, which per se I consider really useful :-)

GreyCat commented 6 years ago

Well, as I said, just for the sake of experimenting, you can use _index like with an intermediate type, albeit it's kinda awkward:

seq:
  - id: header
    type: t_header
  - id: count
    type: u4
  - id: offsets
    type: u4
    repeat: expr
    repeat-expr: 63
  - id: fonts
    type: font_offset(_index)
    repeat: expr
    repeat-expr: 63    
types:
  font_offset:
    params:
      - id: idx
        type: u4
    instances:
      body:
        pos: _parent.offsets[idx]
        type: font_body
        if: _parent.offsets[idx] != 0
  font_body:
    seq:
      # real parsing of a font body happens here
deefha commented 6 years ago

I see, using "_index" as type parameter - that's smart! Thank you again :-) It solves almost everything for me.