kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby
https://kaitai.io
4.01k stars 196 forks source link

Silent failure when a field references a subsequent field #469

Open webbnh opened 6 years ago

webbnh commented 6 years ago

(I tried to search for similar issue reports, but failed to find any -- sorry if this is a duplicate.)

I have a .ksy file containing the following:

...
types:
  fileheader:
    seq:
...
    - id: serial_number
      process: xor(xor_key & 0xff)
      size: 16
    - id: hash
      process: xor(xor_key & 0xff)
      size: 32
    - id: xor_key
      type: u2
...

The generated code reads the fields in order (which is not surprising), but it processes serial_number when it is read, when xor_key is still uninitialized.

I would have expected either that KSC would have issued a warning/error at compile time or that it would have deferred the processing of serial_number. (Replacing the direct reference to the xor_key expression with an value instance yields the same result.) Having it "silently fail" (i.e., produce bad results at runtime) is...disappointing.

Any suggestions on how I should implement this? (I would expect that I should be able to define an value instance for the serial number, but I don't know how to specify a process attribute for an instance, and I don't know how to apply an XOR operator to each byte of a byte array....)

Thanks!

GreyCat commented 6 years ago

ksc currently provides no guarantees like that, i.e. data in seq is supposed to be read/written in sequence, thus there is no automatic "deferred processing" here by design. I agree that adding a warning and/or error message on precompile step would be nice, but it might be not very trivial to do so, for example, this is easy:

seq:
  - id: a
    size: 16
    process: xor(b)
  - id: b
    type: u1

a uses b, b is defined after a => problem. This is trickier:

seq:
  - id: a
    size: 16
    process: xor(c)
  - id: b
    type: u1
instances:
  c:
    value: b + 1

a uses c, which requires b, which is undefined at time then it will be used. I would assume that there are much trickier ones involving conditions (i.e. a ? b : c), involving recursion and boiling down to a complexity of halting problem.

GreyCat commented 6 years ago

That said, I personally never found it to be a big problem, as typically a target language compiler is good in issuing warning about such behavior in generated code (i.e. usage of uninitialized variables).

KOLANICH commented 6 years ago

133 and #65 are related

webbnh commented 6 years ago

I think the fact that KSC wraps all accesses to the fields inside nice accessor functions masks the fact that the accesses might be to uninitialized memory. (There's no way, inside the accessor function, to tell that the field is uninitialized, and there's probably no way in the calling function (_read()) to tell what the accessor function does, without in-lining at the source code level.)

So, I'm willing to buy into the fact that my .ksy is bad, and I'm willing to grant that it might be hard to detect the use-before-read case, but it might not be hard to defer the processing step, which would solve at least my problem.

The code generated by KSC for me look similar to this:

void records_t::fileheader_t::_read() {
...
    m__raw_serial_number = m__io->read_bytes(16);
    m_serial_number = kaitai::kstream::process_xor_one(m__raw_serial_number, (xor_key() & 255));
    m__raw_hash = m__io->read_bytes(32);
    m_hash = kaitai::kstream::process_xor_one(m__raw_hash, (xor_key() & 255));
    m_xor_key = m__io->read_u2le();
...
}

At least conceptually, it seems like it would be straightforward for the compiler to emit the calls to process_xor_one() after emitting the last read call.

In the absence of that compiler feature, how can I specify to KSC that I want the field processed by a value that I haven't read yet? Can I create a type and apply it lazily? (Or, is there some way to do the processing in a value instance?)

GreyCat commented 6 years ago

Ok, so there are 2 distinct questions:

  1. How to detect bad situations with uninitialized data / do warnings, preferably in ksc compile-time?
  2. How to implement what you want to do in .ksy, i.e. out-of-order parsing?

Second one is easier to answer to in some respects. Existing solutions are:

All manual

seq:
  - id: placeholder
    size: 16 + 32
  - id: xor_key
    type: u2
instances:
  serial_number:
    pos: 0
    size: 16
    process: xor(xor_key & 0xff)
  hash:
    pos: 16
    size: 32
    process: xor(xor_key & 0xff)

Hacky but automatic

seq:
  - id: serial_number_placeholder
    size: 16
    if: serial_number_pos >= 0 # always true, but stores pos
  - id: hash_placeholder
    size: 32
    if: hash_pos >= 0 # always true, but stores pos
  - id: xor_key
    type: u2
instances:
  serial_number_pos:
    value: _io.pos
  serial_number:
    pos: serial_number_pos
    size: 16
    process: xor(xor_key & 0xff)
  hash_pos:
    value: _io.pos
  hash:
    pos: hash_pos
    size: 32
    process: xor(xor_key & 0xff)

And one not-yet-implemented yet lazy seq solution:

seq:
  - id: serial_number
    process: xor(xor_key & 0xff)
    size: 16
    lazy: true
  - id: hash
    process: xor(xor_key & 0xff)
    size: 32
    lazy: true
  - id: xor_key
    type: u2

This is obviously not a panacea, but typically wrapping everything possible in lazy: true will help untangling stuff like that. Obviously it couldn't help with stuff where positions/sizes are not fixed, infinitely recursive stuff, etc.

GreyCat commented 6 years ago

Commenting some of the questions / proposals:

I think the fact that KSC wraps all accesses to the fields inside nice accessor functions masks the fact that the accesses might be to uninitialized memory.

First of all, KSC does not do it everywhere. For example, in JavaScript and Python, one just accesses raw object's variables using foo.bar syntax, there are no getter functions involved. In Go, there is actually a painful distinction between seq fields, which are available as foo.Bar (i.e. just variable access in a structure) and instances, which require foo.Bar() call, which is not only returning the value of the instance, but also an error code (as any instance might hit an error during parsing).

Second, I don't think that detection of uninitialized memory access is particularly hard in runtime even when it's not done automatically by the language, i.e. we can in theory prepare an extra boolean field for every attribute that will start as false, and will be set to true after successful definition being done. Then, a getter for foo would actually look like:

if self._i_foo:
    return self.foo
else:
    raise Exception("foo is accessed, but it is yet undefined")

May be it's a good idea to have some sort of flag to generate that kind of "safety" machinery. It will be obviously slower (and less readable), but definitely safer and won't allow that silent corruption you've mentioned.

At least conceptually, it seems like it would be straightforward for the compiler to emit the calls to process_xor_one() after emitting the last read call.

No, it's not:

seq:
  - id: a
    type: foo
    size: 16
    process: xor(0xaa)
  - id: b
    size: a.some_attribute

In this case, b is dependent on a being fully parsed and available to determine b's size.

FireyFly commented 6 years ago

I don't think it's completely infeasible to support this in "sane" cases where the graph of "field cross-references" forms a DAG: for each field, you presumably know which other fields its process function will reference, statically. You can then perform a topological ordering, and then compute the process functions in that order.

(That said, it's definitely a new feature, I guess)

GreyCat commented 6 years ago

@FireyFly It is definitely possible, but I just wanted to point you that it's trickier than it looks like. process is actually the easiest of them all, as it does not really affect processing of the stream beyond the attribute in any way. size, if, type switching, repeat-expr, repeat-until — all these guys do affect stream state. This one is unsolvable for reading, but technically fully solvable for writing, for example:

seq:
  - id: a
    size: b
  - id: b
    type: u4

And this one looks very similar it, but it is actually totally ok, as sum of size(a)+size(b) is const.

seq:
  - id: a
    size: 20 + c
  - id: b
    size: 20 - c
  - id: c # actually always starts at constant offset = 20 + c + 20 - c = 40
    type: u4
webbnh commented 6 years ago

@FireyFly, I think constructing and traversing the DAG would allow the compiler to flag cases which aren't sane (which would have been a boon to me...). However, as @GreyCat has demonstrated, the constraint of having to read the fields in order can conflict with a requirement to process them out of order.

@GreyCat, thanks as always for your prompt attention and cogent responses. I plan to pursue the "Hacky but automatic" solution -- since the fields occur in the middle of my record, have to calculate and maintain the offsets manually is unappealing. And, as for referencing the fields out of order, that is the sort of thing up with which I shall not put. ;-)