kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby
https://kaitai.io

Parsing hooks #285

Open KOLANICH opened 7 years ago

KOLANICH commented 7 years ago

Currently, if an error occurs, parsing stops, an exception is thrown, and the entire parsing result is discarded.

For example, a file format (```pcap``` in the case I'm working on now) contains some records, and at some point parsing throws an exception. I have added ```print``` calls to the generated parser (I have added a couple of record types to parse USB captures, one for Windows and one for Linux; I'm going to release them soon). The log makes sense (at least the enum values and sizes are the ones expected), but sometimes a broken record is processed and the whole result is discarded. The broken record has an incorrect, insane value for its size, which results in reading past the stream boundaries. The files were generated by Wireshark (converted from pcapng), so I assume they are valid. I have not yet found the cause.

I guess there are good use cases where parsing broken files is needed, so we need a way to deal with such cases. In particular, we need a way to capture the exceptions produced and do some custom processing.

IMHO the best way to deal with this is to have some hooks. What hooks do we need? For every property in a struct, KSC should create a method which is used to parse that field. For a repeated field it should create a pair of methods: one for parsing a single item, and another one, calling the first, for parsing the whole sequence. Methods should be deduplicated: if we have n > 1 properties of the same type, the compiler should generate, depending on the language, n+1 methods (one method to read the type and n references to it, one per member).

What should the method signatures look like? I propose

```
def _read_<field_name>(self, fieldName, buffer)
```

for fields and instances,

```
def _read_element_<field_name>(self, fieldName, index, buffer)
```

for sequence elements, and

```
def _read_type_<type_name>(self, fieldName, buffer)
```

for types.

The example

```yaml
meta:
  id: o
seq:
  - id: a
    type: u4
  - id: b
    type: u4
  - id: c
    type: u4
    repeat: expr
    repeat-expr: b
```

should generate something like

```python
class O(...):
    ...
    def _read_type_u4(self, fieldName, buffer):
        ...

    _read_a = _read_b = _read_element_c = _read_type_u4

    def _read_c(self, fieldName, buffer):
        for ...:
            ... = self._read_element_c(...)
```
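
As a hedged illustration, here is a hand-written, runnable Python model of what such generated code could look like for the type above. Nothing here is emitted by KSC today; the aliasing follows the sketch above, and the per-element index argument from the proposed signature is omitted because the alias shares `_read_type_u4`'s two-argument signature:

```python
import struct
from io import BytesIO

class O:
    def __init__(self, buffer):
        self.a = self._read_a("a", buffer)
        self.b = self._read_b("b", buffer)
        self.c = self._read_c("c", buffer)

    def _read_type_u4(self, field_name, buffer):
        # single shared reader for the u4 type
        return struct.unpack("<I", buffer.read(4))[0]

    # fields of the same type become plain aliases of the shared reader
    _read_a = _read_b = _read_element_c = _read_type_u4

    def _read_c(self, field_name, buffer):
        # sequence reader: calls the per-element hook once per item
        return [self._read_element_c(field_name, buffer)
                for _ in range(self.b)]

# usage: a = 1, b = 2, c = [3, 4]
o = O(BytesIO(struct.pack("<4I", 1, 2, 3, 4)))
print(o.a, o.b, o.c)  # 1 2 [3, 4]
```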
GreyCat commented 7 years ago

Hooking is indeed an interesting feature, yet so far I fail to understand how it would help you in this particular case.

> Currently, if an error occurs, parsing stops, an exception is thrown, and the entire parsing result is discarded.

Actually, in --debug mode, you'll get a "best effort"-filled structure, i.e. everything except for the last element that failed to be read. This is achieved by cleanly separating the "object creation" and "setting object attributes from read values" stages, i.e. the API changes:

```
// Normal API
a = new Foo(kaitaiStream);

// Debug API
a = new Foo(kaitaiStream); // never fails due to read errors
try {
  a._read();
} catch (...) {
  // ...
}
// "a" would still exist afterwards, i.e. here
```

But, AFAIR, it's not implemented for Python.

> The broken record has an incorrect, insane value for its size, which results in reading past the stream boundaries.

The problem with that is that to resume, you need some valid size, and it's nowhere to be found. If you know, for example, that a length of 0xffffffff means zero length of data, you can handle that in the expression language:

```yaml
- id: data
  size: 'len_data == 0xffffffff ? 0 : len_data'
```

If you don't know something like that, then I don't really understand how hooking would help.

KOLANICH commented 7 years ago

That struct was in a substream of known size, so a failure to parse it won't break the rest of the stream. See pcap.ksy and assume the struct in the body has its type replaced with usbpcap.

The debug API won't give you a way to skip the failed element and continue parsing; you need to be inside the loop to do that. Introducing hook points would allow you to save a KSC-generated function parsing an element and replace it with a new one calling the KSC-generated one, which can catch exceptions, do pre- and postprocessing, or do anything you like. This doesn't require any modification of the ksy file.
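
A minimal sketch of that idea, assuming KSC had generated the per-element hook proposed above (it has not; `Usbpcap` and `_read_element_c` here are hypothetical names):

```python
def tolerant(read_element):
    # wrap a (hypothetical) generated per-element reader so that one
    # broken record no longer discards the whole parse result
    def wrapper(self, field_name, index, buffer):
        try:
            return read_element(self, field_name, index, buffer)
        except Exception as exc:
            # the record lives in a size-bounded substream, so we can
            # log the failure and let parsing continue with the rest
            print("record %d is broken: %s" % (index, exc))
            return None
    return wrapper

Usbpcap._read_element_c = tolerant(Usbpcap._read_element_c)
```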

KOLANICH commented 6 years ago

I have also thought about the extension point mechanism in the context of signature matching: the type refers to a special type which tries all the signatures from the library (the ones marked as signatures with a hint) and applies the first one that parses successfully.
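
In plain Python, that dispatch could look like this minimal sketch, assuming each candidate parser raises on a signature mismatch (this is not KS syntax, just an illustration):

```python
def parse_first_matching(candidates, buffer):
    # try each signature-bearing type in turn; the first success wins
    for parse in candidates:
        pos = buffer.tell()
        try:
            return parse(buffer)
        except Exception:
            buffer.seek(pos)  # rewind and try the next signature
    raise ValueError("no known signature matched")
```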

arekbulski commented 6 years ago

Construct does support lazy parsing (so corrupted data can be accidentally skipped), but it does not support this kind of semantics.

KOLANICH commented 6 years ago

I have created a separate issue, since I guess it's better to express that in KS than in third-party code. The hook is still useful there if we need more complex decisions, for example ones involving ML.

arekbulski commented 6 years ago

Construct added hooks, see docs: https://construct.readthedocs.io/en/latest/basics.html#processing-on-the-fly

It's worth pointing out that there is a second, related feature: GreedyRange gained a discard option, so that each item can be parsed, processed by the hook, and then discarded. This way, gigabyte-sized files can be parsed without using gigabytes of RAM.
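
Combining both features might look roughly like this sketch (based on the linked docs; the record layout here is made up):

```python
from construct import Struct, Int32ul, Bytes, GreedyRange, this

def process_record(obj, ctx):
    # hook invoked for each record as soon as it is parsed
    print(len(obj.body))

record = Struct(
    "length" / Int32ul,
    "body" / Bytes(this.length),
) * process_record  # "*" attaches the on-the-fly processing hook

# discard=True drops each item once the hook has seen it, so huge
# captures can be streamed without accumulating them in RAM
records = GreedyRange(record, discard=True)
```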