kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby
https://kaitai.io

Add support for a streaming API (as in JSON or XML streaming parsers) #449

Open lithp opened 6 years ago

lithp commented 6 years ago

I'm trying to write a parser for the Postgres protocol.

The biggest problem comes from a combination of two things:

  1. Postgres follows TLV for every packet except for the first packet, which the client sends to the server; that one has to be parsed differently.

    I would like to be able to do something like:

    seq:
      - id: startup
        type: startup_message
      - id: messages
        type: message
        repeat: eos

    However,

  2. Kaitai has no support for parsing new messages as they come in.

So instead, I have to write special code in my application. I have to do something like:

first_message = True
def on_message(message):
  global first_message
  if first_message:
    first_message = False
    return pgprotocol_initial.Pgprotocol_initial(message.content)
  return pgprotocol_subsequent.Pgprotocol_subsequent(message.content)

What I would much prefer is to be able to write something like:

kaitai_parser = pgprotocol.StreamingParser(stream)
def on_message(stream, kaitai_parser, message):
  stream.write(message.content)
  if kaitai_parser.has_next():
    return kaitai_parser.next()

It's not a problem that I have, but a streaming parser would also reduce memory requirements when parsing large files, which I'm sure other people would appreciate.

KOLANICH commented 6 years ago

KS isn't well suited to describing protocols. I guess the best we can do is to describe message formats in KS and protocols in a dialect of Verilog with integrations to KS, and compile everything to target languages.

GreyCat commented 6 years ago

This is a relatively complex question.

The simple answer is that the whole current implementation of the parsers that ksc generates relies heavily on being as "stateless" as possible, and thus normally provides an API that can process a message (i.e. a type) with only 2 outcomes: either the whole type is parsed successfully into a complete object, or parsing fails with an error.

It's actually ok for many tasks, but not for all. Currently, there's no "third" way, i.e. no way to get a semi-parsed type, or to get the parsed data gradually (i.e. not as a whole object, but, for example, via a series of callbacks).

The longer answer is that fundamentally there's nothing wrong with the .ksy language per se that would prevent implementing stuff like that, but so far it hasn't really been implemented. There have already been several discussions on that matter; I'll try to recall the most popular suggestions:

Keep "state" of parsing + allow access to incomplete object

This basically means that for a given seq like this:

seq:
  - id: a
    type: u1
  - id: b
    type: u1
  - id: c
    type: u1

... instead of generating stuff like:

        this.a = this._io.readU1();
        this.b = this._io.readU1();
        this.c = this._io.readU1();

we would be keeping the state of the parsing in a distinct variable, i.e. something like:

// ctor
_state = 0;

// _read
if (_state <= 0) {
    this.a = this._io.readU1();
    _state = 1;
}
if (_state <= 1) {
    this.b = this._io.readU1();
    _state = 2;
}
if (_state <= 2) {
    this.c = this._io.readU1();
    _state = 3;
}

This way, we can treat end-of-stream errors as non-fatal: namely, if EOS is encountered, we can keep the object in a semi-constructed state (and even use it somehow, if that makes sense), and, as more data becomes available in the stream, repeated calls to _read() would effectively resume parsing from wherever it stopped/crashed.
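For illustration only, a caller under such a resumable model might look roughly like the Python sketch below. Everything here is hypothetical: Foo stands for a generated class that defers _read(), keeps its _state between calls, and lets the runtime's EOFError escape when the buffered data runs out; none of this exists in current generated code.

```python
import io

buf = io.BytesIO()                    # bytes received from the network so far
# foo = Foo(KaitaiStream(buf))        # hypothetical resumable parser object
#                                     # (KaitaiStream is the real runtime wrapper)

def feed(foo, data):
    """Append newly received bytes, then try to resume parsing."""
    read_pos = buf.tell()             # position where parsing stopped last time
    buf.seek(0, io.SEEK_END)
    buf.write(data)                   # append the new bytes at the end
    buf.seek(read_pos)                # restore the parse position
    try:
        foo._read()                   # resumes from its saved _state
        return True                   # the object is now fully parsed
    except EOFError:                  # the Python runtime raises EOFError on short reads
        return False                  # still incomplete; call feed() again later
```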

Pros

Relatively simple model, i.e. from the end-user's point of view, one just needs to call _read() => check if the result is complete enough to be useful => use it => repeat. From the developer's point of view, this one can actually be combined more or less seamlessly with "stateless" parsers as well, by introducing some checkpoints (i.e. "subtype X not parsed => rewind to start of X, next time we'll attempt to re-parse the whole of X"). Might integrate ok with tools like the IDE.

Cons

Does not really help with parsing huge lists of objects that are not supposed to fit in memory. Can be relatively tricky and non-trivial for the end-user to make decisions based on that _state (not to mention that this naïve integer state model is super ugly and prone to breaking).

Don't generate object at all and use callback/pubsub-based model

This proposal is much bolder and involves a very different usage model for KS-generated code. This is close to what you're proposing, i.e. instead of getting a single object with properties a, b, c for the example above, we'd be expected to supply callbacks like on_a, on_b, on_c (or subscribe one's handlers to these publication events, which is effectively more or less the same). This way, the generated code would be something like this:

interface FooConsumer {
  void onA(int a);
  void onB(int b);
  void onC(int c);
}

void _read(FooConsumer consumer) {
    this._io.readU1().onSuccess((a) => {
        consumer.onA(a);
        this._io.readU1().onSuccess((b) => {
            consumer.onB(b);
            this._io.readU1().onSuccess((c) => consumer.onC(c));
        });
    });
}

Obviously, this implies that the runtimes must also be rewritten in an async way and that the target language supports some way to do that via callbacks / anonymous functions / etc. Getting to the end of the stream in this paradigm is normally not considered an error; it just means that the callback would be delayed until all the data is there.
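To illustrate the user side of that model, here is a rough Python sketch of a consumer; the on_a / on_b / on_c hooks mirror the hypothetical FooConsumer interface above, and nothing in it is existing Kaitai Struct API.

```python
# Sketch of the consumer side of the callback/pubsub model: the parser
# stores nothing itself, the consumer decides what (if anything) to keep.
class FieldCollector:
    def __init__(self):
        self.fields = {}

    def on_a(self, a):
        self.fields["a"] = a          # keep it, if we want an in-memory object

    def on_b(self, b):
        self.fields["b"] = b

    def on_c(self, c):
        print("got c =", c)           # or just act on a value and drop it

# hypothetical usage: callbacks fire as soon as each field's bytes arrive
# parser = Foo(stream)
# parser._read(FieldCollector())
```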

Pros

Allows very fine-grained control over whatever happens with the parsed data, i.e. if one wants a certain in-memory object, it's totally possible, but one needs to build it and track it manually; plays along well with totally async models like node.js; can handle huge repeated stuff like an endless list of packets well.

Cons

Lots of manual (and error-prone) work, callback hell is totally possible, very different from the current concept. Always requires tons of custom user-written code to be useful. Not likely to ever be integrated into rapid tools like the web IDE.


Bottom line: no consensus on supporting either (or both) of these proposals has been reached, and nobody has actually done any proof-of-concept for them yet. In theory, the first one would be easier to implement, but it won't cover all the use-cases. The second one brings KS closer to how traditional parser generators (like Lex/Yacc/etc.) work, but it would kind of ruin the original intent of KS being a rapid development tool that provides a "good enough" default stream-to-object mapping.

lithp commented 6 years ago

Wow, thanks for the thorough response, this is much more than I was expecting!


That first option is frustrating because it's just about the perfect use-case for coroutines. Just run the parser in a coroutine which pulls data from a channel and blocks when there's no data left. Each member has an extra field called something like "completed", which you can inspect to see which things are ready for you to look at.

That doesn't map well to most languages though. I still like the idea of putting an isComplete method onto each object which you can check to see if it's ready to be read. If every language supports "peek"ing into the stream, then you can just save a pointer to the last object to be completely parsed, and not pull data out of the stream until there's enough of it to complete another object. Pull just the bytes necessary to complete the next object and repeat until the only bytes left don't fit anywhere. Leave those in the stream and return control to the caller.
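For what it's worth, the coroutine idea can be sketched with a plain Python generator that suspends whenever it needs more bytes and resumes when the caller sends them in. This is only an illustration of the control flow, not anything Kaitai Struct generates; the framing assumed below is Postgres-style TLV (1-byte tag, then a 4-byte big-endian length that counts itself but not the tag).

```python
def message_parser(on_message):
    """Yield-based 'coroutine' parser: suspends when it needs more bytes."""
    buf = b""
    while True:
        while len(buf) < 5:                   # need tag + length
            buf += yield                      # suspend until more data is sent in
        length = int.from_bytes(buf[1:5], "big")
        total = 1 + length                    # tag byte + length field + body
        while len(buf) < total:
            buf += yield
        on_message(buf[0:1], buf[5:total])    # hand over one complete message
        buf = buf[total:]                     # keep any leftover bytes

# usage: prime the generator, then feed it bytes as they arrive off the wire
p = message_parser(lambda tag, body: print(tag, len(body)))
next(p)
p.send(b"R" + (8).to_bytes(4, "big") + b"\x00\x00\x00\x00")   # prints: b'R' 4
```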


And you're right that the second one is the more direct equivalent of a streaming API. Personally, I'd prefer the first one, but probably the second one is better, since it satisfies both use-cases at once.

dgutson commented 6 years ago

"on't generate object at all and use callback/pubsub-based model" -> this is much like a yacc parser, where the user writes the actions. As I see it, kaitai imposes a data model, then people (that may have their own data model) need to write code that gets the instances and adapts it to the target model. A trivial example is: I want to convert raw data into json. The very reason that I need to write and maintain the gluing code causes us to not use such an interesting parsing as kaitai: because it breaks the single responsibility principle, instead of parsing (which does just right), it imposes container classes that require extra code, extra memory, and extra runtime time (store, retrieve, convert).

GreyCat commented 6 years ago

@dgutson There are several approaches possible, and the fact that KS offers a certain in-memory model for the data can be viewed both as a good thing and as a bad thing. For many applications, this "imposed" model is just fine to work with, and it actually saves a lot of keystrokes over designing something like that manually.

This is close to XML parsing, which you can do DOM-style or SAX-style. Both can get the job done, but DOM is usually easier to use for most projects, while SAX requires more effort but potentially yields better performance in some scenarios.

Data-to-JSON conversion actually exists in the current model; it's available as the ksdump utility, and it relies heavily on Ruby's internal representation of the loaded data.

It's possible to implement pubsub-style (or whatever you want to call it: yacc-style, stream-style, etc.) parsing using .ksy specs, and hopefully it will eventually be done, but it's not a top-priority task right now.

jeromew commented 4 years ago

@lithp I just discovered this old issue. I am trying to see how I could use kaitai with the postgres protocol. I understand that the kaitai runtimes do not currently handle long-lived protocols ("2. Kaitai has no support for parsing new messages as they come in."), but that a kaitai protocol description could make sense anyway. Did you happen to write a .ksy file for the postgres protocol and commit it somewhere? Thanks.

KOLANICH commented 3 years ago

The idea from #843.

  1. A spec designer decides where to put checkpoints. Checkpoints are points in the parsing state machine of a seq:
seq:
  - id: a
    type: u1
  - id: b
    contents: [0xFF]
  - id: point1
    type: _checkpoint
  - id: c
    type: u1
  - id: point2
    type: _checkpoint
  - id: d
    type: u1
  2. These functions are part of the runtime:

def _read(self, finalState=None):
    """Final state is used to process loops"""
    self._finalState = finalState
    self._nextCheckpoint = self.__class__.INITIAL_CHECKPOINT
    self._readLoop()

def _readLoop(self):
    while self._nextCheckpoint is not None:
        self._nextCheckpoint = self._nextCheckpoint()


  3. The transpiler generates the code:

```python
def _checkpointImplicit0(self):
    self.a = self._io.read_u1()
    self.b = self._io.ensure_fixed_contents(b"\xff")
    return self.point1

INITIAL_CHECKPOINT = _checkpointImplicit0

def point1(self):
    self.c = self._io.read_u1()
    return self.point2

def point2(self):
    self.d = self._io.read_u1()
    return self._finalState
```
  4. If the code fails, the user can resynchronize:

while s._nextCheckpoint is not None:
    try:
        s._read()
    except:
        nextFFOffset = findNextFF(s._io)
        s._io.seek(nextFFOffset + 1)
        s._nextCheckpoint = s.point1
  5. For network streams it's essentially the same code, just with no seeks; instead there is a buffer and blocking calls:

lastReadPos = 0
while s._nextCheckpoint is not None:
    try:
        s._read()
        lastReadPos = s._io.pos()
    except:
        getBytesFromTheSocket(s._io)
        s._io.seek(lastReadPos)

The compiler must generate checkpoint code only when it is required; if not, it should generate the code it generates currently.