lithp opened 6 years ago
KS isn't well suited to describing protocols. I guess the best we can do is to describe message formats in KS and protocols in a dialect of Verilog with integrations into KS, and compile everything to target languages.
This is a relatively complex question.
The simple answer is that the whole current implementation of parsers that ksc generates relies heavily on being as "stateless" as possible, and thus normally provides an API that processes a message (i.e. a type) with only 2 outcomes: either the whole type is parsed successfully into an object, or parsing fails with an exception (for example, on a premature end of stream).
It's actually OK for many tasks, but not for all. Currently, there's no "third" way, i.e. any way to get a semi-parsed type, or to get the parsed data gradually (i.e. not as a whole object, but, for example, via a series of callbacks).
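For reference, here is roughly what that looks like from Python today (a minimal sketch; `Foo` is a hypothetical class generated by ksc from some `foo.ksy`, while `from_bytes` is a constructor the Python runtime really provides):

```python
from foo import Foo  # hypothetical module/class generated by ksc from foo.ksy

buf = b"\x01\x02"  # only 2 of the 3 bytes the type needs have arrived so far

try:
    # Outcome 1: the whole type parses and we get a fully populated object.
    msg = Foo.from_bytes(buf)
    print(msg.a, msg.b, msg.c)
except EOFError:
    # Outcome 2: parsing fails (here: premature end of stream) and there is
    # no supported way to inspect what was parsed so far or to resume later.
    print("incomplete message, need to retry with a bigger buffer")
```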
The longer answer is that fundamentally there's nothing wrong with the .ksy language per se that would prevent implementing stuff like that, but so far it hasn't really been implemented. There were several discussions on that matter already; I'll try to recall the most popular suggestions:
The first suggestion is to keep the parsing state inside the object and make `_read()` resumable. This basically means that for a given `seq` like this:
```yaml
seq:
  - id: a
    type: u1
  - id: b
    type: u1
  - id: c
    type: u1
```
... instead of generating stuff like:
```
this.a = this._io.readU1();
this.b = this._io.readU1();
this.c = this._io.readU1();
```
we will be keeping state of the parsing in a distinct variable, i.e. something like:
```
// ctor
_state = 0;

// _read
if (_state <= 0) {
    this.a = this._io.readU1();
    _state = 1;
}
if (_state <= 1) {
    this.b = this._io.readU1();
    _state = 2;
}
if (_state <= 2) {
    this.c = this._io.readU1();
    _state = 3;
}
```
This way, we can treat end-of-stream errors as non-fatal: namely, if EOS is encountered, we can keep the object in a semi-constructed state (and even use it somehow, if that makes sense), and, as time progresses and more data becomes available in the stream, repeated calls to `_read()` would effectively resume parsing from wherever it stopped/crashed for any reason.
Pros: Relatively simple model, i.e. from the end user's point of view, one just needs to call `_read()` => check if the result is complete enough to be useful => use it => repeat. From the developer's point of view, this one can actually be combined more or less seamlessly with "stateless" parsers as well, by introducing some checkpoints (i.e. "subtype X not parsed => rewind to start of X, next time we'll attempt to re-parse the whole of X"). Might integrate OK with tools like the web IDE.
Cons: Does not really help with parsing huge lists of objects that are not supposed to fit in memory. It can be relatively tricky and non-trivial for the end user to make decisions based on that `_state` (not to mention that this naïve integer state model is super ugly and prone to breaking).
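To illustrate the intended usage model (nothing like this exists in today's ksc; `Foo`, `_state`, and `feed_more_bytes` are all hypothetical names for this sketch), the end-user loop could look roughly like this:

```python
foo = Foo(stream)                 # assume the constructor no longer parses eagerly
while foo._state < 3:             # 3 == "a, b and c have all been read" (see sketch above)
    try:
        foo._read()               # resumes from wherever the previous call stopped
    except EOFError:
        feed_more_bytes(stream)   # hypothetical helper: append newly arrived data
    if foo._state >= 1:
        # fields read before the EOS hit stay usable in the semi-constructed object
        print("a is already available:", foo.a)
```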
The second suggestion is to not generate an object at all and use a callback/pubsub-based model. This proposal is much bolder and involves a very different usage model for KS-generated code. It is close to what you're proposing, i.e. instead of getting a single object with properties `a`, `b`, `c` for the example above, we'll be expected to supply callbacks like `on_a`, `on_b`, `on_c` (or subscribe one's handlers to these publication events, which is effectively more or less the same). This way, the generated code would be something like this:
```
interface FooConsumer {
    void onA(int a);
    void onB(int b);
    void onC(int c);
}

void _read(FooConsumer consumer) {
    this._io.readU1().onSuccess((a) => {
        consumer.onA(a);
        this._io.readU1().onSuccess((b) => {
            consumer.onB(b);
            this._io.readU1().onSuccess((c) => consumer.onC(c));
        });
    });
}
```
Obviously, this implies that the runtimes must also be rewritten in an async way, and that the target language supports some way to do that via callbacks / anonymous functions / etc. Reaching the end of the stream in this paradigm is normally not considered an error; it just means that the callback would be delayed until all the data is there.
Pros: Allows very fine tuning of whatever happens with the parsed data, i.e. if one wants a certain in-memory object, it's totally possible, but one needs to build it and track it manually; plays along well with totally async models like node.js; can handle huge repeated stuff, like an endless list of packets, well.
Cons: Lots of manual (and error-prone) work, callback hell is totally possible, very different from the current concept. Always requires tons of custom user-written code to be useful. Not likely to ever be integrated into rapid tools like the web IDE.
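To make the "tons of custom user-written code" point concrete, user code under this model could look roughly like the following Python sketch (the `FooParser` class and `on_*` callback names merely mirror the pseudocode above; no such API is generated today):

```python
class MyConsumer:
    """User-written callbacks; the generated parser builds no object itself."""

    def __init__(self):
        self.triples = []
        self._a = None
        self._b = None

    def on_a(self, a):
        self._a = a                 # the user tracks intermediate state manually

    def on_b(self, b):
        self._b = b

    def on_c(self, c):
        # only now is there enough data to build whatever model the user wants
        self.triples.append((self._a, self._b, c))


parser = FooParser(stream)          # hypothetical generated, callback-driven parser
parser._read(MyConsumer())          # callbacks fire as the bytes become available
```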
Bottom line: no consensus on supporting either (or both) of these proposals has been reached, and nobody has actually done any proof-of-concept for them. In theory, the first one would be easier to implement, but it won't cover all the use cases. The second one brings KS closer to how traditional parser generators (like Lex/Yacc/etc.) work, but it would kind of ruin the original intent of KS being a rapid development tool that provides a "good enough" default stream-to-object mapping.
Wow, thanks for the thorough response, this is much more than I was expecting!
That first option is frustrating because it's just about the perfect use-case for coroutines. Just run the parser in a coroutine which pulls data from a channel and blocks when there's no data left. Each member has an extra field called something like "completed", which you can inspect to see which things are ready for you to look at.
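Something like this hand-rolled Python generator (purely illustrative, not anything ksc emits) is what I have in mind: the parser yields whenever it runs out of bytes, and the caller resumes it by sending in more data:

```python
def foo_parser():
    """Coroutine-style parser for the a/b/c example: yields the fields parsed
    so far whenever it runs out of data; the caller send()s in more bytes."""
    buf = bytearray()
    parsed = {}

    def need(n):
        # "block" (by yielding) until at least n bytes are buffered
        while len(buf) < n:
            more = yield parsed            # caller sees which fields are completed
            if more:
                buf.extend(more)
        chunk = bytes(buf[:n])
        del buf[:n]
        return chunk

    parsed["a"] = (yield from need(1))[0]
    parsed["b"] = (yield from need(1))[0]
    parsed["c"] = (yield from need(1))[0]
    yield parsed                           # final yield: everything is completed


p = foo_parser()
print(p.send(None))         # {}                       - primed, waiting for data
print(p.send(b"\x01"))      # {'a': 1}                 - 'a' done, still waiting for 'b'
print(p.send(b"\x02\x03"))  # {'a': 1, 'b': 2, 'c': 3} - fully parsed
```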
That doesn't map well to most languages though. I still like the idea of putting an `isComplete` method onto each object which you can check to see if it's ready to be read. If every language supports "peek"ing into the stream, then you can just save a pointer to the last object to be completely parsed, and not pull data out of the stream until there's enough of it to complete another object. Pull just the bytes necessary to complete the next object and repeat until the only bytes left don't fit anywhere. Leave those in the stream and return control to the caller.
And you're right that the second one is the more direct equivalent to a streaming API. Personally, I'd prefer the first one, but the second one is probably better, since it satisfies both use cases at once.
"on't generate object at all and use callback/pubsub-based model" -> this is much like a yacc parser, where the user writes the actions. As I see it, kaitai imposes a data model, then people (that may have their own data model) need to write code that gets the instances and adapts it to the target model. A trivial example is: I want to convert raw data into json. The very reason that I need to write and maintain the gluing code causes us to not use such an interesting parsing as kaitai: because it breaks the single responsibility principle, instead of parsing (which does just right), it imposes container classes that require extra code, extra memory, and extra runtime time (store, retrieve, convert).
@dgutson There are several approaches possible, and the fact that KS offers a certain in-memory model for the data can be viewed as either a good thing or a bad thing. For many applications, this "imposed" model is just fine to work with, and it actually saves a lot of keystrokes by not having to design something like that manually.
This is somewhat like XML parsing, which you can do DOM-style or SAX-style. Both can get the job done, but DOM is usually easier to use for most projects, while SAX requires more effort but potentially yields better performance in some scenarios.
A data-to-JSON converter actually exists in the current form: it's available as the ksdump utility, and it relies heavily on Ruby's internal representation of the loaded data.
It's possible to implement pubsub-style (or whatever you want to call it: yacc-style, stream-style, etc.) parsing using .ksy specs, and hopefully it will eventually be done, but it's not a top-priority task right now.
@lithp I just discovered this old issue. I am trying to see how I could use Kaitai with the Postgres protocol. I understand that the Kaitai runtimes do not currently handle long-lived protocols ("2. Kaitai has no support for parsing new messages as they come in."), but a Kaitai protocol description could make sense anyway. Did you happen to write a .ksy file for the Postgres protocol and commit it somewhere? Thanks.
The idea from #843.
1. The user marks checkpoints in `seq`:

```yaml
seq:
  - id: a
    type: u1
  - id: b
    contents: [0xFF]
  - id: point1
    type: _checkpoint
  - id: c
    type: u1
  - id: point2
    type: _checkpoint
  - id: d
    type: u1
```
2. The runtime implements `_read` and `_readLoop`:

```python
def _read(self, finalState=None):
    """Final state is used to process loops"""
    self._finalState = finalState
    self._nextCheckpoint = self.__class__.INITIAL_CHECKPOINT
    self._readLoop()

def _readLoop(self):
    while self._nextCheckpoint is not None:
        self._nextCheckpoint = self._nextCheckpoint()
```
3. The transpiler generates the code:

```python
def _checkpointImplicit0(self):
    self.a = self._io.read_u1()
    self.b = self._io.ensure_fixed_contents(b"\xff")
    return self.point1

INITIAL_CHECKPOINT = _checkpointImplicit0

def point1(self):
    self.c = self._io.read_u1()
    return self.point2

def point2(self):
    self.d = self._io.read_u1()
    return self._finalState
```
Usage: recovering from a corrupted stream (resynchronize on the `0xFF` marker and resume from `point1`):

```python
while s._nextCheckpoint is not None:
    try:
        s._read()
    except:
        nextFFOffset = findNextFF(s._io)
        s._io.seek(nextFFOffset + 1)
        s._nextCheckpoint = s.point1
```

Usage: streaming (on a premature end of stream, fetch more bytes from the socket and retry from the last successfully read position):

```python
lastReadPos = 0
while s._nextCheckpoint is not None:
    try:
        s._read()
        lastReadPos = s._io.pos()
    except:
        getBytesFromTheSocket(s._io)
        s._io.seek(lastReadPos)
```
The compiler must generate checkpoints only if they are required; if not, it should keep generating the code it generates currently.
I'm trying to write a parser for the Postgres protocol.
The biggest problem comes from a combination of two things:
Postgres follows TLV framing for every packet except the very first packet the client sends to the server; that one has to be parsed differently.
I would like to be able to do something like:
However,
So instead, I have to write special code in my application. I have to do something like:
What I would much prefer is to be able to write something like:
It's not a problem that I have, but a streaming parser would also reduce memory requirements when parsing large files, which I'm sure other people would appreciate.
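For reference, the kind of special framing code I mean looks roughly like this (a sketch under my own naming; handing each payload to a ksc-generated per-message class like `RowDescription` is just an illustration, not an existing API):

```python
import struct

def read_exactly(sock, n):
    """Blocking read of exactly n bytes from a socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise EOFError("connection closed mid-message")
        buf += chunk
    return buf

def backend_messages(sock):
    """Yield (type, payload) pairs for regular Postgres messages: a 1-byte
    type code, then an int32 length that includes itself but not the type."""
    while True:
        msg_type = read_exactly(sock, 1)
        (length,) = struct.unpack(">I", read_exactly(sock, 4))
        payload = read_exactly(sock, length - 4)
        # each complete payload could then be fed to a generated parser, e.g.
        # RowDescription.from_bytes(payload)  (hypothetical class name)
        yield msg_type, payload
```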