Open GreyCat opened 5 years ago
These new properties are in fact doing a job too similar to `size` and `pos`, so it may make sense to merge them there:

```yaml
pos: _io.size
scan-start: kaitai.byte_reverse(0x42)
```

can be

```yaml
pos: _io.size - kaitai.byte_reverse(0x42)
```
A scanning op is translated into an offset from the current position.
Arithmetic ops are not commutative: `a*f()` is not allowed; `(a+f()+b)*x` is translated into:

```python
for i in range(x):
    self._io.seek(self._io.pos() + a)
    f(self._io)  # seeks internally
    self._io.seek(self._io.pos() + b)
```
An instance

```yaml
pos: ...
value: kaitai.byte_reverse(0x42)
```

would just result in the offset of the first `0x42` in the backward direction.
Algo ideas:

For value instances with `pos`: the initial pos is remembered, then the function doing the work is called, then the new pos is retrieved, the original one is subtracted, the resulting value is stored, and the original pos is restored.
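A minimal Python sketch of that algo idea, using a plain `io.BytesIO` stream for illustration (the helper names here are made up, not part of the proposal):

```python
import io

def value_instance_offset(stream, scan_func):
    """Sketch: evaluate a scanning value instance.

    Remember the initial position, let scan_func move the stream by
    scanning, store the distance travelled as the value, then restore
    the original position.
    """
    orig = stream.tell()
    scan_func(stream)             # scanning moves the stream pointer
    value = stream.tell() - orig  # new pos minus original pos
    stream.seek(orig)             # restore the original pos
    return value

def find_0x42(stream):
    """Toy scanner: advance until a 0x42 byte has been consumed."""
    while stream.read(1)[0] != 0x42:
        pass

s = io.BytesIO(b"abc\x42xyz")
print(value_instance_offset(s, find_0x42))  # 4
print(s.tell())                             # 0 -- position was restored
```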
So `terminator` may also be eliminated: `size: kaitai.byte(...)` will do the trick. It is shorter than `terminator`, so IMHO it should be replaced immediately (the stuff is alpha, breaking changes may be introduced; also, we may want a script that automatically migrates old specs to a new language version, so we wouldn't have to stockpile deprecated features in the compiler code base, but do it in a migration script).
> two-byte terminator working at groups of 2 bytes every time

Do you mean alignment?
I think the modes you have given examples of should be hidden behind a single interface: `kaitai.bytes`. And we need a separate namespace for scanners: we already have `kaitai.compress`, and probably more stuff will be added in the future, so we need to have it organized from the very beginning, otherwise it will be a mess.
> These new properties are in fact doing a job too similar to `size` and `pos`, so it may make sense to merge them there:
What I want to emphasize here is that scanning typically involves a lot of reading (trial-and-error). This process eventually moves IO stream position, and this is actually what we want. There's no point in going back and forth like "memorize old position, scan, get & return new position, jump back to old position, realize that we've got new position, jump to the new position".
> arithmetic ops are not commutative: ...
This is much, much more complex to implement, has lots of restrictions due to these symbolic ops, potentially leads to more arithmetic, jumps back and forth, etc. With all that, I don't really see any upsides to this solution.
> Do you mean alignment?
Effectively, yes.
> I think the modes you have given examples of should be hidden behind a single interface: `kaitai.bytes`. And we need a separate namespace for scanners: we already have `kaitai.compress`, probably more stuff will be added in the future, so we need to have it organized from the very beginning, otherwise it will be a mess.
There will be a separate namespace, as all these monikers are processed in separate YAML keys, i.e.:

* `kaitai.foo.bar` for scanners will likely translate into something like `io.kaitai.struct.scanners.foo.Bar` in Java
* `foo.bar.baz` for scanners will be `foo.bar.Baz`

For processing routines, it can be:

* `kaitai.foo.bar` => `io.kaitai.struct.processing.foo.Bar`
* `foo.bar.baz` => `foo.bar.Baz`
> What I want to emphasize here is that scanning typically involves a lot of reading (trial-and-error). This process eventually moves IO stream position, and this is actually what we want. There's no point in going back and forth like "memorize old position, scan, get & return new position, jump back to old position, realize that we've got new position, jump to the new position".
The proposed syntax is just an interface; it just behaves as if the funcs were returning an offset from the current position, where the current position was given by a preceding part of the expression. The underlying impl doesn't have to compute any expressions matching `pos` or `size` explicitly, or do backward seeks and then forward seeks. And in order to do that, we may need a subsystem for passing contracts of external runtime functions into the compiler. I mean: whether it seeks or returns a value, how many args it has, what their types are, etc., as proposed in #500.
> This is much, much more complex to implement, has lots of restrictions due to these symbolic ops, potentially leads to more arithmetic, jumps back and forth, etc. With all that, I don't really see any upsides to this solution.
In fact, not all arithmetic ops, but only the ones involved in non-commutative function calls. So some ops in the expression are commutative, and some are not.
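To illustrate why order matters, here is a toy model (my own, not the proposed implementation) in which each numeric literal is a relative seek and each function call scans from the current position, so "+" stops being commutative even though addition on plain numbers is:

```python
class Stream:
    """Tiny stand-in for an IO stream with a position pointer."""
    def __init__(self, data):
        self.data = data
        self.p = 0

    def seek_rel(self, n):
        self.p += n

    def scan_byte(self, b):
        """Scan forward from the current position for byte b;
        return the offset this term contributed."""
        start = self.p
        self.p = self.data.index(b, self.p) + 1
        return self.p - start

io_a = Stream(b"x\x42yyy\x42z")
# a + f() with a=2: skip 2 bytes first, then scan for 0x42 from there
io_a.seek_rel(2)
off = 2 + io_a.scan_byte(0x42)
print(off)  # 6 -- found the *second* 0x42

io_b = Stream(b"x\x42yyy\x42z")
# f() + a: scan first, then add 2 -- a different result
off2 = io_b.scan_byte(0x42) + 2
print(off2)  # 4 -- found the *first* 0x42
```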
> There will be a separate namespace, as all these monikers are processed in separate YAML keys
Yes, I have thought about it. But

> `foo.bar.baz` for scanners will be `foo.bar.Baz`

feels like an inconsistency. Maybe third-party packages must also contain the needed namespace, and we should add the info about where it lies into the KS "header" for opaque funcs?
I agree with the big proposal.
Just a question: what is the point of having two different implementations, Byte and Bytes? We can merge those things. Following the example of @GreyCat in Java, a possible implementation in Python would be:
```python
class Scanner:
    def __init__(self, pattern):
        self._pattern = pattern
        self._len = len(pattern)
        self._border = self._kmp_pre()

    def _kmp_pre(self):
        # KMP failure function: border[i] is the length of the longest
        # proper prefix of pattern[:i] that is also a suffix of it
        border = [-1]
        match_index = -1
        for i in range(self._len):
            while match_index >= 0 and self._pattern[i] != self._pattern[match_index]:
                match_index = border[match_index]
            match_index += 1
            border.append(match_index)
        return border

    def _scan(self, io):
        match_index = 0  # KMP search state
        while True:
            c = io.read(1)[0]  # raises at EOF if pattern is absent
            while match_index >= 0 and c != self._pattern[match_index]:
                match_index = self._border[match_index]
            match_index += 1
            if match_index == self._len:  # it's a match
                return

    def scanToStart(self, io):
        self._scan(io)
        io.seek(io.pos() - self._len)

    def scanToEnd(self, io):
        self._scan(io)
```
The complexity here is O(M), where M is the number of bytes between the initial position of the IO and its end position.

The implementation I used is from my team's algorithmic competition notebook, icpc-notebook-vasito.
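To show the scanToStart/scanToEnd semantics on a concrete stream, here is a simplified, self-contained sketch: the `ByteStream` shim (mimicking a Kaitai-style `pos()`/`seek()`/`read()` interface) and the naive search are mine, standing in for the KMP version above:

```python
class ByteStream:
    """Minimal stand-in for a Kaitai-style stream: pos()/seek()/read()."""
    def __init__(self, data):
        self._data = data
        self._pos = 0

    def pos(self):
        return self._pos

    def seek(self, n):
        self._pos = n

    def read(self, n):
        chunk = self._data[self._pos:self._pos + n]
        self._pos += len(chunk)
        return chunk

def scan_to_end(io, pattern):
    """Naive scan: advance io until pattern has just been consumed."""
    window = b""
    while window != pattern:
        c = io.read(1)
        if not c:
            raise EOFError("pattern not found")
        window = (window + c)[-len(pattern):]

def scan_to_start(io, pattern):
    """Same, but leave io positioned at the start of the match."""
    scan_to_end(io, pattern)
    io.seek(io.pos() - len(pattern))

io = ByteStream(b"abcENDxyz")
scan_to_end(io, b"END")
print(io.pos())  # 6: just past "END"
io.seek(0)
scan_to_start(io, b"END")
print(io.pos())  # 3: at the "E"
```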
> `kaitai.bytes_group(2, [0, 0])` — two-byte terminator working at groups of 2 bytes every time

I don't know what that means or what the purpose is. I read a little bit but I couldn't figure out the usage.
Just synced with @tinrodriguez8 in the chat. Here's the broad plan: we start with simple and very basic stuff. Let's start with, say, 2 or 3 languages (Python and Java sound like a good plan), and eventually add more as it grows.
Please check that:
We'll need to create some tests and everything needed for them to run. Tests generally are to be put in the tests repo. Given that this whole thing is somewhat similar to custom processing, it's best to look up the relevant tests and their implementations, e.g. the process_custom.ksy test.
In general, for this to run, we'll need:

* Ideally, we'd like to move away from writing specs manually for every single language, and there's a thing called KST translator, but I'm not sure at this stage it will be possible to make use of it, given that these specs would likely require loading/importing of external stuff. Or maybe it will work?
* A Python custom scanner probably won't require anything extra, but a Java custom scanner will probably need to implement a certain formal interface. If we're happy with the proposed interface, then we can just start with putting it into the java runtime repo, along with CustomDecoder (which is the interface for custom processing routines).
* For the contents of the test custom scanner, I'd start, again, with something very basic and simple. For example, a scanner that reproduces `terminator` behavior, maybe even with a hard-coded byte terminator value.
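Such a hard-coded test scanner could be as small as the following sketch (the class and method names are my assumptions — the actual formal interface is still to be designed):

```python
class ScanUntilZero:
    """Hypothetical test scanner reproducing `terminator: 0` behavior."""

    def scan(self, data, start):
        # Return the index just past the first 0x00 at or after `start`;
        # raises ValueError if no terminator is present.
        return data.index(0, start) + 1

scanner = ScanUntilZero()
print(scanner.scan(b"abc\x00def", 0))  # 4
```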
After that we'll have the tests which would obviously fail at first due to lack of support in the compiler. Then we can start working on the compiler.
It will need updates in 2 places:

* Format parsing: `scan-start` and `scan-end` keys in seq attributes / instances. Format parsing is generally done in `shared/src/main/scala/io/kaitai/struct/format`, namely AttrSpec / InstanceSpec.
* Code generation: this would probably be the hardest of all, as it involves several different files in the code gen pipeline. Probably it would be easier to reach out to me at that stage; I can explain in more detail.
For visibility: @tinrodriguez8 started a PR here: https://github.com/kaitai-io/kaitai_struct_compiler/pull/166
Driving scenarios
Normally, binary formats are well-defined and do not require a parser to employ trial-and-error tactics to pinpoint exact locations of the data.

However, in practice, in quite a few cases we need to go to a location identified not by some sort of exact address, but by some arbitrary properties of what's around that location. The simplest case of that is "read the string until the null byte appears", which is currently taken care of by `terminator` / `consume` / `include` semantics. Much more complicated use cases exist, for example, scanning until a `00 00` group is encountered.

The proposal is to make a unified framework that could be used as both:

* a `pos: ...` substitute in instances, to be executed before parsing of the element to find the starting position
* a `terminator: ...` or `size: ...` substitute, to determine the ending position for the element

Terminology proposal
Let's call it "scanning", as this is actually what would be going on. "Seeking" is a bad choice of word, as typically `seek` is an API call which moves the stream pointer to a certain pinpointed location and does not involve any scanning actions. "Searching" is a more generic term, which does not reflect the fact that this is a pretty expensive operation.

KSY syntax proposal
* In `instances`, one would be able to specify a `scan-start` key with the "scan specification" described below. When the instance is invoked, the first thing it will do, before the actual reading, is invoke the scan algorithm to position the stream pointer properly.
* Instead of `size` or `terminator`, one would be able to specify a `scan-end` key with a "scan specification" as well.

The contents of this "scan specification" would be akin to current "process" specifications, i.e. `scan_mode_name(param1, param2, ...)`. We can start with a few "scan modes" and add more of them later.

Initial ideas for scan modes to implement:
* `kaitai.byte(0)` — reimplements single-byte `terminator`: scan until a certain byte is encountered (0 in this example)
* `kaitai.bytes([0x45, 0x4e, 0x44])` — scan until a sequence of 3 bytes is encountered
* `kaitai.bytes_group(2, [0, 0])` — two-byte terminator working at groups of 2 bytes every time
* `kaitai.byte_reverse(0)` — same as "byte", but scans in the reverse direction

Examples of usage
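As a sketch of how the proposed keys might appear in a spec (the field names and overall layout are mine; only the `scan-start`/`scan-end` keys and the scan mode names come from the proposal):

```yaml
seq:
  - id: magic_payload
    scan-start: kaitai.bytes([0x2a, 0x2a, 0x2a])  # position after "***"
    type: u4
  - id: name
    scan-end: kaitai.byte(0)  # same effect as `terminator: 0`
    type: str
    encoding: ASCII
```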
* Scans for the `***` ([42, 42, 42]) 3-byte sequence, then parses a `u4` after that.
* The same as `terminator: 0`.
* Scans until `OrbShdr` is encountered, captures everything up to that byte sequence as a substream and parses it using the `my_custom_body` type.
* The same, but includes `OrbShdr` into the substream as well.
* For instances, `scan-start` can be combined with `pos`. In this case, `pos` is executed first, and scanning starts working after that. This scans from the end of the file backwards, searching for the first 0x42.

Implementation proposal
All scan specifications will internally invoke a certain implementation class, very similar to the current custom processing classes implementation. Scan mode names will map to these class names. The special prefix `kaitai.` would be used to designate scan mode implementations bundled with the runtime library; for Java, for example, this could be a formal scanner interface in the runtime.

An implementation for `kaitai.byte` would then scan byte by byte until its argument byte is found. In generated code:

* `scan-start: byte(42)` + `type: u1` will be translated into a scan for the byte 42 followed by the `u1` read;
* `scan-end: byte(42)` + `include: true` for a byte array will capture everything up to and including that byte;
* finally, `scan-end: byte(42)` + `include: false` (which is the default) will capture everything up to, but not including, it.
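In Python-flavored pseudocode, the `scan-end: byte(42)` cases might generate something along these lines (my sketch under the semantics above, not actual compiler output):

```python
import io

def read_until_byte(stream, term, include):
    """Sketch of code generated for `scan-end: byte(term)`.

    Reads bytes until `term` is seen and returns the captured bytes,
    with the terminator included iff `include` is true.
    """
    buf = bytearray()
    while True:
        c = stream.read(1)
        if not c:
            raise EOFError("terminator not found")
        if c[0] == term:
            if include:
                buf += c
            return bytes(buf)
        buf += c

stream = io.BytesIO(b"hello\x2aworld")  # 0x2a == 42
print(read_until_byte(stream, 42, include=True))   # b'hello*'
stream.seek(0)
print(read_until_byte(stream, 42, include=False))  # b'hello'
```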