Scanning (seek, search) for attribute start & end positions

Driving scenarios

Normally, binary formats are well-defined and do not require parser to employ trial-and-error tactics to pinpoint exact locations of the data.

However, in practice, in quite a few cases, we need to go to location identified not by some sort of exact address, but by some arbitrary properties of what's around that location. The simplest case of that is "read the string until the null byte appears", which is currently taken care of terminator / consume / include semantics. Much more complicates use cases exist, for example:

Reading Windows "wide strings", which are actually UTF-16LE strings, consisting of 2-byte characters, up until 00 00 group is encountered.
Reverse engineering a certain format, when you don't have any good idea how exactly to address a new section, but you have a hypothesis that certain sequence of bytes serves as a magic which identifies a new section.
JPEG/JFIF parsing requires you to either decode image contents completely, or, alternatively, scan until certain combination of bytes is encountered (#158).
Even more complex cases like #410.
etc

The proposal is to make a unified framework that could be used in both:

As pos: ... substitute in instances, to be executed before parsing of the element to find the starting position
As terminator: ... or size: ... substitute, to determine ending position for the element

Terminology proposal

Let's call it "scanning", as this is actually what would be going on. "Seeking" is a bad choice of word, as typically seek is a API call which moves stream pointer to a certain pinpointed location and does not involve any scanning actions. "Searching" is a more generic term, which does not reflect the fact that this is a pretty expensive operation.

KSY syntax proposal

For instances, one would be able to specify scan-start key with "scan specification" described below. When this instance will be invoked, first thing it will be before doing actual reading, it will invoke scan algorithm to position the stream pointer properly.
For all attributes where you already can specify size or terminator, one would be able to specify scan-end key with "scan specification" as well.

For contents of this "scan specification" would be akin to current "process" specifications, i.e. scan_mode_name(param1, param2, ...). We can start with a few "scan modes" and add more of them later.

Initial ideas for scan modes to implement:

kaitai.byte(0) — reimplement single-byte terminator, scan until a certain byte is encountered (0 in this example)
kaitai.bytes([0x45, 0x4e, 0x44]) — scan until sequence of 3 bytes is encounterd
kaitai.bytes_group(2, [0, 0]) — two-byte terminator working at groups of 2 bytes every time
kaitai.byte_reverse(0) — same as "byte", but scans in reverse direction

Examples of usage

Scans for *** [42, 42, 42] 3-byte sequence, parses u4 after that:

instances:
  foo:
    scan-start: kaitai.bytes([42, 42, 42])
    type: u4

The same as terminator: 0:

seq:
  - id: my_strz
    scan-end: kaitai.byte(0)

Scans until OrbShdr is encountered, catches everything up to that byte sequence as substream and parses it using my_custom_body type:

seq:
  - id: body
    type: my_custom_body
    scan-end: kaitai.bytes([0x4f, 0x72, 0x62, 0x53, 0x68, 0x64, 0x72])

The same, but includes OrbShdr into substream as well:

seq:
  - id: body
    type: my_custom_body
    scan-end: kaitai.bytes([0x4f, 0x72, 0x62, 0x53, 0x68, 0x64, 0x72])
    include: true

For instances, scan-start can be combined with pos. In this case, pos is executed first, scanning starts working after that. This scans from the end of file backwards, searching for first 0x42:

instances:
  foo:
    pos: _io.size
    scan-start: kaitai.byte_reverse(0x42)

Implementation proposal

All scan specifications will internally invoke a certain implementation class, very similar to current custom processing classes implementation. Scan mode names will map to these class names. Special prefix kaitai. would be used to designate scan mode implementations bundled with runtime library.

For Java, for example, this could be:

interface CustomScanner {
    void scanToStart(KaitaiStream io);
    void scanToEnd(KaitaiStream io);
}

Implementation for kaitai.byte might be:

public class Byte implements CustomScanner {
    private int byteToScan;

    public Byte(int byteToScan) {
        this.byteToScan = byteToScan;
    }

    private void scan(KaitaiStream io) {
        while (true) {
            int c = io.readU1();
            if (c == byteToScan) {
                return;
            }
        }
    }

    @Override
    void scanToStart(KaitaiStream io) {
        scan(io);
        io.seek(io.pos() - 1);
    }

    @Override
    void scanToEnd(KaitaiStream io) {
        scan(io);
    }
}

scan-start: byte(42) + type: u1 will be translated into something like:

var scanner = new io.kaitai.struct.scanners.Byte(42);
scanner.scanToEnd(io);
this.foo = io.readU1();

scan-end: byte(42) + include: true for byte array will be something like:

var pos1 = io.pos();
var scanner = new io.kaitai.struct.scanners.Byte(42);
scanner.scanToEnd(io); // due to include: true
var pos2 = io.pos();
io.seek(pos1);
this.foo = io.readBytes(pos2 - pos1);

Finally, scan-end: byte(42) + include: false (which is default) will be something like:

var pos1 = io.pos();
var scanner = new io.kaitai.struct.scanners.Byte(42);
scanner.scanToStart(io); // due to include: false
var pos2 = io.pos();
io.seek(pos1);
this.foo = io.readBytes(pos2 - pos1);

These new properties are in fact doing the job too similar to size and pos, so probably it may make sense to merge them there:

pos: _io.size
scan-start: kaitai.byte_reverse(0x42)

can be

pos: _io.size - kaitai.byte_reverse(0x42)

Scanning op is translated into an offset from the .

arithmetic ops are not commutative:

if a number is in the left side of an arythmetic operation, the stream is seeked to that offset first.
if is in the right, the scanner works first
multiplication and division are allowed too
but right multiplication a*f() is not allowed

left multiplication is transformed into looped addition: (a+f()+b)*x:

for i in range(x):
self._io.seek(self._io.pos()+a)
f(self._io) # seeks internally
self._io.seek(self._io.pos()+b)

An instance

pos: ...
value: kaitai.byte_reverse(0x42)

would just result into an offset of the first 0x42 in the backward direction.

Algo ideas:

compiler parses AST of an expression
compiler does dfs, splits expressions containing noncommutative funcs calls and checks conformance
compiler checks if subexprs having noncommutative calls are multiplied by something, if they are, compiler marks them as ones to be transformed into a loop, by replacing them with a loop object
in each subexpr compiler groups the commutative stuff into objects. So there are only 3 types of nodes remain: groups commutative exprs, noncommutative calls and loops
compiler does DFS again, dumping the transformed AST into source code:
- groups are computed first and then seeked
- calls are just called
- loops are inserted

For value instances with pos: initial pos is remembered, then the function doing the stuff is called, then pos is retrieved, original one is subtracted, the value is stored, the original pos is restored.

So terminator may also be eliminated: size: kaitai.byte(...) will do the trick. It is shorter than terminator, so IMHO should be replaced immediately (the stuff is alpha, breaking changes may be introduced, also we may want a script automatically migrating old specs onto a new lang version, so we wouldn't have to stockpile deprecated features in compiler code base, but do it in migration script).

two-byte terminator working at groups of 2 bytes every time

Do you mean alignment?

I think the modes you have given an example should be hidden behind a single interface: kaitai.bytes. And we need a separate namespace for scanners: we already have kaitai.compress, probably more stuff will be added in future, so we need to have it organized from very beginning, otherwise it will be a mess.

These new properties are in fact doing the job too similar to size and pos, so probabli it may make sense to merge them there:

What I want to emphasize here is that scanning typically involves a lot of reading (trial-and-error). This process eventually moves IO stream position, and this is actually what we want. There's no point in going back and forth like "memorize old position, scan, get & return new position, jump back to old position, realize that we've got new position, jump to the new position".

arithmetic ops are not commutative: ...

This is much much much more complex to implement, has lots of restrictions due to these symbolic ops, potentially leads to more arithmetics, jumps back and forth, etc. With all that, I don't really any upsides for this solution.

Do you mean alignment?

Effectively, yes.

I think the modes you have given an example should be hidden behind a single interface: `kaitai.bytes``. And we need a separate namespace for scanners: we already have kaitai.compress, probably more stuff will be added in future, so we need to have it organized from very beginning, otherwise it will be a mess.

There will be separate namespace, as all these monikers are processed in separate YAML keys, i.e.:

kaitai.foo.bar for scanners will likely translate into something like io.kaitai.struct.scanners.foo.Bar in Java
foo.bar.baz for scanners will be foo.bar.Baz.

For processing routines, it can be:

kaitai.foo.bar => io.kaitai.struct.processing.foo.Bar
foo.bar.bz => foo.bar.Baz

What I want to emphasize here is that scanning typically involves a lot of reading (trial-and-error). This process eventually moves IO stream position, and this is actually what we want. There's no point in going back and forth like "memorize old position, scan, get & return new position, jump back to old position, realize that we've got new position, jump to the new position".

The proposed syntax is just an interface, it just behaves like if the funcs were returning an offset from the current position, if the current position was given by a preceeding part of the expression. The underlying impl may not to compute any expressions matching pos or size explicitly or doing backward seeks and then forward seeks. And in order to do it we may need a subsystem for passing contracts of external runtime functions into the compiler. I mean if it seeks or if it returns a value, how much args does it have, what are their types, ... etc, as proposed in #500.

This is much much much more complex to implement, has lots of restrictions due to these symbolic ops, potentially leads to more arithmetics, jumps back and forth, etc. With all that, I don't really any upsides for this solution.

In fact not all arythmetic ops, but only ones involved into noncommutative functions calls. So part of ops in the expression is commutative, part are not.

There will be separate namespace, as all these monikers are processed in separate YAML keys

Yes, I have thought about it. But

foo.bar.baz for scanners will be foo.bar.Baz.

feels like ah inconsistency. Maybe third-party packages must also contain the needed namespace, and we should add the info where it lies into the KS "header" for opaque funcs?

I agree with the big proposal.

Just a question, what is the point to have two different implementation between Byte and Bytes?

We can merge those things. Following the example of @GreyCat in Java, a possible implementation in Python would be

class Scanner:

    def __init__(self, pattern):
        self._pattern = pattern
        self._len = len(pattern)
        self._border = self._kmppre()

    def _kmppre(self):
        border = [-1]
        match_index = -1
        for i in range(0, self.len):
            while match_index >= 0 and self.patter[i] != self.patter[match_index]:
                match_index = border[match_index]
            match_index += 1
            border.append(match_index)
        return border

    def _scan(self, io, patter):
        r = b''
        match_index = 0 # for kmp search
        while True:
            c = self._io.read(1)
            while match_index >= 0 and c != self.patter[match_index]:
                match_index = self.border[match_index]
            match_index += 1
            if match_index == self._len: # It's a match
                return

    def scanToStart(self, io):
        _scan(io)
        io.seek(io.pos() - self.len)

    def scanToEnd(self, io):
        _scan(io)

The complexity here is O(M) where M is the amount of bytes between the initial position of IO and the end position of it.

Knuth-Morris-Pratt algorithm

The implementation that I used, its from my team algorithmic competition notebook icpc-notebook-vasito

kaitai.bytes_group(2, [0, 0]) — two-byte terminator working at groups of 2 bytes every time

I don't know what that mean or what is the purpose. I read a little bit but I couldn't figure out the usage.

Just synced with @tinrodriguez8 in the chat. Here's the broad plan: we start with simple and very basic stuff. Let's start with, say, 2 or 3 languages (Python and Java sounds like a good plan), and eventually add more as it grows.

Before you start

Please check that:

You're able to build the compiler
You're able to build test formats and run tests for at least some languages (say, Python and Java)

Tests

We'll need to create some tests and everything needed for it to run. Tests generally are to be put in tests repo. Given that this whole thing is somewhat similar to custom processing things, it's best to look up relevant tests and its implementations, e.g. process_custom.ksy test

In general, for this to run, we'll need:

.ksy format file (or several files) in formats dir
Spec file (one per .ksy format) that will run that format and make some checks to ensure that we're getting what we're expecting to get; specs are per-language, so these can go with spec/python and spec/java for a start.

Ideally, we'd like to move away from writing specs manually for every single language, and there's a thing called KST translator, but I'm not sure at this stage it will be possible to make use of it, given that these specs would likely require loading/importing of external stuff. Or may be it will work?

Finally, a custom scanner class implementation; for Python, these would likely go into [spec/python/extra] for Java, it will be somewhere near that spec/java/src/io/kaitai/struct/testformats/MyCustomFx.java

Python custom scanner won't probably require anything extra, but Java custom scanner will probably need to implement a certain formal interface. If we're happy with proposed interface, then we can just start with putting it into java runtime repo, along with CustomDecoder (which is the interface for custom processing routines).

For the contents of test custom scanner, I'd start, again, with something very basic and simple. For example, scanner that reproduces terminator behavior, may be even with hard-coded byte terminator value.

After that we'll have the tests which would obviously fail at first due to lack of support in the compiler. Then we can start working on the compiler.

Compiler

It will need updates in 2 places:

Format parsing to allow scan-start and scan-end keys in seq attributes / instances.
Code generation to treat these as per plan in the issue and generate relevant pieces of code.

Format parsing is generally done in shared/src/main/scala/io/kaitai/struct/format, namely AttrSpec / InstanceSpec.

Code generation update would be probably the hardest of all that, as it involves several different files in the code gen pipeline. Probably it would be easier to reach to me at that stage, I can explain in more detail.

For visibility: @tinrodriguez8 started a PR here: https://github.com/kaitai-io/kaitai_struct_compiler/pull/166

kaitai-io / kaitai_struct