Serialization - Githubissues

KOLANICH commented 8 years ago

You have deserialization. How about serialization?

GreyCat commented 8 years ago

We've slowly discussing this issue, but it's obviously not as easy as it looks like. Probably it will be a major feature for something like second major version (v2.0 or something like that), so it'll eventually be there, but don't hold your breath for it.

KOLANICH commented 8 years ago

We've slowly discussing this issue, but it's obviously not as easy as it looks like.

What are the troubles?

GreyCat commented 8 years ago

It is pretty easy for simple fixed (C-style) structures. However, as soon as you start using instances that bind certain values to offsets in the stream, it becomes much more complex. A very simple example:

seq:
  - id: num_files
    type: u4
  - id: files
    type: file
    repeat: expr
    repeat-expr: num_files
types:
  file:
    seq:
      - id: file_ofs
        type: u4
      - id: file_size
        type: u4
    instances:
      body:
        pos: file_ofs
        size: file_size

This .ksy describes very simple file index structure which consists of (num_files) of (file_ofs, file_size) pairs. Each pair describes a "file", which can be accessed by an index using something like:

file_contents = archive.files[42].body

However, adding a file to such archive is a challenge. Ideally, one would want to do something like:

archive.files[42].body = file_contents

This should set relevant bound file_size automatically to accomodate length of assigned file_contents and, what's much more complex, assign file_ofs somehow. This is not an easy task: KS has no innate knowledge on how to manage unmapped space in the stream, is it limited or not, should you find some unused spot and reuse it or just expand the stream and effectively append file_contents to its end, etc.

What's even harder is that in many cases (i.e. archive files, file systems, etc) you don't want rewrite whole file, but just do some changes (i.e. appending new file to the end of archive, or reusing some pre-reserved padding, or something like that).

Another, may be even simpler example: if you read PNGs, you don't care about checksums. When you write PNGs, you have to generate proper checksums for every block — thus we need block checksumming algorithms and some way to bind it to block.

KOLANICH commented 8 years ago

what's the problem to deserialize them, edit, serialize back and then write?

GreyCat commented 8 years ago

What exactly do you refer as "them"? File archive example? PNG checksums?

KOLANICH commented 8 years ago

What exactly do you refer as "them"? File archive example? PNG checksums?

Almost anything, including files and checksums. In your example repeat-expr: num_files and id: num_files type: u4 means that "files has length num_files". When you deserialize, you read num_files, create files array of length num_files and read there num_files structures. When you serealize you need the inverse: update "num_files" with length and then write. What if "num_files" is complex expression we cannot derive automatically? We shift responsibility to a user and to be able serialize he must to provide both forward and inverse mappings for such expressions, except simple cases when they can be derived automatically.

Now some ideas how to implement this. When processing a KS file it should build a graph of object dependencies, in your case num_files <-> files[].length. Then we apply rules. In this case "scalar <-> array.length" -> "scalar <- array.length" which results in num_files <- files[].length. We can say that every field has some absolute and actual strengh and that edges can go only in direction of non-increasing actual strengh and that actual strengh of node is max(max(actual strengh of neighbours), absolute strengh). This way we transform the graph into tree and reduce the number of free variables in the case of equal strength. When you need to serialize you process the description member by member, launch lazy evaluation according to the graph and write them.

If you want to minimize count of writing operations you store both versions in memory, create diff, and trynot to touch parts not touched by diff.

GreyCat commented 8 years ago

What if "num_files" is complex expression we cannot derive automatically?

Exactly my thoughts. And actually even that requires us to create some sort of inverse derivation engine. For example, if we have a binding:

    - id: body_size
      type: u4
    - id: body
      size: body_size * 4 + 2

and we update body, we should update body_size, assigning (body.size() - 2) / 4 there. If there are some irreversible bindings (i.e. modulo, hashes, etc) - then, at the very least, we need to detect that situation and allow some extra syntax to make it possible for user to set these inverse dependencies manually.

We can say that every field has some absolute and actual strengh and that edges can go only in direction of non-increasing actual strengh and that actual strengh of node is max(max(actual strengh of neighbours), absolute strengh).

I'm sorry, but I fail to understand that. Could you rephrase it, provide some examples, or explain why it is helpful, i.e. which problem we're trying to solve here?

If you want to minimize count of writing operations you store both versions in memory, create diff, and trynot to touch parts not touched by diff.

The point is that we need to add some extra syntax (or API, or something) to make it possible to do custom space allocation or stuff like that. For example, if you're editing a file system, you can't just always append stuff to the end of it: block device usually has a finite capacity and sooner or later you'll exhaust that and your choice would be to pick and reuse some free blocks in the middle of the block device.

You seem to have good ideas on implementation, would you want to join and help implementing it? Basically anything will help, i.e. .ksy syntax ideas, API ideas, tests for serialization, compiler ideas & code, etc, etc.

KOLANICH commented 8 years ago

And actually even that requires us to create some sort of inverse derivation engine.

Or take some library. Symolic evaluation is rather studied area of math with lots of paper written. I haven't studied it, so it is possible tha some of the ideas I've mentioned were discussed in them.

I'm sorry, but I fail to understand that. Could you rephrase it, provide some examples, or explain why it is helpful, i.e. which problem we're trying to solve here?

Let we have an array and a size_t scalar - number of elements in array. What is main in the couple? What defines it entirely? Data in the array do, number is needed only to allow us to read the array correctly. For example in c-strings you don't need number because the convention is to terminate by \0. Add more data to array and you'll have to increase array capacity which means you will have to increase number in order to read it (and everything after it) correctly. So let array.length have strentgh 2, and scalar 1. Then you have link "files[].length <-> num_files". Let we have another array "filenames[]" and its capacity num_files-2, let we also have foo and bar fields with something. Connections : filenames[].length <-> num_files, num_files <-> foo, bar <-> foo and files[].length <-> foo.

Now we start processing. Let show strenghs in brackets, absolute first. 1 files[].length (2,0) 2 files[].length files[].length (2,2) itself is processed, go to edges 3 files[].length (2,2) <-> num_files(1,0) 4 files[].length (2,2) <-> num_files(1,1) 2>1 so removing reverse edge and setting to 2 5 files[].length (2,2) -> num_files(1,2) 6 files[].length (2,2) <-> foo(1,0) 7 files[].length (2,2) <-> foo(1,1) 8 files[].length (2,2) -> foo(1,1) 9 files[].length (2,2) -> foo(1,2) 10 num_files(1,2) <-> filenames.length (2,0) 11 num_files(1,2) <-> filenames.length (2,2) equal, this saves both edges. 12 num_files (1,2) <-> foo(1,2) 13 foo(1,2) <-> bar(1,0) 14 foo(1,2) <-> bar(1,1) 15 foo(1,2) -> bar(1,1) 16 foo(1,2) -> bar(1,2)

then we need to serialize and read config 1 first goes the num_files it has incoming edge files.length. it is only incoming edge. 2 files[].length is already evaluated, take its value, eval. num_files. 3 it has 2 bidi edges, filenames[].length and foo evaluate them the same way and check if they match to value of filenum. 4 serialize and write 5 continue to the rest of fields.

explain why it is helpful, i.e. which problem we're trying to solve here?

it determines the order we should evaluate expressions and what expressions depends on what and helps to find conflicts.

and I think you should really read something about symbolic evaluation (I haven't, ideas above are just adhoc thoughts, maybe there are better approaches to it).

You seem to have good ideas on implementation, would you want to join and help implementing it?

Sorry, no. Maybe i'll send you some ideas or code later, but I can't be a permanent member of this project.

Basically anything will help, i.e. .ksy syntax ideas, API ideas, tests for serialization, compiler ideas & code, etc, etc.

OK.

GreyCat commented 8 years ago

Add more data to array and you'll have to increase array capacity which means you will have to increase number in order to read it (and everything after it) correctly. So let array.length have strentgh 1, and scalar 0.

Ok, then what shall we do in case of the following:

seq:
  - id: num_objs
    type: u4
  - id: headers
    type: header
    repeat: expr
    repeat-expr: num_objs
  - id: footers
    type: footer
    repeat: expr
    repeat-expr: num_objs

This implies that headers[] and footers[] shall always have the same number of objects. How are the strengths are assigned in this case and how do we enforce that they have equal number of objects?

KOLANICH commented 8 years ago

1 see above example 2 throwing an exception, of course

ps graph evaluation is done by KS compiler runtume checks are done by generated code

and I fixed priorities to match the lines I made supposing strenghs were 1 and 2

GreyCat commented 7 years ago

I've commited very basic PoC code that demonstrated seq serialization in Java. It is available in distinct "serialization" branches. To test it, one'll need:

Obviously, only a few tests were converted, and, to be frank, now only a very basic set of types is supported. Even strings are not implemented right now. Testing is very basic too, one can run ./run-java and see that two packages are ran: spec is "normal", reading tests, and specwrite are writing tests.

I'd really love to hear any opinions on the API (both runtime & generated), Java implementation (it really pain in the ass, as ByteBuffer does not grow, and you have to preallocate the array, or probably reimplement everything twice with something that grows), etc.

GreyCat commented 7 years ago

Strings work, repetitions work, enums work, things are slowly coming to reality :)

GreyCat commented 7 years ago

Serialization progresses slowly. Basic in-stream user types and processing on byte types work, various fancy byte type stuff like terminator and pad-right works, etc, etc.

I've got an idea that might be very simple to implement.

Step 1: manual checks

Generate _read, _write and _check methods. _check runs all the internal format consistency checks to ensure that stuff that will be written will be read back properly. For example:

seq:
  - id: len_of_1
    type: u2
  - id: str1
    type: str
    size: len_of_1 * 2 + 1

would generate:

    public void _read() {
        this.lenOf1 = this._io.readU2le();
        this.str1 = new String(this._io.readBytes(lenOf1() * 2 + 1), Charset.forName("ASCII"));
    }
    public void _write() {
        this._io.writeU2le(this.lenOf1);
        this._io.writeBytes((this.str1).getBytes(Charset.forName("ASCII")));
    }
    public void _check() {
        if (this.str1.bytes().size() != lenOf1() * 2 + 1)
            throw new FormatConsistencyError("str1 size", this.str1.bytes().size(), lenOf1() + 3);
    }

To use this properly, one must manually set both lenOf1 and str1:

r.setStr1("abcde");
r.setLenOf1(2);
r._check(); // should pass, so we're clean to take off
r._write(); // should write consistent data that's guaranteed to be readable back

Step 2: dependent variables

We declare some fields as "dependent", and mark them up in ksy:

seq:
  - id: len_of_1
    type: u2
    dependent: (str1.to_b("ASCII").size - 1) / 2
  - id: str1
    type: str
    size: len_of_1 * 2 + 1

This means that len1 becomes a read-only variable, setLenOf1 setter won't be generated. Instead, it would generate slightly different _write:

    public void _write() {
        this.lenOf1 = (str1().getBytes(Charset.forName("ASCII")).size() - 1) / 2;
        this._io.writeU2le(this.lenOf1);
        this._io.writeBytes((this.str1).getBytes(Charset.forName("ASCII")));
    }

Obviously, using this boils down to single r.setStr1("abcde");.

Any comments / ideas / etc?

KOLANICH commented 7 years ago

Any comments / ideas / etc?

1) move evaluation of expression into a separate method 2) use check and write in generic form without explicit expressions in it, but with a method from 1) 3) Why not to use value instead of dependent? 4) why do we use .to_b("ASCII").size? String encoding is known, so why not just .size? 5) this dependent are ugly, it'd be nice to eliminate them, but we need a decision what kind of expressions should be resolved automatically. I guess linear ones should be enough. Another question is how to solve them. There is exp4j, for parsing and storage, but it'll require some code to build a simple symbolic gaussian elimination solver over it. If we wanna write in python, here are docs for lib wrapping multiple SMT solvers: https://github.com/angr/angr-doc/blob/master/docs/claripy.md

GreyCat commented 7 years ago

move evaluation of expression into a separate method

use check and write in generic form without explicit expressions in it, but with a method from 1)

You mean, something like that?

    public void _read() {
        this.lenOf1 = this._io.readU2le();
        this.str1 = new String(this._io.readBytes(_sizeStr1()), Charset.forName("ASCII"));
    }
    public void _write() {
        this._io.writeU2le(this.lenOf1);
        this._io.writeBytes((this.str1).getBytes(Charset.forName("ASCII")));
    }
    public void _check() {
        if (this.str1.bytes().size() != _sizeStr1())
            throw new FormatConsistencyError("str1 size", this.str1.bytes().size(), lenOf1() + 3);
    }
    public int _sizeStr1() {
        return lenOf1() * 2 + 1;
    }

Does it bring any benefits? It won't really simplify generation (probably on the contrary), and we'll need to invent tons of names for all that size, if, process, etc expressions.

Why not to use value instead of dependent?

Naming is, of course, still to be discussed. One argument against value I have is that value is already used for reading in value instances,

why do we use .to_b("ASCII").size? String encoding is known, so why not just .size?

Using size on a string will give out a length of string in characters. If you'll put some non-ASCII into that string, proper .to_b("ASCII").size conversion will give you an exception, while just taking "number of bytes = number of characters" will give you corrupted data.

I guess we could try to do some sort of .bytesize method for strings taken verbatim from format definition, where "encoding is known", to save retyping the encoding name. However, it still won't work on modified strings, i.e. it's possible to implement str1.bytesize and it won't work with (str1 + 'x').bytesize (as the latter is CalcStrType, which lacks any source encoding info by design).

this dependent are ugly, it'd be nice to eliminate them

First of all, you can't really eliminate them completely in any case. Some functions are just irreversible, and in some cases you'll have more free variables than constraints. For example:

seq:
  - id: a
    type: u1
  - id: b
    type: u1
  - id: my_str
    type: str
    size: a + b

Even if you have byte size of my_str, you can't set both a and b automatically. Reversing stuff automatically would be more like a syntactic sugar feature, just to save from typing boring stuff where it's possible. In fact, I heavily suspect that we'll cover 95% of cases with very crude logic like size: a => a = attr.size.

KOLANICH commented 7 years ago

value is already used for reading in value instances,

Yes, that's why I've chosen it. It has the same semantic: deriving a value of a struct member from other ones makes it some kind of an instance, but tied to offset.

we could try to do some sort of .bytesize

size attribute in a .ksy means size in bytes. So I see no reason it to mean anything else in expression language.

First of all, you can't really eliminate them completely in any case.

Yes, we have already discussed this.

Even if you have byte size of my_str, you can't set both a and b automatically.

Since they are of equal strentgh. This should cause ksc to throw a warning about undefined behavior.

Reversing stuff automatically would be more like a syntactic sugar feature, just to save from typing boring stuff where it's possible.

Not only. One expression can contradict another one, it can cause nasty errors ot can be used as a backdoor. Another side is that we (humans) don't know exact expressions ahead of time. So I propose the following: 1 to have a syntax to provide a manual expression 2 missing expressions are generated by compiler and inserted into a ksy with another type of expressions 3 human examines ksy output for errors and malicious code 4 verified output is used to generate actual code 5 we would sometimes want to

regenerate all non-manual expressions in a ksy
check that expressions don't contradict each other

In fact, I heavily suspect that we'll cover 95% of cases with very crude logic like size: a => a = attr.size.

Maybe.

ixe013 commented 6 years ago

I would like to play with the serialization branch...

Is there a pre-built compiler and Java runtime of the serialization branch available? If not, that's ok, I'll setup a Scala build environment.

GreyCat commented 6 years ago

@ixe013 There are no pre-built packages for such experimental branches, but it's relatively straightforward to try it yourself. You don't really need anything beyond sbt — it would download all the required stuff (including Scala compiler and libraries of the versions needed) automatically. See dev docs — it's literally one command to run, like sbt compilerJVM/universal:packageBin — et voila.

glenn-teillet commented 6 years ago

I was able to build a Read-write version of my ksy, however the generaed code does not compile as the Reader reads from Int and stores them in Enums as Long (Int->Long) and the _writes() try to write the Enum Long as Int (Long->Int cause error).

I have these errors: The method writeBitsInt(int, long) is undefined for the type KaitaiStream The method writeU1(int) in the type KaitaiStream is not applicable for the arguments (long) The method writeU2be(int) in the type KaitaiStream is not applicable for the arguments (long)

glenn-teillet commented 6 years ago

I also have errors in every constructor, where it tries to assign _parent received as a KaitaiStruct in the constructor signature to the _parent without casting to the more specific type.

Type mismatch: cannot convert from KaitaiStruct to KaitaiStruct.ReadWrite

GreyCat commented 6 years ago

@glenn-teillet There's still lots of work to do, it's nowhere near production-ready. Before trying to compile read-life ksy files with lots of complicate details, probably we should spend some time getting read-write test infrastructure to work, and, probably, porting that older serialization branch code to modern codebase.

KOLANICH commented 6 years ago

Some suggested definitions for the serialization spec:

Serialization. Let we have a binary format f, a set of sequences of bits FS, its subset of sequences of bytes making a valid format FS_f, a set of object-oriented Turing-complete programming languages PL, a set of valid Kaitai Struct definitions KSY, including the subset of definitions for the format f KSY_f, and the KS compiler KSC : PL × KSY → (PSC, SSC), where PSC: FS → O is a set of a parsing programs, SSC: O → FS is a set of serializing programs and ssc_{ksy_f}(psc_{ksy_f}(s)) ≡ s ∀s ∈ FS_f, ∀ksy_f ∈ KSY_f, ∀pl ∈ PL, KSC(pl, ksy_f)=(psc_{ksy_f}, ssc_{ksy_f}). To be practically usable there should be a way to create an o= psc_{ksy_f}(s) programmatically without doing any parsing of actual bit string s.

Serialization is the part of KSC producing serialization programs.

Internal representation are the objects created by KSC-generated code in program's runtime.
Finalization is a process of transforming the internal representation that way, that only trivial transformations are left to be done to create a serialized structure.
An expression is the mapping of a subset of internal representation variables called expression arguments to another variable called expression output.
A reverse expression is an expression mapping an original expression's output back to its arguments with respect to the current state of the internal representation (including the arguments).
Trivial transformations are the ones not involving computing any expressions. Examples of trivial transformations are endianess conversions and bit shifts induced by using bit-sized fields.

Comments?

FSharpCSharp commented 6 years ago

This is a very interesting subject. What is the current status of this? Will this be pursued in the future? It would be very good if you could use serialization. Especially for binary formats based on bit's this would be a great relief. I hope that the issue will be pursued further and that we can also benefit from it in the future.

GreyCat commented 6 years ago

No updates here, latest comments like https://github.com/kaitai-io/kaitai_struct/issues/27#issuecomment-334616207 still reflect current state of affairs. Help is welcome :)

cryptocode commented 6 years ago

Definitely a complex issue. For my part, an implementation that generates write methods for each field with a simple type (uint32_t, etc for C++) to a stream would go a long way - I could do the object graph traversal and "finer parts" myself.

GreyCat commented 6 years ago

@cryptocode If you're interested in C++ support, then I'd suggest to start with implementing related C++ runtime methods. See this Java runtime branch for generally proposed API.

mbrudka commented 6 years ago

@KOLANICH Did you consider a relaxed definition for the serialization: psc_{ksy_f}(ssc_{ksy_f}(o)) ≡ o ∀o ∈ O_f, ∀ksy_f ∈ KSY_f, where O_f is a space of structure values valid with respect to format f? This definition enables KSC to ignore internal structure relations and makes KSC user responsible to ensure that ∀o ∈ O_f. In construct there is a field Computed which enables to compute some values during serialization using context. Unfortunatelly, this is Python specific and cannot be directly implemented in any PL. BTW. Why do you require PL to be Turing-complete?

KOLANICH commented 6 years ago

where O_f is a space of structure values valid with respect to format f?

I don't quite understand what you mean under this. Do you mean O_f be a set of internal representations of a format without self-consistency? Further in this message I assume that you mean this.

There are at least 2 issues with your definition.

1

psc_{ksy_f}(ssc_{ksy_f}(o)) ≡ o ∀o ∈ O_f, ∀ksy_f ∈ KSY_f

This relaxed definition allows a degenerate case where the serializer dumps a serialized internal representation into any other format except the original one and the parser checks if the binary string is a serialized internal representation and deserializes it if it is, otherwise parses. Or smuggles some additional info in the fields without an id. In this case the result will not belong to FS_f and that's why be unreadable by external programs. We surely wanna avoid it. So a binary string should be primary and the internal representation should be secondary.

2

This definition enables KSC to ignore internal structure relations and makes KSC user responsible to ensure that ∀o ∈ O_f.

In fact the goal is to make KSC be responsible for enforcement that the output belongs to FS_f and make the serializer derive and fix the internal representation to be self-consistent. This should allow automatic processing without any human intervention. We just feed ksys and blobs and get the IRs, we process IRs fully automatically, then we dump IR and get a correct self-cinsistent file which can be processed with other programs.

In construct there is a field Computed which enables to compute some values during serialization using context. Unfortunatelly, this is Python specific and cannot be directly implemented in any PL. BTW.

I prefer the reverse expressions for computing when serializing to be derived by the compiler from all other expressions. I have used some computer algebra systems (including Open Source ones, but Wolfram Mathematica does far better than anything I saw in Open Source) and saw them being able to find analytic solutions for equations. I guess the expressions we are dealing with in ksys are far simpler and in most cases can be derived automatically. But if they cannot, or the derived ones are inefficient, we need some syntax to explicitly provide them.

Why do you require PL to be Turing-complete?

Because non-turing-complete languages lack loops/recursion which are needed to process structures with repeat-*, so most of structs are useless for them (and I guess that KSY may be Turing-complete itself but I have no proof for it). Most of programming languages are Turing-complete too. So just for simplicity.

mbrudka commented 6 years ago

@KOLANICH The rest of your reply suggest that you understand what is O_f. I am really glad you see the differences between definition as some may think they are equivalent. Let me then not define it precisely, but please do it for the sake of clarity and completeness if you find my proposition interesting. If about issues you mentioned:

In general your assumption about FS_f is still at force, thus the definition does not allow to dump structures into arbitrary strings. Certainly, the definition allows to create frames which are not consistent and may be not recognised by other applications eg. frame with a bad checksum. This is why we assume that users is responsible to ensure that ∀o ∈ O_f.
The idea to enforce the semantic consistency on KSC level though appealing is rather an illusion. The consistency is the responsibility of the business logic, but not parsers or builders. It is better to avoid it in KS as it will delay the serialisation feature, then will be hard for users and finally will be continuously fixed to meet new requirements. We just need to accept that some frames will not be consistent..

I am not exactly sure what you mean by reverse expressions - is this reverse polish notation? But I hope that the serialisation will not depend on the implementation of analytic solution solvers as they are really much bigger piece of software that KS.

In general you are probably right to require PL to be Turing complete. Nevertheless the assumption is not necessary as long as you do not use it in any formal proof or in the implementation.

Concluding: increasing responsibilities lead many software into a dead-end street. Extensive consistency validation is such path in my opinion. KS may avoid it by sticking directly for transforming bits into structures and (hopefully soon) structures into bits.

KOLANICH commented 6 years ago

I am not exactly sure what you mean by reverse expressions

A reverse expression is an expression mapping an original expression's output back to its arguments with respect to the current state of the internal representation (including the arguments).

The consistency is the responsibility of the business logic, but not parsers or builders.

Then we need this feature be disableable. I guess in both runtime and compile-time separately: in compile time to cut costs on building reverse expressions and to build smaller code, in runtime to use the code with reverse expressions in more complex cases like fuzzing.

It is better to avoid it in KS as it will delay the serialisation feature, then will be hard for users and finally will be continuously fixed to meet new requirements.

Have you read the thread? Our roadmap to the serialization is to implement and debug manual reverse expressions first. When we have it working it would be immediately useful - anyone would be able to start using it, but it may be troublesome: there may be lots of variables in formats and it's very labour-consuming to construct and maintain them for every field and instance. The next step, the most tricky one, is to implement autoderivation of them so we wouldn't have to write and maintain them explicitly in most cases.

But I hope that the serialisation will not depend on the implementation of analytic solution solvers as they are really much bigger piece of software that KS.

z3 is not very big. And I hope that all the heavy lifting can be done in compilation phase.

Nevertheless the assumption is not necessary as long as you do not use it in any formal proof or in the implementation.

Yes, we have not stated that parsers, serializers and ksy itself exist ;) But if we need them exist for every valid ksy we have to assume that the language they are transpiled to is Turing-complete.

mbrudka commented 6 years ago

Have you read the thread

I did not assumed that the thread discussion on ideas is a roadmap. But let it be, so good to hear that you want to restrict yourself to manual reverse expressions.

In my opinion even this feature is not necessary to deploy serialisation. The developer will use the code from KS to built its own application with its own logic above pure parsers/builders. Practice suggest that this is a better place to implement various validators, computables or adapters. Construct imposed some conventions to express this, but it heavily relied on Python specifics which are not available in all KS target languages. Personally I like ascetic protlr which just ignores expressions and provide industry ready solution for protocol developers.

I used some similar features in Construct for years as well as observed how my developers were using it. And it was quite problematic. Not because of the implementation - it was sound, but the architectural decision to do this on parsers level. The order of evaluations, the validity of the context, the scope of visibility for embedded structures led to bugs and excessive debugging which is particularly unpleasant for declarative languages. The result: after some hand-on experience the developers ignored this stuff and implemented computables on the application level eg. checksums.

The implementation of expressions expressible in every KS supported language PL is also not a piece of cake even PL is T-complete :). Consider for example the abstraction layer for expression representation which will facilitate code generation for Ruby. It is the additional complexity on language KSY bindings.

Take all this as a free advice to avoid unnecessary complexities and implementation. Use or ignore it. But I hope that your decisions which will bring builders sooner than later. KS has a potent, but its applications without builders are very limited.

Yes, we have not stated that parsers, serializers and ksy itself exist ;)

Do they? But wait, I see you have the sketch of a theorem: if PL is T-complete, then for every valid KSY a parser and a builder may be generated which follow some relations... Time for show the proof to prove the usability of T-completness :).

HTH

KOLANICH commented 6 years ago

But let it be, so good to hear that you want to restrict yourself to manual reverse expressions.

Only as a point on the way to fully-automated serialization.

Practice suggest that this is a better place to implement various validators, computables or adapters.

It's a lot of programmer's effort. The goal of this project is to automate this effort.

The order of evaluations, the validity of the context, the scope of visibility for embedded structures led to bugs and excessive debugging which is particularly unpleasant for declarative languages. The result: after some hand-on experience the developers ignored this stuff and implemented computables on the application level eg. checksums.

Could you give more details, such as examples and your thoughts about the causes of troubles.

KS has a potent, but its applications without builders are very limited.

We I guess we would have "dumb" (without automatic consistency enforcement) builders anyway.

duncanfinney commented 5 years ago

+1 for this. Even a hybrid approach would be cool. Maybe it could generate code for the simple C structs and then you have to manually implement it for more complicated things.

ckaran commented 5 years ago

I think that @duncanfinney comment about a hybrid approach is probably going to be the best one to take; Kaitai Struct supports a number of languages already, and if all goes well, it will support many more in the future, not all of which will have the ability to even express some of the more complicated concepts suggested here. Instead of trying to make Kaitai Struct support all of these concepts, produce code that defers to the user; if someone wants a computed value, then the generated code will call a function that computes that value (e.g., via a new operator, 'call_external'). If that function doesn't exist at compile/interpretation time, then the compiler/interpreter will do whatever is appropriate for that language (compiler error, crash, log an error, raise an exception, etc.) Don't try to solve all problems at once.

Separately, I noticed earlier in the thread that there was a question of how to serialize objects that are dependent on one another. Forgive me if I'm just not getting it, but isn't this just a case of topologically sorting the object graph? Or is there something more complicated happening?

KOLANICH commented 5 years ago

but isn't this just a case of topologically sorting the object graph?

Not quite, the resulting graph may still contain cycles, but if the result is self-consistent, it is valid, and we need to verify that. For example, if we have the following pseudo-ksy

seq:
   - id: a
     type: u1
   - id: b
     type: u1
     assert: _ == a

, it means that if we set a, then b must be set automatically to the same value, and if we set b, then a must be set automatically to the same value, and if a user sets both variables, the code generated must ensure that they are set to the same value. And we wanna do as much as possible in compile time, C++ static_assert may be useful.

And we really want to derive the stuff automatically, it would be infeasible to explicitly write formulas for every combination of variables a user can change simultaneously.

And I'm pretty sure that automatic derivation is possible, because optimizing compilers and computer algebra systems do the similar stuff and some of them are free software.

mbrudka commented 5 years ago

@ckaran Good to read you agree with hybrid approach rather that full automated solution at once.

@KOLANICH Your example somehow illustrates our old discussion. I'd not recommend to generate the code which enforces a and b to be equal. Your idea to just put static_assert (for C++) during serializing of b is the right and simple way. Let's trust that user made some efforts to feed Kaitai with good data and generate the failsafe code only when it is so easy as above. This is exactly what I intended when facetiously proposed the relaxed definition for the serialization.

Having said that, I have to admit sadly that there is still no serialization in Kaitai :(

ckaran commented 5 years ago

@KOLANICH I agree with you that optimizing compilers and computer algebra systems are able to find a wider set of self-consistent descriptions, and I believe that having that ability in Kaitai Struct is a worthwhile goal, but I don't think that it is one that can be accomplished within Kaitai Struct exclusively unless Kaitai Struct's language is extended to be fully Turing complete. Instead, I think that Kaitai Struct should follow the approach that Python Construct has taken, making validation of a large number of use cases easy, while punting the more difficult problems like you've described back to the end user who will need to write validation code in their language of choice. As problem areas are discovered, Kaitai Struct can be extended to cover those areas. This prevents paralysis from waiting for the perfect solution.

KOLANICH commented 5 years ago

@ckaran, we are not waiting for a perfect solution, we are waiting for a developer who can, has time and will to implement even the very basic solution which just writes the stuff back by the offsets it was written without any checks and relocations - we need it anyway, it is lower level building block required for any better serialization solution, though I ask to keep in mind the goal while designing the stuff. I currently don't have time to implement it, and, I guess, neither has GreyCat.

mbrudka commented 5 years ago

@ckaran Not sure it there is a sense to make declarative ksy a Turing complete language. It is better to follow Construct approach. @KOLANICH It's a pity, that my implementation cost is too high - I'd have to learn scala, then understand kaitai internals. Two months of preparation for one month of coding.

ckaran commented 5 years ago

@mbrudka I think that there was a misunderstanding; I do not want ksy to be Turing Complete, I just meant that if you want to cover every possible corner case from within Kaitai Struct (KS), then you'll need a Turing-complete language. I personally prefer the approach taken by python Construct, where the simple/common stuff is handled within Construct, and anything more complicated is handled at a higher level.

@KOLANICH I understand about finding developers for the project! Like you, I only have time to dream big right now. That said, I do have a little time to think up ideas, and maybe start planning things out. Are you willing to work on general plans for what a low-level implementation would look like?

My initial thought is to follow the serialization methods outlined in the YAML spec. We can use the ideas for the representation graph and the serialization tree pretty much directly.

The representation graph gives us the lowest-level we need to support; at this level, we assume that the representation graph is correct. We also only need to support the most basic data types in the union of the recommended YAML schema and ksy.

Transforming this to a serialization tree is where I see the hard-work happening; you'll want the node representing the number of elements in a following sequence to occur before the array does, but the checksum over the entire stream to be calculated and stored at the end. Doing this in a portable manner implies to me that you need to extend ksy so that you can mark an attribute as having to occur before or after some other attribute, which will allow you to build a serialization tree in the correct order. This is also where you might have custom code hooks that you call out to; e.g., the value of the checksum node is over some portion of the nodes directly ahead of the checksum node.

Once you have the serialization tree, the process becomes very mechanical, just walking all the nodes in the tree in the correct order, and writing their contents out.

Thoughts, comments, critiques?

S-o-T commented 5 years ago

Any updates on the topic? Is there realization for minimal case viable to generate C++/STL code?

GreyCat commented 5 years ago

None that I know of.

amrnablus commented 5 years ago

So, I skimmed through the discussion and it seems everyone is making very good points. Undoubtedly, this feature/project is very complex and it needs to be broken down into iterations, here are a few suggestions on how to proceed:

We can ignore all sorts of validation in the initial phase. If the client developer needs to validate the serialized implementation they can convert the resulting binary data back to a kaitai struct object and do the validation externally.

We can initially delegate some of the work to the client developer. For instance, in this case:


meta:
id: test
file-extension: bin
seq:
- id: num_files
type: u1
- id: file_entry
type: u1
repeat: expr
repeat-expr: 12

The developer can do something like:

test.file_entry.append(file_object) test.num_files += 1



And then we can gradually automate some of those functionalities, that would be both easier to implement and can help anyone with simple requirements to use kaitai for serialization.

Feedback is welcome

GreyCat commented 5 years ago

@amrnablus Just to be clear: it's not like we lack clarity on how to proceed, the question is that it needs to be done.

From a conceptual point, what you suggested here is basically exactly what's been proposed and, to some extent, implemented in that Java PoC branch. Implementing _write for attr sequences or implementing write for instances is relatively simple, and basic (manual) validation is actually pretty simple too: it's just a matter of checking certain things, a user will still have to do what you've suggested manually even with the validation, i.e.:

test.file_entry.append(file_object)
test.num_files += 1

The question of implementing this basically boils down to a few things:

Updating runtimes (relatively easy)
Proposing API changes (easy, but needs to be taken care of for every target language separately)
Testing it — we'll need to build a whole new testing infrastructure for writing/validation, which just does not exists now. Using KST (https://github.com/kaitai-io/kaitai_struct/issues/253) will make implementing this easier, but it's still a massive chunk of work (medium-hard).
Implementing write seq / write parse instance support (medium).
Implementing manual validation (medium).
Implementing autocorrection (hard to super hard).

KOLANICH commented 5 years ago

BTW, in my repo there is some code for python runtime that can be repurposed easily for implementing writing numbers.

amrnablus commented 5 years ago

Ah thanks, let me take a look at the progress so far and see if I can contribute.

dullbananas commented 5 years ago

How much progress has been made on this feature

fudgepop01 commented 5 years ago

I think there's something incredibly interesting discussed in this talk [[here] (https://youtu.be/AdNJ3fydeao)] that gives me some ideas that (may or may not have) already been brought to the table

It speaks of how svelte (a frontend framework/web-compiler) handles truly reactive data. It sort of takes the javascript $: operator (don't ask me what it's supposed to do natively, idk either), and generates something reactive from that. Watch the talk to experience some more that it can do - it's really cool.

I think this is a brilliant way of handling dependencies between variables, and it really helped me visualize what something like that might look like in kaitai struct. Theoretically, could there be a second stage of compilation for the generated files? The first pass is what we have now which generates the parser.

The change would be placing markers with data about values that relate to each other. It doesn't need to be syntactically correct to the native language - that's for the second pass to handle. For now it just generates the parser along with some type of markers. This means that what's generated at this point is essentially just a long string.

The second pass (perhaps a kaitai_struct_linker or something) would ONLY look for these markers and replace them with the proper code to implement the relationships between them.

In my (possibly incredibly naivé) mind, this is good because it separates the concern of recompilation and serialization into two different phases that could potentially be toggled. If a parser is already in place and variables just need to be "linked" - just run the kaitai_struct_linker. If there is only a need for a parser, then just run the standard compiler that we already have (with an option toggled to leave out the markers).

...Does this idea make any sense?

KOLANICH commented 5 years ago

Current processors architecture in large part is incompatible to serialization. Having 2 functions is enough for very dumb processors, but not enough for smart ones like compression algorithms.

For compression algorithms we have to introduce a function for getting their arguments from the blob.
but these arguments arguments are not the only parameters. In fact we can split dictionary from the arguments. This means a yet another method to compute an optimal dictionary shared by muptiple streams.
And both these methods are specific to compression algos only.

So we need another architecture.

Probably we should eliminate process and introduce types doing the same instead coupled with typed value-instances #127, templates #135 and chains #196.

It may look like

seq: 
  - id: dictionary
    size: 256
  - id: compressed_blob0
    size: 100
  - id: compressed_blob1
    size: 500
instance:
  a_compression:
    type: kaitai.compress.algo
  b_compression:
    type: kaitai.compress.algo
  a:
    io: kaitai.compress.algo.uncompressed._io
    type: a

where computation of optimal dictionary should be specified in a separate spec which should be a part of KSC stdlib.

More precisely

compression_type:
  params:
    - id: compressed
      type: bytes
      generate: compress(args, dict, uncompressed)
    - id: args
      type: any
      generate:
        - get_args(compressed)
        - compute_optimal_args(uncompressed[i])
    - id: dict
      type: bytes
      generate: compute_optimal_dict(uncompressed[], args) # [] is used to inform the compiler that if all the other args form the key, by which all uncompressed are aggregated using this function. I have thought a bit about Einstein notation, but currently have no idea how much use it has here
    - id: uncompressed
      type: bytes
      generate: decompress(args, dict, compressed)

defines an interface of processor type

and

kaitai.compress.*: compression_type

defines that the user-implemented types under that namespace are compressors

jchv commented 5 years ago

FWIW, just to keep conversations up to date, I am working on trying to add support for basic serialization capabilities in various runtimes.

C++: kaitai-io/kaitai_struct_cpp_stl_runtime#30
Golang: kaitai-io/kaitai_struct_go_runtime#16

It looks largely like the Java runtime at least for the write functions themselves, which is entirely accidental (I had only just looked to see if there were any major differences.) I plan to continue pushing PRs to as many runtimes as possible, please yell at me if I'm doing it wrong. :)

(Sidenote: I am aware for more advanced write functionality it would make sense to work on the compiler and runtime portions in parallel, but for these really basic operations I don't think it matters very much, and it's low hanging fruit.)

kaitai-io / kaitai_struct

Serialization #27

Step 1: manual checks

Step 2: dependent variables