kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby
https://kaitai.io

Evaluate what can be borrowed from construct and hachoir #104

Closed KOLANICH closed 6 years ago

KOLANICH commented 7 years ago

There are other projects for the same or similar purposes: https://github.com/construct/construct and https://github.com/vstinner/hachoir3/ . We need to evaluate what we can take from them and how we should cooperate with them (for example, implementing tools that convert the Construct description format into ksy and back would allow sharing description files).

It is also written in the docs that construct has some serialization support.

koczkatamas commented 7 years ago

@KOLANICH if you could create a comparison sheet showing the similarities, the differences, and which ideas from construct could be reused in Kaitai Struct, that would be great!

So far, the only idea that has come to my mind is that we could create a construct "language" that generates a construct descriptor from a Kaitai format, so we could provide some serialization support for people who use construct and Python. But we have only very limited development resources (our free time), so we should decide wisely how to spend them. In this case (at least for now), I think it's better to spend our time on Kaitai features and bugfixes (even on Kaitai-based serialization, though there are still more serious issues) than on construct compatibility.

arekbulski commented 6 years ago

It is also written in the docs that construct has some serialization support.

No, constructs can be pickled but there is no serialization.

arekbulski commented 6 years ago

I have fairly limited knowledge of Kaitai, so please tell me if the following already exist:

I suggest that you skip everything else that is not on this list. Docs for these classes: https://construct.readthedocs.io/en/latest/#api-reference

GreyCat commented 6 years ago

Please don't refer to Kaitai Struct project as "Kaitai". There are other projects under that umbrella name (some of them are not very visible), yet it's still better to refer to the language as "KS" ;)

BytesInteger (arbitrary sized ints, u16 for 128-bit ints)

No, and it's probably not very easy to support. Many target languages have no standard implementation of something like bigints that you'd want to map BytesInteger to.
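
For contrast, Python is one target where this is easy, because its built-in int is arbitrary-precision. The following is only a minimal sketch using the standard library; the helper name read_uint_be is made up for illustration and is not part of KS or Construct.

```python
def read_uint_be(data: bytes, size: int) -> int:
    """Decode the first `size` bytes of `data` as an unsigned big-endian integer."""
    if len(data) < size:
        raise ValueError("not enough bytes")
    # Python ints have no fixed width, so 16 bytes (128 bits) need no bigint library
    return int.from_bytes(data[:size], byteorder="big", signed=False)


raw = bytes(15) + b"\x01"             # 16 bytes, value 1
value = read_uint_be(raw, 16)
big = read_uint_be(b"\xff" * 16, 16)  # largest 128-bit unsigned value
```

Languages without such a built-in type would need a bigint library or a custom representation, which is the difficulty described above.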

PascalString

Generally, there's no need for it: one creates it as a user-defined type in place, and reuses it as needed.

GreedyString

Yes, that's type: str + size-eos: true

Range

I'm not sure I understand what it should map to in KS.

RepeatUntil

repeat: until + repeat-until: condition
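
As a rough illustration of the repeat-until semantics (keep reading items until the condition holds for the item just read, and keep that item), here is a hand-written Python sketch. It is not code generated by KS and not Construct's API; read_u1_until is a hypothetical helper.

```python
import io


def read_u1_until(stream, predicate):
    """Read unsigned bytes until predicate(item) is true; the matching item is kept."""
    items = []
    while True:
        chunk = stream.read(1)
        if len(chunk) != 1:
            raise EOFError("stream ended before the condition was met")
        item = chunk[0]
        items.append(item)
        if predicate(item):
            return items


stream = io.BytesIO(b"\x01\x02\x00\x05")
items = read_u1_until(stream, lambda b: b == 0)  # stops after the zero byte
remaining = stream.read()                        # bytes left for later fields
```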

Rebuild

"Value instances" are probably the equivalent. Something like:

instances:
  foo:
    value: bar.length + 42 # assuming `bar` is a string or byte array
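
Conceptually, a value instance behaves like a lazily computed, cached attribute. The Python sketch below only illustrates that idea; the class Record is hypothetical, and the code that KS actually generates differs per target language.

```python
from functools import cached_property


class Record:
    def __init__(self, bar: bytes):
        self.bar = bar

    @cached_property
    def foo(self) -> int:
        # Mirrors `value: bar.length + 42`; computed on first access, then cached
        return len(self.bar) + 42


rec = Record(b"abc")
foo_value = rec.foo
```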

Default

Nothing like that, I guess, and probably KS's approach would be pretty different from Construct's, as we don't offer propagation of control on derived structures like dictionaries & enums.

Numpy

Something like that is actually discussed in #188.

NamedTuple

No equivalent, as a similar structure does not exist in the majority of target languages.

Padded and Aligned

Not yet, there's #12 submitted about that long ago, and we're still living with a workaround proposed there.

Pointer

That's what "parse instances" are for:

instances:
  foo:
    pos: ofs_foo + 42 # at stream position `ofs_foo + 42`
    type: u4 # there's one 32-bit unsigned integer
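
A hand-written Python sketch of what such a parse instance amounts to: jump to the given position, read the value, and restore the previous stream position. It assumes little-endian byte order (the snippet above does not state one), and read_u4le_at is a made-up helper name, not KS runtime API.

```python
import io
import struct


def read_u4le_at(stream, pos):
    """Read one little-endian 32-bit unsigned integer at `pos`, then restore the position."""
    saved = stream.tell()
    try:
        stream.seek(pos)
        raw = stream.read(4)
        if len(raw) != 4:
            raise EOFError("not enough bytes at the requested position")
        return struct.unpack("<I", raw)[0]
    finally:
        stream.seek(saved)  # sequential parsing continues where it left off


ofs_foo = 4
data = bytes(46) + struct.pack("<I", 0xDEADBEEF)  # value stored at offset 46
stream = io.BytesIO(data)
stream.read(2)                                    # pretend some fields were already parsed
foo = read_u4le_at(stream, ofs_foo + 42)
position_after = stream.tell()
```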

RawCopy

Something similar is achieved by using --debug mode for compilation: it creates special arrays that one can query for the position of attributes inside the stream (that's how all visualizers work). It heavily affects performance, so it is usually turned off in production code. It's not supported for the Python target yet either :(

Prefixed

Substreams are probably what we're talking about here:

seq:
  - id: len
    type: u4
  - id: body 
    size: len
    type: some_subcon

body is guaranteed to work on its own stream that has exactly len bytes.
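
The substream guarantee can be sketched in plain Python as follows. This is only an illustration under assumptions: the length prefix is taken to be little-endian, and parse_prefixed is a hypothetical helper rather than KS runtime code.

```python
import io
import struct


def parse_prefixed(stream):
    """Read a u4 length prefix, then wrap exactly that many bytes in a separate stream."""
    (length,) = struct.unpack("<I", stream.read(4))
    body = stream.read(length)
    if len(body) != length:
        raise EOFError("body is shorter than the declared length")
    return length, io.BytesIO(body)  # the body parser can never read past `length` bytes


data = struct.pack("<I", 3) + b"abc" + b"TRAILER"
stream = io.BytesIO(data)
length, body_stream = parse_prefixed(stream)
body = body_stream.read()  # reading to end-of-stream stops at the substream boundary
rest = stream.read()       # the outer stream continues right after the body
```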

PrefixedArray

repeat: expr + repeat-expr: number_of_items
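
In plain Python, a count-prefixed array could look like the sketch below, where number_of_items comes from a field parsed just before the array. The item type (u2) and little-endian order are assumptions chosen for the example, and parse_counted_u2le is a made-up name.

```python
import io
import struct


def parse_counted_u2le(stream):
    """Read a u4 item count, then that many little-endian 16-bit unsigned integers."""
    (count,) = struct.unpack("<I", stream.read(4))
    raw = stream.read(2 * count)
    if len(raw) != 2 * count:
        raise EOFError("fewer items than the declared count")
    return list(struct.unpack("<%dH" % count, raw))


data = struct.pack("<I3H", 3, 10, 20, 30)
items = parse_counted_u2le(io.BytesIO(data))
```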

arekbulski commented 6 years ago

As far as Construct is concerned, you can close this topic.

There is a framework called suitcase, but it's identical to Construct 2.5 (from 2 years ago). https://digidotcom.github.io/python-suitcase/latest/index.html

There is a Construct equivalent for Java, but it's also identical to 2.5. https://github.com/ZiglioUK/construct

Hachoir: correct me if I am wrong, but isn't that essentially just a collection of file formats? It's not a framework for making parsers, it's a collection of parsers. Right? Also, the link in the top post is invalid: https://hachoir3.readthedocs.io/

GreyCat commented 6 years ago

Hachoir is a framework, why not? This is, for example, their basic parsing example with relevant ksy:

from hachoir.field import Parser, CString, UInt16
from hachoir.core.endian import LITTLE_ENDIAN

class Point(Parser):
    endian = LITTLE_ENDIAN

    def createFields(self):
        # Each field takes: parent, name, description
        yield CString(self, "name", "Point name")
        yield UInt16(self, "x", "X coordinate")
        yield UInt16(self, "y", "Y coordinate")

In KS:

meta:
  id: point
  endian: le
seq:
  - id: name
    terminator: 0
    doc: Point name
  - id: x
    type: u2
    doc: X coordinate
  - id: y
    type: u2
    doc: Y coordinate

arekbulski commented 6 years ago

I found one interesting feature in Hachoir, and decided to add it to Construct. I suggest adding a Kaitai schema too. The class is Timestamp.

KOLANICH commented 6 years ago

No, constructs can be pickled but there is no serialization.

Please check carefully. There are references to it in the docs


>>> format = Struct(
...     "signature" / Const(b"BMP"),
...     "width" / Int8ub,
...     "height" / Int8ub,
...     "pixels" / Array(this.width * this.height, Byte),
... )
>>> format.build(dict(width=3,height=2,pixels=[7,8,9,11,12,13]))
b'BMP\x03\x02\x07\x08\t\x0b\x0c\r'

and in the source code there is some code ( https://github.com/construct/construct/blob/master/construct/core.py#L1014 https://github.com/construct/construct/blob/master/construct/core.py#L737 https://github.com/construct/construct/blob/master/construct/core.py#L947 https://github.com/construct/construct/blob/master/construct/core.py#L1246 https://github.com/construct/construct/blob/master/construct/core.py#L1369 https://github.com/construct/construct/blob/master/construct/core.py#L1419 https://github.com/construct/construct/blob/master/construct/core.py#L2079 and many other ones ) related to it.

KOLANICH commented 6 years ago

I guess that Timestamp should not be implemented as a type, but as a function for process.

arekbulski commented 6 years ago

What you call serialization, I simply call building. Serialization means transforming constructs themselves into bytes and back, the templates, not the data.

KOLANICH commented 6 years ago

I don't understand what you mean, could you clarify?

Let me give my definition of serialization in the context of KS, and especially in the context of #27.

Suppose we have a binary format f; the set FS of all sequences of bits; its subset FS_f of byte sequences that are valid instances of f; a set PL of object-oriented Turing-complete programming languages; the set KSY of valid Kaitai Struct definitions, with the subset KSY_f of definitions for the format f; and the KS compiler KSC : PL × KSY → (PSC, SSC), where PSC is a set of parsing programs FS → O and SSC is a set of serializing programs O → FS. For all s ∈ FS_f, ksy_f ∈ KSY_f and pl ∈ PL, with KSC(pl, ksy_f) = (psc_{ksy_f}, ssc_{ksy_f}), we require ssc_{ksy_f}(psc_{ksy_f}(s)) ≡ s. To be practically usable, there should also be a way to create an o = psc_{ksy_f}(s) programmatically, without parsing an actual bit string s.

Serialization is the part of KSC producing serialization programs.
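
To make the round-trip requirement concrete, here is a small hand-written Python sketch for a toy format (a zero-terminated name followed by two little-endian u2 values). It is not output of the KS compiler; Point, parse_point and serialize_point are names invented for this example.

```python
import struct
from dataclasses import dataclass


@dataclass
class Point:
    name: bytes
    x: int
    y: int


def parse_point(s: bytes) -> Point:
    """Plays the role of psc: byte string -> object."""
    end = s.index(b"\x00")                     # position of the name terminator
    x, y = struct.unpack_from("<HH", s, end + 1)
    if len(s) != end + 5:
        raise ValueError("trailing bytes after the point")
    return Point(s[:end], x, y)


def serialize_point(o: Point) -> bytes:
    """Plays the role of ssc: object -> byte string."""
    if b"\x00" in o.name:
        raise ValueError("name must not contain the terminator byte")
    return o.name + b"\x00" + struct.pack("<HH", o.x, o.y)


s = b"origin\x00\x01\x00\x02\x00"
round_tripped = serialize_point(parse_point(s))  # should reproduce s exactly
built = serialize_point(Point(b"p", 3, 4))       # object created without any parsing
```

The last line corresponds to the practical requirement above: an object can be constructed directly and then serialized.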

Do you mean something similar by "building"? What do you mean by "serialization", then?

arekbulski commented 6 years ago

I guess one man's drink (glass of water) is another man's quantum fluctuations field.

Building is opposite of parsing.

Serialization is building where the data type is a schema. Construct is about to gain a Pickled construct, which writes arbitrary Python objects using the pickle protocol. It just so happens that constructs (schemas) are picklable as well. For example:

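from construct import Struct, Pickled, GreedyBytes, BytesInteger

schema = Struct("template"/Pickled, "value"/GreedyBytes)
d = BytesInteger(4)
# Build the schema itself, plus data built with that schema, into one byte string
bytes1 = schema.build(dict(template=d, value=d.build(0)))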

This would allow sending arbitrary schemas over the wire, as well as arbitrary data that was built using those schemas. The app on the other side of the socket could work like this:

schema = Struct("template"/Pickled, "value"/GreedyBytes)
x = schema.parse(bytes1)
x.template                 # -> arbitrary schema
x.template.parse(x.value)  # -> arbitrary object

KOLANICH commented 6 years ago

Thank you for the clarification.

arekbulski commented 6 years ago

Construct added the * operator for attaching docstrings, see: https://construct.readthedocs.io/en/latest/advanced.html#documenting-fields

arekbulski commented 6 years ago

I am closing this because: