Construct target optimizations and enhancements

kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby

https://kaitai.io

4.03k stars 197 forks source link

Construct target optimizations and enhancements #394

Open GreyCat opened 6 years ago

GreyCat commented 6 years ago

This issue will accumulate and track various optimizations and enhancements for Construct target (#377).

[ ] Use CString where applicable instead of NullTerminated(GreedyString)
[ ] Use Bytes(n) instead of FixedSized(n, GreedyBytes)
[x] Use StringEncoded(FixedSized(GreedyBytes)) instead of FixedSized(GreedyString)
[ ] Use default values of include, consume, term in NullTerminated
[ ] Topological sort of type declarations to avoid / minimize using LazyBound(lambda: type_name)
[ ] Computed(...) sometimes can be used without lambda
[ ] Inline Structs (and Enums) declarations where applicable / desirable, or as render option
[ ] Enums could add # comments including docs and -original-name
[ ] -construct-render: options for Padding Flag inlineStruct inlineEnum

arekbulski commented 6 years ago

Bytes and FixedSized(GreedyBytes) are not interchangeble, they have different build semantics. Bytes takes data of exact length, the other takes whataever is shorter and adds padding. I suggest we make FixedSized default and use -construct-render: Bytes as option.

GreyCat commented 6 years ago

Then it should be exactly Bytes, as size: X does not do any padding and is expected to fail when reading past end of stream.

arekbulski commented 6 years ago

FixedSized(10) would fail if given a 5 byte stream. The difference is in how it builds (which Kaitai doesnt do, right?). If size: N, type: struct1 means a substream passed to Struct then size: N should mean a substream passed to GreedyBytes. You are the guy who designed it so I wont pretend to know better than you what those tags are supposed to mean, but my impression/understanding is, size is supposed to mean (1) constrain parsing to a chunk (2) consume entire chunk even if type needs less bytes. This would at least imply that building would also have (2) semantics, build that many bytes even if type produced less. Meaning, add padding.

GreyCat commented 6 years ago

When building:

size: 5

the byte array or string is expected to be exactly 5 bytes long. Everything else would result in a pre-build validation error. A slightly more complicated case:

- id: len
  type: u4
- id: buf
  type: str
  size: len

Here, buf is expected to match len exactly. If it doesn't, it's a validation error. User either need to fix len or fix buf before building. If user's just submitting parse result into building, they're guaranteed to match.

For a user type invocation with a substream:

size: 5
type: foo

that foo type is expected to be 5 bytes or less. If it is more than that, it's validation error. If it's less than that, then padding would be applied to ensure that it's exactly 5 bytes in the output.

The point here is read-write-read consistency: one is expected to build cleanly whatever's was just parsed from a valid input binary stream, and, vice versa, pase exactly the same stuff that was built.

GreyCat commented 6 years ago

OTOH, you're absolutely right, these build semantics are something that could be implementation-dependent, so there's no point in enforcing a specific behavior.

arekbulski commented 6 years ago

I think that KSY syntax has inherently some abiguity, it doesnt express intentions very clearly. For example size is often used as Padding as much as it is used as Bytes. Also u1 and b1 are often used as Flag and not an integer. Which leads me to an idea, would you consider adding a u1bool type? That of course being separate from -construct-render: Flag option.

arekbulski commented 6 years ago

Use Bytes(n) instead of FixedSized(n, GreedyBytes)

Actually I just commited some optimisation that makes FixedSized detect that subcon is GreedyBytes, in which case it doesnt use a substream but just returns the data bytes. Performance of these two is identical. Build semantics however are not. And there is also readability aspect.

arekbulski commented 6 years ago

Computed(...) sometimes can be used without lambda

I will need to run some benchmark to confirm this, but I believe that lambda is actually faster than a this expression. The latter was concieved as a tool for convenience, something that is easier to write, but its not by any measure intended to be top performance. Therefore I would actually consider leaving current form as the default.