kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby
https://kaitai.io

Bringing a C-like alternative to ksy to reality #567

Open fudgepop01 opened 5 years ago

fudgepop01 commented 5 years ago

For a long time now, various developers have been turned away from this incredibly powerful tool because YAML is the only syntax it offers. I'd like to change that in the coming months if I can.

Because I got fed up with the YAML structure myself (and was, and still am, relatively new to programming, with about 4 years of experience), I set out to create an alternative written entirely in TypeScript that used nearley to transpile and generate a parser.

However, it now occurs to me that it would be much more productive for me to figure out a way to transpile my bitdef "language" into the proper YAML syntax that is used by Kaitai Struct.

This way, developers that are familiar with a C-like syntax would be much more inclined to use this tool as a final solution to the issue of writing parsers for their language of choice.

fudgepop01 commented 5 years ago

Here is an example of the target I wish to pursue: https://gist.github.com/dar2355/e414e232db0f952b2983504095e9508d

I'll be transpiling a few more files by hand so I know exactly how I want certain conventions to go, but that's the... gist of it. 👍

GreyCat commented 5 years ago

@dar2355 I'd like to invite you to discuss some of the general approaches you'll be taking, and to try to persuade you to keep it closer to how KS works ;)

For example, can we keep overall KS specs structure? I.e. top-level element is always a TypeSpec, and:

In that spirit, I'd propose starting with a very literal translation of existing practices, like this:

type something {
  meta {
    file_extension = "gz";
    endian = le;
    // note: no id:, as every type already has its name in its header
  }
  params {
    u4 some_int_param;
    str some_str_param;
    f8[] double_array_param;
  }
  seq {
    u4 foo;
    u2 len_my_str;
    str my_str { size = len_my_str; };
    user_type my_obj { process = xor(0xaa); size = 1024; };
  }
  instances {
    block_type block { pos = 5; io = my_obj._io; };
  }
  type user_type {
    // another inner type declared here
    seq {};
    instances {};
  }
  enum animal {
    dog = 4;
    cat = 7;
  }
}

... and gradually think of how this can be improved for brevity.
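
To make the "very literal translation" idea concrete, here is a minimal sketch of how one field line from the draft above could be mapped onto the keys a `.ksy` attribute already uses (`id`, `type`, `size`, `process`). The grammar is only what the example implies, and `FIELD_RE` and `field_to_attr` are hypothetical names; array types such as `f8[]` and nested blocks are deliberately not handled.

```python
import re

# Assumed shape of one field line in the C-like draft (not a settled grammar):
#   <type> <name> [ { key = value; ... } ] ;
FIELD_RE = re.compile(
    r"^\s*(?P<type>\w+)\s+(?P<name>\w+)\s*(?:\{(?P<body>.*)\})?\s*;\s*$"
)


def field_to_attr(line):
    """Translate one C-like field line into a ksy-style attribute mapping."""
    match = FIELD_RE.match(line)
    if match is None:
        raise ValueError(f"not a field declaration: {line!r}")
    attr = {"id": match["name"], "type": match["type"]}
    body = match["body"] or ""
    for setting in body.split(";"):
        if setting.strip():
            # Split on the first '=' only, so values may contain '=' themselves.
            key, _, value = setting.partition("=")
            attr[key.strip()] = value.strip()  # values are kept as raw strings
    return attr


print(field_to_attr("u4 foo;"))
print(field_to_attr("user_type my_obj { process = xor(0xaa); size = 1024; };"))
```

Keeping the option names identical to the existing ksy keys means the translation step stays a plain rename-free mapping, which is the point of starting literal and only later optimizing for brevity.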

Lots of open questions, of course:

fudgepop01 commented 5 years ago

There are definitely many questions and design decisions I've been thinking about. Currently, instead of trying to replicate all the formats, I think it's best that I translate the ksy user guide into this bitdef language, because that will end up containing most (if not all) of the specification, which can be easily reviewed and updated later on if necessary.

Here is what I have so far:

bitdef_spec.txt bitdef_spec.md

(same file, just different extensions)


The design philosophy I'm currently going with is to model them on C structs, but with preprocessor-like features that make it easy to apply things to a general scope.

For instance, with the if statements, I introduce the idea of having multiple items rely on a single if statement to avoid repetition: image
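
Since the screenshot is not available here, the following is only a sketch of what the shared-if idea could mean once it is lowered to ksy, where every attribute carries its own `if` key. It assumes the grouped fields have already been parsed into attribute mappings; `expand_shared_if` is a hypothetical helper name.

```python
def expand_shared_if(condition, attrs):
    """Copy one shared condition onto every attribute in the group.

    An attribute that already has its own condition keeps it, combined
    with the shared one using the expression language's `and` operator.
    """
    expanded = []
    for attr in attrs:
        combined = condition
        if "if" in attr:
            combined = f"({condition}) and ({attr['if']})"
        expanded.append({**attr, "if": combined})
    return expanded


group = [
    {"id": "width", "type": "u2"},
    {"id": "height", "type": "u2"},
    {"id": "depth", "type": "u1", "if": "version > 1"},
]
for attr in expand_shared_if("has_header", group):
    print(attr)
```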

To add other attributes that I'm unsure how to model in an intuitive way, I use @<attribute> above the declaration, seen here: image

Because they apply to the scope of whatever they're above (and can be overridden by ones later down the line), a separate meta "struct" isn't necessary.
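
The override rule can be sketched as a single pass that keeps a table of the attributes currently in effect. The concrete `@<attribute>` syntax is not shown in this thread, so the input below is an already-tokenized, hypothetical form, and the output is an intermediate representation rather than valid ksy (in ksy itself, for example, endianness lives in `meta` or in a type suffix such as `u4le`).

```python
def apply_scoped_attributes(items):
    """Attach the attributes in effect to each field that follows them.

    `items` is a list of tuples in source order (a hypothetical token form):
      ("attr", key, value)        an @<attribute> directive
      ("field", type_name, id)    a field declaration
    """
    active = {}
    resolved = []
    for item in items:
        if item[0] == "attr":
            _, key, value = item
            active[key] = value  # a later directive overrides an earlier one
        else:
            _, type_name, field_id = item
            # Snapshot the attributes active at this point in the scope.
            resolved.append({"id": field_id, "type": type_name, **active})
    return resolved


source = [
    ("attr", "endian", "le"),
    ("field", "u4", "magic"),
    ("attr", "endian", "be"),
    ("field", "u4", "length"),
]
for field in apply_scoped_attributes(source):
    print(field)
```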


For now, I plan to write the parser (along with whatever preprocessor steps are necessary) with nearley.js, since that's what I'm most familiar with. It would really be a proof of concept before moving on to letting others translate it into Scala or whatever Kaitai Struct uses (which is fastparse, if I recall correctly).


so my general plan is / has been:

  1. write out the user guide
  2. convert some formats
  3. build the parser with nearley
     3.1. make syntax highlighting for it
  4. convert that parsed data into the same kind of AST that ksy uses (which can be done with nearley's postprocessors)
  5. convert that into fastparse / whatever Kaitai Struct uses

At each step I'll be taking feedback and pull requests. For now, though, I'm still only on step 1.

dwsinger commented 3 years ago

This would be cool, and super-useful. I'd like to either USE or GENERATE a human-readable syntax in the documentation. At the moment I use a quasi-C syntax that's readable enough, but it's not formal enough and not implemented. Here's a trivial example.

class GeneralTypeBox(code) extends Box(code) {
    unsigned int(32)    major_brand;
    unsigned int(32)    minor_version;
    unsigned int(32)    compatible_brands[];    // to end of the box
}
class FileTypeBox extends GeneralTypeBox ('ftyp')
{}

I'm not wedded to this syntax, but I want something that can either be fed into Kaitai, or is an output that can be fed into humans.
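
As a rough illustration of the "fed into Kaitai" direction, here is a sketch that maps the integer declarations from the example above onto ksy attributes. It assumes `int(N)` with N in 8/16/32/64 corresponds to the ksy types `s1`..`s8` / `u1`..`u8`, and that "to end of the box" corresponds to `repeat: eos` when the box body is parsed as its own sized substream. Byte order is left to the spec's `meta`. `DECL_RE` and `decl_to_attr` are hypothetical names, and class headers and inheritance are not handled.

```python
import re

# Matches lines such as: unsigned int(32) compatible_brands[];  // comment
DECL_RE = re.compile(
    r"^\s*(?P<sign>unsigned\s+)?int\((?P<bits>\d+)\)\s+"
    r"(?P<name>\w+)(?P<array>\[\])?\s*;"
)


def decl_to_attr(line):
    """Translate one quasi-C integer declaration into a ksy-style attribute."""
    match = DECL_RE.match(line)
    if match is None:
        raise ValueError(f"unsupported declaration: {line!r}")
    bits = int(match["bits"])
    if bits % 8 or bits // 8 not in (1, 2, 4, 8):
        raise ValueError(f"unsupported integer width: {bits}")
    prefix = "u" if match["sign"] else "s"
    attr = {"id": match["name"], "type": f"{prefix}{bits // 8}"}
    if match["array"]:
        attr["repeat"] = "eos"  # unsized array: read until the stream ends
    return attr


for decl in (
    "unsigned int(32)    major_brand;",
    "unsigned int(32)    compatible_brands[];    // to end of the box",
):
    print(decl_to_attr(decl))
```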