kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby
https://kaitai.io

bootstrapping? kaitai for kaitai? #952

Open swang206 opened 2 years ago

swang206 commented 2 years ago

Does Kaitai Struct provide a parser for its own .ksy format? Bootstrapping seems quite interesting to me if I want to do something with Kaitai.

generalmimon commented 2 years ago

@swang206 I can't completely rule out that it's possible, but the representation would be impractical and the .ksy spec would be a mess. Besides, it would probably be inefficient in both time and memory (that's what happens when you use an inappropriate tool for the task).

KSY stands for Kaitai Struct YAML, so you need to parse YAML - that's not an easy task in general. There is a reason why we still don't have our own YAML parser (https://github.com/kaitai-io/kaitai_struct/issues/229) in kaitai-struct-compiler and still rely on external ones for each environment. That said, there is now a promising parser, https://github.com/jodersky/yamlesque, which is surprisingly small (the main YAML parsing code is just under 400 LoC, so you can look at it: Parser.scala). Of course, it doesn't implement all features of the YAML spec, but that is mostly a good thing, because it can stay minimal and we use only basic YAML features anyway. However, we would need at least flow style support, which is currently missing: https://github.com/kaitai-io/kaitai_struct/issues/229#issuecomment-1013884903

To give an idea of what parsing a structured text format (JSON) would look like in Kaitai Struct, I have a simple example:

json_objs_and_strings.ksy contents

```ksy
meta:
  id: json_objs_and_strings
seq:
  - id: root
    type: tag
types:
  tag:
    -webide-representation: '(...)'
    seq:
      - id: start
        contents: '{'
      - id: key_value_pairs
        type: key_value
        repeat: until
        repeat-until: _.end_lookahead != ','
      - id: end
        contents: '}'
  key_value:
    -webide-representation: '{key}: {value}'
    seq:
      - id: key
        type: str_lit
      - id: colon
        contents: ':'
      - id: value
        type: tag_value
      - id: continuation
        contents: ','
        if: end_lookahead == ','
    instances:
      end_lookahead:
        pos: _io.pos
        size: 1
        type: str
        encoding: ASCII
  tag_value:
    -webide-representation: '{value_lit}{value_tag}'
    seq:
      - id: value_tag
        type: tag
        if: value_lookahead == '{'
      - id: value_lit
        type: str_lit
        if: value_lookahead == '"'
    instances:
      value_lookahead:
        pos: _io.pos
        size: 1
        type: str
        encoding: ASCII
  str_lit:
    -webide-representation: '{lit}'
    seq:
      - id: lit_start
        contents: '"'
      - id: lit
        terminator: 0x22 # '"'
        type: str
        encoding: UTF-8
```

Test data (adapted from https://json.org/example.html):

{"glossary":{"title":"example glossary","GlossDiv":{"title":"S","GlossList":{"GlossEntry":{"ID":"SGML","SortAs":"SGML","GlossTerm":"Standard Generalized Markup Language","Acronym":"SGML","Abbrev":"ISO 8879:1986","GlossDef":{"para":"A meta-markup language, used to create markup languages such as DocBook."},"GlossSee":"markup"}}}}}

As you can see, something this simple can be achieved with Kaitai Struct, but note that it only handles a significantly reduced subset of JSON (it assumes no whitespace, knows only string values - no numbers, and doesn't know about arrays or character escapes in strings, etc.). Nevertheless, I can imagine that it could be extended to support most of JSON. But YAML is noticeably more complex than JSON, so it would probably be much more difficult to write a working .ksy for it, and I don't think the result would be worth the effort.
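
For comparison, here is a rough hand-written Python sketch of the same reduced grammar as the .ksy above. It is not what kaitai-struct-compiler would generate. The function names are made up to mirror the type names in the spec, and it only illustrates the one-byte lookahead logic. Two deliberate differences: it collects the key/value pairs into a `dict` (so duplicate keys collapse, whereas the .ksy keeps a list), and it raises an error when a value starts with neither `{` nor `"` (the .ksy would just leave both optional fields empty).

```python
def parse_str_lit(data: bytes, pos: int) -> tuple[str, int]:
    """str_lit: an opening '"', then bytes up to the terminator 0x22 (no escapes)."""
    if data[pos:pos + 1] != b'"':
        raise ValueError(f"expected '\"' at offset {pos}")
    end = data.index(b'"', pos + 1)  # raises ValueError if the string is unterminated
    return data[pos + 1:end].decode("utf-8"), end + 1


def parse_tag_value(data: bytes, pos: int):
    """tag_value: peek one byte (value_lookahead) to choose between tag and str_lit."""
    lookahead = data[pos:pos + 1]
    if lookahead == b"{":
        return parse_tag(data, pos)
    if lookahead == b'"':
        return parse_str_lit(data, pos)
    # Assumption of this sketch: treat anything else as an error.
    raise ValueError(f"unsupported value at offset {pos}")


def parse_tag(data: bytes, pos: int = 0) -> tuple[dict, int]:
    """tag: '{', one or more key_value entries, then '}'."""
    if data[pos:pos + 1] != b"{":
        raise ValueError(f"expected '{{' at offset {pos}")
    pos += 1
    result = {}
    while True:
        key, pos = parse_str_lit(data, pos)
        if data[pos:pos + 1] != b":":
            raise ValueError(f"expected ':' at offset {pos}")
        value, pos = parse_tag_value(data, pos + 1)
        result[key] = value
        # end_lookahead: keep repeating only while the next byte is ','
        if data[pos:pos + 1] != b",":
            break
        pos += 1  # consume the continuation ','
    if data[pos:pos + 1] != b"}":
        raise ValueError(f"expected '}}' at offset {pos}")
    return result, pos + 1
```

Calling `parse_tag(data)` on the test data above (as `bytes`) should return the nested dictionary together with the offset just past the final `}`. Like the .ksy, it rejects whitespace, numbers, arrays, escapes and empty objects.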