kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby
https://kaitai.io

Ksy modularization / external ksy inclusion (discussion) #71

Open koczkatamas opened 7 years ago

koczkatamas commented 7 years ago

I feel the need for several somewhat coupled features, but I am not exactly sure how to approach the problem, so I will describe them here so we can discuss them and then decide whether to open distinct issues or not.

External type / ksy inclusion

For example, in the Mach-O format there is an ASN.1 structure in one of the code signature blocks. So I'd like to indicate at this code signature field in macho.ksy that it requires the ASN.1 "module", and which type should be used from it.

This way the ASN.1 module can be imported automatically, so it is no longer the end user's responsibility to import every required module. (For example, the WebIDE could import the required additional .ksy files before compiling the parser code.)

The user could also determine what should happen if the module is missing: an error is thrown, the parsing is skipped and the inner contents are available only as a byte array, etc.

But to reference these external modules we need...

Module names and versions

If we use unique names (e.g. io.kaitai.serialization.asn1) for every module instead of direct references (e.g. ../serialization/asn1/asn1_der.ksy), then module dependencies will be weakly coupled and can be resolved dynamically, even by searching multiple repositories (for example, an internal repository and the public kaitai.io repository).

As these modules can change independently, we need some kind of versioning too, so we can make sure exactly the same module is used and parsing won't suddenly break after an update.

Maybe it's worth applying the "a major version change breaks compatibility" rule here too, and otherwise trying to preserve compatibility.
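As an illustration only, that rule could be checked with a few lines of code (the helper below is hypothetical and not part of any Kaitai tooling; it assumes plain major.minor.patch version strings):

```javascript
// Sketch: decide whether an available module version can satisfy a
// requested one under "same major = compatible" rules.
// Hypothetical helper, not part of the Kaitai Struct compiler.
function isCompatible(requested, available) {
    var req = requested.split(".").map(Number);
    var avail = available.split(".").map(Number);
    // A major version bump is a breaking change.
    if (req[0] !== avail[0]) return false;
    // Within the same major version, anything equal or newer is fine.
    if (avail[1] !== req[1]) return avail[1] > req[1];
    return (avail[2] || 0) >= (req[2] || 0);
}

console.log(isCompatible("1.2.0", "1.7.3")); // newer minor, same major
console.log(isCompatible("1.2.0", "2.0.0")); // major bump: incompatible
```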

Extension points for internal content

Sometimes we need to invert the importing process.

Let's take the example of a TCP packet. You cannot list every possible inner packet type in tcp.ksy, so we need some kind of extension/merging support.

One option is that an included but external .ksy could extend the TCP module by adding new enums and switch cases.

Another option is to mark one of the fields as an "extension point"; a static or even dynamic loader could then search through all the extensions which have published that they can extend io.kaitai.network.tcp_segment.body and select the appropriate one (by their filter expression, magic matching pattern, etc.).

In the latter case it would be best if we could decouple the extension points from the extensions. For example, neither the TCP packet should know that it can accept an SSL extension, nor should the SSL extension know that it can be embedded into TCP; instead we could use an external mapping file which lists these pairings. Although if we use magic matching patterns (or at least give hints to the application), then maybe we don't need to couple things together manually at all.
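A minimal sketch of such an external mapping, in JavaScript; every identifier and the shape of the matches predicate are invented for illustration:

```javascript
// Sketch of a decoupled extension registry: neither tcp_segment nor the
// extensions reference each other; a mapping file wires them together.
// All module IDs and the context object here are hypothetical.
var extensionMap = {
    "io.kaitai.network.tcp_segment.body": [
        // Each entry publishes a match predicate (e.g. a port check or
        // magic byte pattern) so the loader can pick one dynamically.
        { module: "io.kaitai.network.tls_record",
          matches: function (ctx) { return ctx.dstPort === 443; } },
        { module: "io.kaitai.network.http_request",
          matches: function (ctx) { return ctx.dstPort === 80; } }
    ]
};

function resolveExtension(point, ctx) {
    var candidates = extensionMap[point] || [];
    for (var i = 0; i < candidates.length; i++) {
        if (candidates[i].matches(ctx)) return candidates[i].module;
    }
    return null; // no match: fall back to a raw byte array
}

console.log(resolveExtension("io.kaitai.network.tcp_segment.body", { dstPort: 443 }));
```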

I know it's a lot, and chaotic at first, but I think it's worth starting to discuss these possible improvements.

KOLANICH commented 7 years ago

io.kaitai.serialization.asn1

application/ber-stream:1.0@kaitai.io is better, where:

- application/ber-stream is a MIME type
- 1.0@kaitai.io is a KS document version, where
- kaitai.io is any unique identifier to distinguish between different mutually incompatible versions of the description of the same format
- 1.0 is a local version (any identifier, for example a git commit hash or release name)

GreyCat commented 7 years ago

@koczkatamas I completely agree that we'd want some import mechanism and file format identification. Basically, the whole history of programming language development proves that it's (at least currently) the right way to go, and we shouldn't restrict ourselves to stone-age techniques like "a compiler that compiles only one file" and "a separate header file that needs to be #included".

There are generally two approaches to solving that problem: (1) n source files + 1 project file, or (2) n source files with some import-like statements. A "project" file for KS would probably be overkill, so it's natural to discuss an "import" solution. Probably something like this will do the trick:

meta:
  id: ip_packet
  imports:
    - some_id_of_tcp_packet
    - some_id_of_udp_packet

As for naming, it's actually a somewhat complicated matter. For starters, we'll need to match the namespacing schemes of existing languages. Right now, we don't touch that subject, except for enforcing a namespace/package via language-specific CLI options (i.e. --php-namespace, --dotnet-namespace, etc.). And quite a few languages have no concept of namespaces at all.

So, my proposal is to start with very simple stuff like:

By default, it's a good idea to point this "ksy root" to something like a format repo checkout. Obviously, there could be more than one "root" (as in a search path), and there should be some option to specify it in ksc, something like ksc -I /path/to/my/format/lib.

Note that this proposal does not prevent us from using more schemes to specify imported components in the future. After all, we can actually use an arbitrary map in the import specification, something like this (Maven-style):

  imports:
    - package: tcp
      vendor: some_one
      group: some_group
      version: 1.7
      repo: http://foo.bar.example.com/some/path

@KOLANICH Using a MIME type as an identifier is a bad idea: tons of formats lack a proper MIME type, and for those that have one, it's not unique (for example, audio/wave, audio/wav, audio/x-wav, audio/x-pn-wav, audio/vnd.wave and audio/x-riff are all used for the same plain .wav audio files). Lots of binary formats that KS parses are not file formats at all (i.e. network protocols, etc.), and thus would never have a MIME type.

KOLANICH commented 7 years ago

@KOLANICH Using a MIME type as an identifier is a bad idea: tons of formats lack a proper MIME type, and for those that have one, it's not unique (for example, audio/wave, audio/wav, audio/x-wav, audio/x-pn-wav, audio/vnd.wave and audio/x-riff are all used for the same plain .wav audio files).

For this case there should be a possibility to declare multiple MIME types for a single KS document.

Lots of binary formats that KS parses are not file formats at all (i.e. network protocols, etc.), and thus would never have a MIME type.

We can use our own MIME types for them if we specify our own additions to the top-level media types and document them in an RFC :).

koczkatamas commented 7 years ago

The MIME type can be described as meta information inside the file; see this issue for further information: https://github.com/kaitai-io/kaitai_struct/issues/59.

I agree with @GreyCat; I don't like the idea of referencing .ksy descriptors via MIME types either.

KOLANICH commented 7 years ago

Mime-type can be described as meta information inside the file

It certainly can.

I don't like the idea of referencing .ksy descriptors via mime types either.

The original idea was to make different versions of a ksy for the same type interchangeable, and to allow KS docs to import any (suitable) available version: by not specifying a full version (then any found version, meaning "implementation defined", will be used), by not specifying a local version (then any version with the same version ID will be used), or by specifying constraints (which work only for local versions in known formats: git commit hashes and semantic versioning).

About the versioning implementation: it is IMHO obvious. One process opens an HTTP service, scans the specified folder and serves the requests. The other process is the KS compiler, which uses an HTTP REST API to get the files it needs. Security is through challenge-response authorization (captchas, OpenPGP, X.509, passwords, etc.), servicing only clients from localhost by default (for local servers), and TLS for public servers on the internet. The REST API is as follows:

- /index.txt - the index: a list of the full IDs of the ksys in the repo
- /*.ks* - KS documents
- /rest/caps.txt - additional supported capabilities
- /rest/capability_name.json - a path to a capability

Because the protocol doesn't require much interaction, the server can be as simple as a repo on GitHub with precomputed responses.

GreyCat commented 7 years ago

The initial implementation is pushed to the repository. meta/imports now works with relative imports more or less as described. One new test — imports0 — is added to test the very basic stuff. Much more testing is needed.

GreyCat commented 7 years ago

@koczkatamas Importing files is obviously done differently for the JVM and JS versions. The key component is the ClassSpecs interface, an abstract interface that's supposed to have platform-dependent implementations:

There are basically 2 things to solve with JavaScript's imports support:

My idea for the input side is basically that importRelative and importAbsolute of JavaScriptClassSpecs should be implemented in JS, and this interface supplied as an argument to MainJs.compile.

My idea for the output side is even simpler: right now MainJs.compile returns an array of generated files. I propose to return a map instead: keys = file names, values = generated sources. That suits both C++, with its two generated files, and a file with imports.
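Assuming the proposed map-shaped output, a caller could handle it along these lines (a sketch only; the map shape is what's proposed above, and the file names are made up):

```javascript
// Sketch: iterate a hypothetical { filename: source } result map, as
// proposed for MainJs.compile's return value, writing each entry out
// through a caller-supplied function.
function writeGeneratedFiles(result, writeFile) {
    Object.keys(result).forEach(function (fileName) {
        writeFile(fileName, result[fileName]);
    });
}

// Usage with a fake result covering the C++ two-file case:
var fakeResult = { "my_format.h": "// header...", "my_format.cpp": "// impl..." };
writeGeneratedFiles(fakeResult, function (name, src) {
    console.log("would write " + name + " (" + src.length + " bytes)");
});
```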

JaapAap commented 7 years ago

Speaking of the 'stone-age techniques like "compiler that compiles only one file" and "separate header file needed to be #included"' mentioned above: to me it would also be useful to be allowed to specify several related data structures in a single file. I believe they now all have to live in their own .ksy files, correct?

GreyCat commented 7 years ago

Having everything in one file has been possible since the very first version of KS. You just add all the types into the types map and that's it. Or did you mean something else?

JaapAap commented 7 years ago

I stand corrected. I'm not sure why having multiple structures (including multiple meta statements) failed for me, but indeed it compiles fine (it might have been a spacing issue, as the editor sometimes seems to mess up indentation). However, I was hoping that the following defines two data structures, but I see only the second one being generated:

meta:
  id: struct1
  endian: le
seq:
  - id: data_id1 
    type: u4
meta:
  id: struct2
  endian: le
seq:
  - id: data_id2
    type: u4
koczkatamas commented 7 years ago

@GreyCat This solution sounds reasonable.

But there will be a problem: you cannot currently read files synchronously in JavaScript (e.g. when you have to download the ksy from the internet). If we cannot use asynchronous "callbacks" (e.g. Promise-based), then we have to use multi-stage compiling:

In the first stage we try to discover every required import. This means that if the import chain is a.ksy -> b.ksy -> c.ksy, then calling compile on a.ksy yields an error that b.ksy is required; then we supply a.ksy and b.ksy as input, but we still get an error that c.ksy is required; and only when we finally supply a.ksy, b.ksy and c.ksy can we compile the whole code.

I am not sure this is the correct way to go, because in every "import discovery round" we have to parse every file from the beginning.

Of course, I could parse the ksy files and collect the imports beforehand. That way we don't have to re-parse every file in every discovery round: I only have to parse the newly imported files, collect their imports, and once there are no more imports, supply every file to the compiler.

The drawback of this solution is that we have to implement this logic outside the compiler, but maybe we can ship it as part of the compiler's JS distribution. That way the user of the compiler only has to implement the async (Promise-based) file loading mechanism.
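The pre-parsing approach described above could be sketched roughly like this (shown synchronously for clarity; fetchKsy and parseImports stand for embedder-supplied hooks and are not real compiler API):

```javascript
// Sketch: collect the transitive import closure before invoking the
// compiler once. fetchKsy/parseImports are hypothetical embedder hooks.
function collectImports(entryName, fetchKsy, parseImports) {
    var sources = {};          // name -> ksy source text
    var queue = [entryName];   // names still to fetch
    while (queue.length > 0) {
        var name = queue.shift();
        if (sources[name] !== undefined) continue; // already fetched
        var src = fetchKsy(name);
        sources[name] = src;
        // Only newly seen files are parsed, so each file is parsed once.
        parseImports(src).forEach(function (dep) { queue.push(dep); });
    }
    return sources; // feed the whole set to the compiler in one pass
}

// Usage with a fake in-memory "repository" and a toy import extractor:
var repo = { "a": "imports: [b]", "b": "imports: [c]", "c": "" };
var all = collectImports("a",
    function (n) { return repo[n]; },
    function (src) {
        var m = src.match(/imports: \[(.*)\]/);
        return m ? m[1].split(",") : [];
    });
console.log(Object.keys(all)); // the full import closure of a.ksy
```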

By the way: shouldn't we distribute the YAML parser too as part of the JS distribution? This way the user would only have to supply the ksy file as a string, and everything else would be handled by the Kaitai compiler.

GreyCat commented 7 years ago

@JaapAap

However, I was hoping that the following defines two data-structures but I see only the second one being generated:

Please take a look at how YAML/JSON work. Generally, it's impossible to have keys with identical names in the same map. That is, if you do:

meta: foo
meta: bar

... only one of the values will survive, i.e. the last one:

YAML.load(doc) # => {"meta"=>"bar"}
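The same last-key-wins behavior can be observed in plain JavaScript, which is what the JS build of the compiler would be dealing with:

```javascript
// JSON.parse (like the YAML parsers discussed here) silently keeps the
// last of two duplicate keys, without any warning:
var doc = JSON.parse('{"meta": "foo", "meta": "bar"}');
console.log(doc.meta); // "bar": the first value is silently dropped
```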
JaapAap commented 7 years ago

Thanks for that. Would it be possible to at least make the parser emit a warning or error? I can imagine that when importing from other files, duplicate keys could easily arise, and it would be nice to be warned at compile time, as this could lead to hard-to-find errors.

GreyCat commented 7 years ago

@koczkatamas:

But there will be a problem: nowadays you cannot read files synchronously in JavaScript (eg. you have to download the ksy from the internet). If we cannot create asynchronous "callbacks" (eg. Promise-based), then we have to use a multi-stage compiling:

Ok, understood. I'd really like to avoid over-complicated stuff like multi-stage JS -> Scala -> JS compiling, or implementing preloading in JS. Let's think about how we can introduce async/futures into this general workflow:

I guess we need to check how we can tunnel Promises or some other callback-calling APIs into ScalaJS...

koczkatamas commented 7 years ago

@GreyCat yeah, I understood that this would be the optimal compiling workflow.

I just fear that if we want to support async ksy loading, the compiler has to be rewritten more broadly than you'd expect at first. If there is no async/await-style support in the language, then you have to rewrite every affected method in the call chain to be async, I think.

For example, you cannot just foreach over every import, because it has to follow some logic like this:

loadImportAsync(ksy.meta.import[0]).then(classSpecs0 =>
  loadImportAsync(ksy.meta.import[1]).then(classSpecs1 => ...

Of course, you can wait on all these futures and collect all the results before going further, but using Promises will require some changes to the current code either way.
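For reference, waiting on all the imports at once instead of chaining them could look like this (a sketch; loadImportAsync is the hypothetical loader from the snippet above):

```javascript
// Sketch: kick off all direct imports in parallel and continue only
// once every promise has resolved, instead of nesting .then calls.
function loadAllImports(importNames, loadImportAsync) {
    // classSpecs[i] corresponds to importNames[i]
    return Promise.all(importNames.map(loadImportAsync));
}

// Usage with a fake loader that resolves immediately:
loadAllImports(["tcp", "udp"], function (name) {
    return Promise.resolve({ id: name });
}).then(function (classSpecs) {
    console.log(classSpecs.length); // one spec per import
});
```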

But I presume you are aware of this. I just wanted to warn you that these types of changes will be required.

GreyCat commented 7 years ago

@koczkatamas Generally, imports are the only thing that depends on external resources such as I/O and thus requires async. Everything beyond imports may stay exactly as is, so it's not that much of a change.

Let's try to imagine some sort of API we might have here. You've proposed that the JS part will have to provide function(s) that promise a JS object tree for a requested YAML file. I'll try to write some fake functions that do that and experiment with a Scala.js counterpart for it, ok?

koczkatamas commented 7 years ago

Yeah, that sounds perfect.

Please use the interface of the standard JS Promise if you can.

I think this is the Scala.js counterpart: https://github.com/scala-js/scala-js/blob/master/library/src/main/scala/scala/scalajs/js/Promise.scala

GreyCat commented 7 years ago

@koczkatamas Ok, you were right. That proved to be somewhat harder than I expected. Problem number one is that while regular Scala has Await.result, which effectively converts async operations into blocking ones (waits for a result and continues execution only after getting the promise's result), Scala.js doesn't have one, by design. Things would be much easier if it did ;)

In the absence of a wait operation, we're basically dealing with the following: there is a single process (import analysis of a single file) which launches several import processes, each of which promises to return 0..1 ClassSpec objects. However, each ClassSpec must be analyzed as well, so it's a recursive thing.

There is Future.sequence, which can schedule an operation to execute when all futures in some collection complete, but judging from its implementation, it won't work as intended if we modify the contents of the collection while it's waiting.
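In JavaScript terms, the recursion can still be expressed without any blocking wait by having each level return a promise and joining the children with Promise.all; the sketch below uses hypothetical loadSpec/importsOf hooks and is not the actual compiler code:

```javascript
// Sketch: resolve a spec and, recursively, everything it imports.
// Marking names as seen before recursing keeps cycles from looping.
function resolveRecursive(name, loadSpec, importsOf, seen) {
    seen = seen || {};
    if (seen[name]) return Promise.resolve(seen);
    seen[name] = true;
    return loadSpec(name).then(function (spec) {
        var children = importsOf(spec).map(function (dep) {
            return resolveRecursive(dep, loadSpec, importsOf, seen);
        });
        // No blocking wait needed: Promise.all joins the child promises.
        return Promise.all(children).then(function () { return seen; });
    });
}
```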

I'll try to test it now and prepare a question to ask on Stack Overflow...

GreyCat commented 7 years ago

Phew, it works. For reference, this is my SO question.

I've pushed compiler changes to a distinct js-async branch.

So, ultimately I came up with the following API:

  1. One needs to implement a class in JS that adheres to the JavaScriptImporter interface, i.e. something like this:
var JavaScriptImporter = (function() {
    function JavaScriptImporter() {}

    JavaScriptImporter.prototype.importYaml = function(name) {
        return new Promise(function(resolve, reject) {
            console.log("loadImportAsync: starting import of " + name);
            setTimeout(function() {
                console.log("loadImportAsync: pretending that we have imported it");
                resolve({"meta":{"id":"second_one"},"seq":[{"id":"foo","type":"u1"}]});
            }, 500);
        });
    };

    return JavaScriptImporter;
})();
  2. One needs to call the JS compiler like this:
var compilerPromise = ks.compile(targetLang, src, new JavaScriptImporter());
  3. The compiler returns a promise, so one needs to handle it accordingly:
compilerPromise.catch(function(err) {
    // err is an exception
});

compilerPromise.then(function(r) {
    // r is a map(filename -> generated source)
});
GreyCat commented 7 years ago

@JaapAap

Would it be possible to at least make the parser provide a warning or error?

Unfortunately, we're using external YAML parsers (actually two of them: distinct ones for the JVM and JS), and as far as I can see, both accept duplicate keys without any errors. For example, there is an issue in SnakeYAML that states this can be solved with some "custom constructor", but I have no idea how to write one, or at least how to borrow it from the Spring Boot project.

In theory, we need one good YAML parser library in Scala. Unfortunately, it doesn't seem to exist :( I've tried to start my own, but quickly dismissed the idea, as the YAML specification is indeed one of the most cryptic specs I've ever encountered (and this is not only my opinion).

GreyCat commented 7 years ago

As for handling absolute ksy search paths, let's see how others do that:

So far, following the principle of least surprise, it looks like our choices would be:

The only real problem here is these "system defaults". In theory, there should be some way to configure a binary build (i.e. for Debian, Windows, etc.) to specify a platform-dependent location.

GreyCat commented 7 years ago

JVM imports are done and working. I suggest splitting the remaining discussion points into separate issues and closing this one.

ams-tschoening commented 7 years ago

@GreyCat

Tamas: to me it would also be useful to be able to be allowed specifying several related data structures in a single file. I believe they now all have to live in their own .ksy file, correct?

Having everything in one file is possible since the very first version of KS. You just add all types into types map and that's it. Or you've meant something else?

I have the same situation currently: things don't work, and I'm wondering if it's implemented the way I thought. Could you please have a look at the following example and tell me if it's possible? Reading the comments in LoadImports.scala gave me the feeling it is, but maybe I misunderstood things. Importing and using only the top-level type of a KSY does work.

fmt_oms_bcd_digits.ksy

meta:
  id:      fmt_oms_bcd_digits
  endian:  le

# fmt_oms_bcd_digits:
types:
  bcd_02_digits:
    seq:
      - id:   b1
        type: u1
  [...]     

  bcd_04_digits:
    seq:
  [...]

fmt_oms_rec.ksy

meta:
  id:      fmt_oms_rec
  endian:  le
  imports:
    - fmt_oms_bcd_digits
    - [...]

# fmt_oms_rec:
types:
  fmt_oms_dll_header:
    seq:
      - id:   reading_serial
        size: 4
        type: bcd_08_digits
        doc:  Reading serial of the device.
GreyCat commented 7 years ago

The problem here is that it results in the following hierarchy:

Thus, if you want to access bcd_08_digits from anywhere outside fmt_oms_bcd_digits or its subtypes, you need to address it somehow with a full name, i.e. something like fmt_oms_bcd_digits::bcd_08_digits or fmt_oms_bcd_digits.bcd_08_digits. It would have worked (actually, there is even relevant code to resolve such complex names), but one thing is lacking: there is no way to specify type: XXX, where XXX is more than one name.

So, effectively, yes: right now one can only use top-level types from imported stuff. This is partially by design, because I definitely don't want files to become libraries of otherwise unrelated types (more or less embracing Java's principle of 1 class = 1 file).

In your particular case, it's probably even more efficient to transition to parametric types. I'll try to add BCD-related .ksy implementations to common/ soon.

ams-tschoening commented 7 years ago

The problem here is that it results in the following hierarchy:[...] So, effectively, yes, right now one can only use top-level types from imported stuff.

I have the same problem with an enum, correct? I have quite a few large enums and would like to move them into individual files as well, but in the new KSY the enum is a child of the KSY ID, and the enum resolution order doesn't know about such things. You said enums are only resolved "upwards" in the hierarchy, so each would need to be its own file, like a type, which clashes with the assumption that KSY files are types. If I'm not missing anything, is there any chance/idea how that could be changed so that enums can use imported KSY files? Is that worth creating a new issue? One enum per file would be totally OK, of course.

In your particular case, probably it's even more efficient to make a transition to parametric types. I'll try to add BCD-related .ksy implementations to common/ soon.

No reason to rush things; I like my pretty verbose implementations for now. I just didn't want to create too many KSY files in my currently used package structure, but I guess that can be solved by creating some new structure as well.

GreyCat commented 7 years ago

I have the same problem with an enum, correct?

Well, yes and no. The overall picture is similar, but enum has more details, namely:

Moreover, you're correct that currently every ksy file is a type. I'm not sure it's a good idea to break that principle. If we keep it, every enum effectively gets a class-based namespace. If we get rid of it, we'll have to think of a way to address these "homeless" enums (I'm actually not even sure that all our target languages support that).

No reason to rush things, I like my pretty verbose implementations for now.

I've uploaded a sample bcd.ksy. It's currently lacking support for parsing half-byte digits when the order of digits differs from "msb-to-lsb", but I'm not sure whether that shouldn't rather be fixed at the bit-reading level (i.e. support for reading bits msb-to-lsb vs. lsb-to-msb).
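As a standalone illustration of the nibble-order issue (not taken from the actual bcd.ksy), decoding the same packed BCD bytes in the two orders gives different numbers:

```javascript
// Sketch: decode packed BCD bytes into a number, with either nibble
// order. The bytes 0x12 0x34 read as 1234 msb-to-lsb, but as 2143
// when the low nibble of each byte comes first.
function decodeBcd(bytes, hiNibbleFirst) {
    var digits = [];
    bytes.forEach(function (b) {
        var hi = (b >> 4) & 0x0f; // high half-byte digit
        var lo = b & 0x0f;        // low half-byte digit
        if (hiNibbleFirst) digits.push(hi, lo);
        else digits.push(lo, hi);
    });
    return digits.reduce(function (acc, d) { return acc * 10 + d; }, 0);
}

console.log(decodeBcd([0x12, 0x34], true));  // 1234
console.log(decodeBcd([0x12, 0x34], false)); // 2143
```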

atomass commented 7 years ago

I think that the ability to declare common types/enums in a separate ksy file could lead to better readability and file reusability. I'm facing this problem myself, because I have to deserialize a communication protocol with different messages that share some types/enums.

GreyCat commented 7 years ago

@garlix Importing common types with meta/imports has been supported since v0.7. Enums are somewhat trickier, and they have been under consideration for quite some time. We'll probably need a separate issue for that to avoid forgetting...

ams-tschoening commented 6 years ago

I'm not sure if this is already discussed elsewhere, but I stumbled across it just now and would like to mention it: if one externalizes a KSY and imports it later in some other type, how are _root and _parent supposed to work in the external type? It seems those do not cross KSY borders currently: _root in my external type resolves to the external type itself, not to the type importing it.

The only reason the type using _root is external is to reduce lines of code in the "main KSY". If it's not external, _root works as expected and allows access to the outermost type within the main KSY. Additionally, that external type never has different _roots; it's only used in one place, so in theory KS could provide some "super/outer/..." _root which leaves the current KSY and provides the importing type.

Am I correct that this is a limitation, or am I doing something wrong? Might this be related to #275, to get some ::_root or such some day? Like in C++, with its absolute and relative namespace paths.

KOLANICH commented 6 years ago

Modules are modular. If they accessed parent types, for example using _root, they would not be modular, so they would not be modules. And this was discussed somewhere.

In #275 I meant enums and types, not fields. But if you need to allow types to access the calling module's root, we need another syntax, not _root. And there must be some checks and mechanisms in the compiler. IMHO, passing something (a reference to the parent's field?) as a param is a better approach.

GreyCat commented 6 years ago

I believe it was discussed, and the general consensus is more or less what @KOLANICH says: ksys are supposed to be self-sufficient, i.e. we only want to support dependencies which are forward and explicitly stated in imports. It would be a huge mess if _root relied on this particular ksy being imported from some other ksy, and only worked in that case.

If you want to pass this "mega root" across the ksy boundaries, you can do it now using parameterized types, i.e.:

main_type.ksy

meta:
  id: main_type
  imports:
    - imported_type
seq:
  - id: some_secret_value
    type: u1
  - id: call_included
    type: imported_type(some_secret_value) # <= pass it explicitly

imported_type.ksy

meta:
  id: imported_type
params:
  - id: my_secret_value
    type: u1
seq:
  - id: my_str
    # size: _root.some_secret_value # <= won't work due to file boundaries
    size: my_secret_value # <= use it here
KOLANICH commented 6 years ago

@GrayCat, can _io, _root and other service objects (for example chains proposed in #196) be passed this way (in all runtimes)?

GreyCat commented 6 years ago

@KOLANICH _io and _root should be ok. _io would need declaration like type: io, and _root could be passed with type: struct (or, with an explicit circular import, with full proper type name).

Also please don't bother GrayCat ;)

ams-tschoening commented 6 years ago

I tried the approach using parameterized types as you suggested and got the following error message, which is not very clear to me. Could you explain what it actually means, so that I have a clue what I'm doing wrong?

fmt_oms_rec: /meta/imports/18: fmt_oms_control_info (of class java.lang.String)

Additionally, I thought of other possibilities to split my large file into pieces, as you have suggested in the past, e.g. simply splitting the file at the text level and concatenating it in some kind of build script. Such an approach can be seen as a pre-processor step, and I searched a bit around that keyword to see whether YAML already provides mechanisms for it. I found some approaches in that direction and wanted to ask if there's any interest in discussing such an approach as well? Not as a replacement for imports etc., of course, but more as an additional layer which might be easier to implement in the compiler than dealing with the current limitations around enums and the like. This might save people from reinventing the wheel for use cases where the main goal is only to reduce lines of code in a KSY file, for which some YAML-level include might already be enough.

So, any interest or is it a no-go? I'm fine if it's the latter of course, was just thinking in some directions.

GreyCat commented 6 years ago

Could you explain what that actually means so that I have a clue what I'm doing wrong?

Looks like a bug in compiler which hides the actual exception. Please try compiling it with --ksc-exceptions enabled and post the whole trace?

I've found some approaches in that direction and wanted to ask if there's any interest in discussing such an approach as well?

I feel like implementing the proper stuff, like a::b::c::enum_name, is probably faster than discussing and inventing alternative approaches. It's not like there's some hard blocker there or anything.

Mingun commented 2 weeks ago

I think it's time to close this issue, because imports were implemented a long time ago.