Open koczkatamas opened 7 years ago
> io.kaitai.serialization.asn1

`application/ber-stream:1.0@kaitai.io` is better, where:

- `application/ber-stream` is a MIME type
- `1.0@kaitai.io` is a KS document version, where:
  - `kaitai.io` is any unique identifier to distinguish between different mutually incompatible versions of the description of the same format
  - `1.0` is a local version (any identifier, for example a git commit hash or release name)
@koczkatamas I completely agree that we'd want some import mechanism and file format identification. Basically, the whole history of programming language development proves that it is (at least currently) the right way to go, and we shouldn't restrict ourselves to stone-age techniques like "a compiler that compiles only one file" and "a separate header file that needs to be `#include`d".
There are generally two approaches to solving that problem: (1) n source files + 1 project file, (2) n source files with some import-like statements. A "project" file for KS would probably be overkill, so it's natural to discuss an "import" solution. Probably something like this will do the trick:
```yaml
meta:
  id: ip_packet
  imports:
    - some_id_of_tcp_packet
    - some_id_of_udp_packet
```
As for naming, it's actually a somewhat complicated matter. For starters, we'll need to match the namespacing schemes in existing languages. Right now, we don't touch that subject, except for enforcing some namespace/package via language-specific CLI options (i.e. `--php-namespace`, `--dotnet-namespace`, etc). And quite a few languages have no concept of namespaces at all.
So, my proposal is to start with very simple stuff:

- `foo` should just look for `foo.ksy` in the current directory (or some sort of equivalent in the Web IDE)
- `foo/bar` should look for `foo/bar.ksy` from the current directory
- `/foo/bar/baz` should look for `foo/bar/baz.ksy` in some sort of "ksy root" (akin to gcc's `-I` directory, Java's `CLASSPATH` directory, Python's `PYTHONPATH`, Ruby's `$LOAD_PATH`, etc.)

By default, it's a good idea to point this "ksy root" to something like a format repo checkout. Obviously, there could be more than one "root" (as in a search path), and there should be some option to specify it in `ksc`, something like `ksc -I /path/to/my/format/lib`.
Note that this proposal does not ban us from adding more schemes for specifying imported components in the future. After all, we can actually use an arbitrary map in an import specification, something like this (Maven-style):

```yaml
imports:
  - package: tcp
    vendor: some_one
    group: some_group
    version: 1.7
    repo: http://foo.bar.example.com/some/path
```
@KOLANICH Using MIME type as an identifier is a bad idea: tons of formats lack a proper MIME type, and for those that have one, it's not unique (for example, `audio/wave`, `audio/wav`, `audio/x-wav`, `audio/x-pn-wav`, `audio/vnd.wave`, `audio/x-riff` are all used for the same plain .wav audio files). Lots of binary formats that KS parses are not file formats at all (i.e. network protocols, etc), and thus they would never have a MIME type.
> @KOLANICH Using MIME type as an identifier is a bad idea: tons of formats lack proper MIME type and in those who have MIME type, it's not unique (for example, audio/wave, audio/wav, audio/x-wav, audio/x-pn-wav, audio/vnd.wave, audio/x-riff are used for the same plain .wav audio files).

For this case there should be a possibility to declare multiple MIME types for a single KS document.

> Lots of binary formats that KS parsers are not file formats at all (i.e. network protocols, etc), and thus they would never have a MIME type.

We can use our own MIME types for them if we specify our own additions to the top-level media types and document them in an RFC :).
A MIME type can be described as meta information inside the file; see this issue for further information: https://github.com/kaitai-io/kaitai_struct/issues/59.
I agree with @GreyCat, I don't like the idea of referencing .ksy descriptors via MIME types either.
> Mime-type can be described as meta information inside the file

It certainly can.
> I don't like the idea of referencing .ksy descriptors via mime types either.

The original idea was to make different versions of a ksy for the same type interchangeable, and to allow KS docs to import any (suitable) available version: by not specifying a full version at all (then any version found, meaning "implementation defined", will be used), or by not specifying a local version (then any version with the same version ID will be used), or by specifying constraints (which work only for local versions in known formats: git commit hashes and semantic versioning).
About the versioning implementation: it is IMHO obvious. One process opens an HTTP service, scans the specified folder and processes the requests. Another process is the KS compiler, which uses an HTTP REST API to get the files it needs. Security is through challenge-response authorization (captchas, OpenPGP, X.509, passwords, etc), servicing only clients from localhost by default (for local servers), and TLS for public servers on the internet. The REST API is as follows:
- `/index.txt`: the index, a list of full IDs of the ksys in the repo
- `/*.ks*`: KS documents
- `/rest/caps.txt`: additional supported capabilities
- `/rest/capability_name.json`: a path to a capability
Because the protocol doesn't require much interaction, the server can be as simple as a repo on GitHub with precomputed responses.
An initial implementation has been pushed into the repository. `meta/imports` now works with relative imports more or less as described. One new test — imports0 — is added to test very basic stuff. Much more testing is needed.
@koczkatamas Importing files is obviously done differently for JVM and JS versions. The key component is `ClassSpecs`, an abstract interface that's supposed to have platform-dependent implementations:

- JVM: uses `FileReader` to read local files + SnakeYAML to parse them
- JS: ??? (not implemented yet), and thus will throw a "not implemented" exception if `imports` is used in JS

There are basically 2 things to solve with JavaScript's imports support:
My idea for input would be basically that `importRelative` and `importAbsolute` of `JavaScriptClassSpecs` should be implemented in JS, and this interface supplied as an argument to `MainJs.compile`.
My idea for output is even simpler: right now `MainJs.compile` returns an array of generated files. I propose to generate a map: keys = file names, values = generated sources. It will suit fine both C++ (with its two generated files) and files with imports.
Speaking of the 'stone-age techniques like "compiler that compiles only one file" and "separate header file needed to be #included"' mentioned by Tamas: to me it would also be useful to be allowed to specify several related data structures in a single file. I believe they now all have to live in their own .ksy file, correct?
Having everything in one file has been possible since the very first version of KS. You just add all the types into the `types` map and that's it. Or did you mean something else?
I stand corrected. Not sure why having multiple structures (including multiple meta statements) failed on me, but indeed it compiles fine (it might have been a spacing issue, as the editor sometimes seems to mess indentation up). However, I was hoping that the following defines two data structures, but I see only the second one being generated:
```yaml
meta:
  id: struct1
  endian: le
seq:
  - id: data_id1
    type: u4
meta:
  id: struct2
  endian: le
seq:
  - id: data_id2
    type: u4
```
@GreyCat This solution sounds reasonable.
But there will be a problem: nowadays you cannot read files synchronously in JavaScript (e.g. you have to download the ksy from the internet). If we cannot create asynchronous "callbacks" (e.g. Promise-based), then we have to use multi-stage compiling:
In the first stage we try to discover every required import. This means that if the import chain is `a.ksy` -> `b.ksy` -> `c.ksy`, then calling compile for `a.ksy` gives us an error that `b.ksy` is required; then we supply `a.ksy` and `b.ksy` as input, but we still get an error that `c.ksy` is required; and only when we finally supply `a.ksy`, `b.ksy` and `c.ksy` can we compile the whole code.
I am not sure that this is the correct way to go, because in every "import discovery round" we have to parse every file from the beginning. Of course, I could parse the ksy files and collect the imports beforehand; this way we don't have to parse every file in every discovery round. I only have to parse the newly imported files, collect their imports, and if there are no more imports, supply every file to the compiler.
The drawback of this solution is that we have to implement this logic outside the compiler, but maybe we can supply this additional logic as part of the compiler's JS distribution. That way the user of the compiler only has to implement the async (Promise-based) file loading mechanism.
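The "collect imports beforehand" idea is essentially a transitive-closure walk over the import graph. A minimal sketch, assuming we already have each file's import list parsed out (`importsOf` is a stand-in for that pre-parsed data, not a real compiler API):

```javascript
// Breadth-first walk over the import graph: each file is visited once,
// so no file has to be re-parsed in later "discovery rounds".
function collectImports(entry, importsOf) {
  const seen = new Set([entry]);
  const queue = [entry];
  while (queue.length > 0) {
    const cur = queue.shift();
    for (const dep of importsOf[cur] || []) {
      if (!seen.has(dep)) {
        seen.add(dep);
        queue.push(dep);
      }
    }
  }
  return [...seen];
}

// a.ksy -> b.ksy -> c.ksy, as in the example above
const deps = { "a.ksy": ["b.ksy"], "b.ksy": ["c.ksy"], "c.ksy": [] };
console.log(collectImports("a.ksy", deps)); // [ 'a.ksy', 'b.ksy', 'c.ksy' ]
```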
By the way: shouldn't we distribute the YAML parser too as part of the JS distribution? This way the user only has to supply the ksy file as a string; everything else will be handled by the Kaitai compiler.
@JaapAap
> However, I was hoping that the following defines two data-structures but I see only the second one being generated:
Please take a look at how YAML/JSON works. Generally, it's impossible to have keys with identical names in the same map. That is, if you do:

```yaml
meta: foo
meta: bar
```

... only one of the values will survive, i.e. the last one:

```ruby
YAML.load(doc) # => {"meta"=>"bar"}
```
Thanks for that. Would it be possible to at least make the parser provide a warning or an error? I can imagine that when importing from other files, duplicate keys could easily arise, and it would be nice to be warned at compile time, as this could lead to hard-to-find errors.
@koczkatamas:
> But there will be a problem: nowadays you cannot read files synchronously in JavaScript (eg. you have to download the ksy from the internet). If we cannot create asynchronous "callbacks" (eg. Promise-based), then we have to use a multi-stage compiling:
Ok, understood. I'd really like to avoid over-complicated stuff like multi-stage JS->Scala->JS compiling, or implementing preloading in JS. Let's think about how we can introduce async / futures into this general workflow:
- JS code calls `MainJS.compile(lang, yaml, ...)`, expecting it to return some compiled code (like a map of results, or whatever)
- `MainJS.compile`:
  - creates an empty `ClassSpecs` container (that is used to store multiple top-level `ClassSpec` entries)
  - parses `yaml` (which is not actually a YAML-as-string, but YAML-already-parsed-to-JS-objects) into a `ClassSpec`
  - calls `TypeProcessor.processTypesMany(specs, firstSpec, config)`
- `TypeProcessor.processTypesMany`:
  - takes `firstSpec` and adds it to `ClassSpecs`
  - calls `new LoadImports(specs).processClass(firstSpec)`
- `LoadImports.processClass`:
  - calls `loadImport` for every `meta/import` found in a `ClassSpec`
- `LoadImports.loadImport`:
  - iterates over the `imports` array, calling `specs.importAbsolute(...)` or `specs.importRelative(...)`, expecting it to return a loaded `ClassSpec` (both file loading and YAML→object tree parsing is platform-specific; object tree→ClassSpec parsing is shared across platforms)
  - adds it to `specs` and calls `LoadImports.processClass` recursively for the newly imported stuff

I guess we need to check how we can tunnel Promises or some other callback-calling APIs into ScalaJS...
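To make the recursive step concrete, here is a rough JS model of how the import loading could recurse with Promises. The function name mirrors `LoadImports.processClass`, but the body is an invented simplification, not the compiler's actual code:

```javascript
// Simplified model: each import resolves to a promise of a parsed spec,
// which may itself declare further imports, so processing recurses and
// Promise.all gathers each level of the import tree.
function processClass(spec, specs, importYaml) {
  const imports = (spec.meta && spec.meta.imports) || [];
  return Promise.all(imports.map(function (name) {
    if (specs[name]) return Promise.resolve(); // already loaded, skip
    return importYaml(name).then(function (imported) {
      specs[name] = imported;
      return processClass(imported, specs, importYaml);
    });
  }));
}

// usage: an a -> b -> c chain; importYaml fakes the platform-specific loader
const docs = {
  b: { meta: { id: "b", imports: ["c"] } },
  c: { meta: { id: "c" } },
};
const specs = {};
processClass({ meta: { id: "a", imports: ["b"] } }, specs,
  name => Promise.resolve(docs[name]))
  .then(() => console.log(Object.keys(specs))); // [ 'b', 'c' ]
```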
@GreyCat yeah, I understood that this would be the optimal compiling workflow. I just fear that if we want to support async ksy loading, then the compiler has to be rewritten more broadly than you'd expect at first. If there is no async-await style support in the language, then you have to rewrite every affected method in the call chain to be async, I think. For example, you cannot just foreach over every import, because it has to follow some logic like this:
```javascript
loadImportAsync(ksy.meta.import[0]).then(classSpecs0 =>
  loadImportAsync(ksy.meta.import[1]).then(classSpecs1 => ...
```
Of course you can wait on all these futures and collect all the results before going further, but using Promises will require some changes in the current code either way. But I presume you are aware of this. I just wanted to warn you that these types of changes will be required.
@koczkatamas Generally, imports are the only thing that depends on external resources such as I/O and thus requires async. Everything beyond imports may stay totally as is, so it's not that much of a change.

Let's try to imagine some sort of API we might have here. You've proposed that the JS part will provide function(s) that promise a JS object tree for a requested YAML file. I'll try to write some fake functions that do that and experiment with doing a ScalaJS counterpart for it, ok?
Yeah, that sounds perfect.
Please use the interface of the standard JS Promise if you can.
I think this is the Scala.js counterpart: https://github.com/scala-js/scala-js/blob/master/library/src/main/scala/scala/scalajs/js/Promise.scala
@koczkatamas Ok, you were right. That proved to be somewhat harder than I expected. Problem number one is that while regular Scala has `Await.result`, which effectively converts async operations to blocking ones (waits for a result and continues execution only after getting the promise's result), Scala.js doesn't have one by design. Things would be much easier if it had one ;)
In the absence of a wait operation, we're basically dealing with the following: there is a single process (`imports` analysis of a single file), which launches several import processes, which promise to return 0..1 `ClassSpec` objects. However, each `ClassSpec` must be analyzed as well, so it's a recursive thing.

There is a `Future.sequence` that can schedule an operation to execute when all futures in some sort of collection complete, but judging from its implementation, it won't work as intended if we modify the contents of the collection while it's waiting.

I'll try to test it now and prepare a question to ask on Stack Overflow...
Phew, it works. For reference, this is my SO question.
I've pushed compiler changes to a distinct js-async branch.
So, ultimately I came up with the following API:
```javascript
var JavaScriptImporter = (function() {
  function JavaScriptImporter() {}
  JavaScriptImporter.prototype.importYaml = function(name) {
    return new Promise(function(resolve, reject) {
      console.log("loadImportAsync: starting import of " + name);
      setTimeout(function() {
        console.log("loadImportAsync: pretending that we have imported it");
        resolve({"meta":{"id":"second_one"},"seq":[{"id":"foo","type":"u1"}]});
      }, 500);
    });
  };
  return JavaScriptImporter;
})();

var compilerPromise = ks.compile(targetLang, src, new JavaScriptImporter());
compilerPromise.catch(function(err) {
  // err is an exception
});
compilerPromise.then(function(r) {
  // r is a map(filename -> generated source)
});
```
@JaapAap
> Would it be possible to at least make the parser provide a warning or error?
Unfortunately, we're using external YAML parsers (actually, 2 of them: distinct ones for JVM and JS), and, as far as I can see, both treat duplicate keys without any errors. For example, there is an issue in SnakeYAML that states that this can be solved by some "custom constructor", but I have no idea how to write one, or at least how to borrow it from the Spring Boot project.

In theory, we need a good single YAML parser library in Scala. Unfortunately, it doesn't seem to exist :( I've tried to start my own, but quickly dismissed the idea, as the YAML specification is indeed one of the most cryptic specs I've ever encountered (and this is not only my opinion).
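For what it's worth, the simple `meta:`-twice case could even be caught by a crude pre-pass before the real YAML parser runs. A naive sketch (it only looks at un-indented scalar keys and handles nothing like the full YAML spec; `findDuplicateTopLevelKeys` is an invented helper, not part of any parser):

```javascript
// Naive duplicate-key check for flat "key: value" YAML mappings.
// A real solution would hook into the parser's mapping construction
// (e.g. SnakeYAML's constructor), as discussed above.
function findDuplicateTopLevelKeys(yamlText) {
  const seen = new Set();
  const dups = [];
  for (const line of yamlText.split("\n")) {
    const m = /^(\w+):/.exec(line); // only matches un-indented keys
    if (m) {
      if (seen.has(m[1])) dups.push(m[1]);
      seen.add(m[1]);
    }
  }
  return dups;
}

console.log(findDuplicateTopLevelKeys("meta: foo\nmeta: bar\n")); // [ 'meta' ]
```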
As for handling absolute ksy search paths, let's see how others do that:

- gcc: CLI option `-I`, used multiple times for multiple paths; `include "..."`; CLI paths; system defaults (see `cpp -v`)
- Java: CLI option `-cp` or `-classpath`, OS-dependent path separator; env var `CLASSPATH`, OS-dependent path separator; system defaults: bootstrap classes (`jre/lib/`), extension classes (`jre/lib/ext/`); priority: CLI > env var > current dir
- Python: env var `PYTHONPATH`, OS-dependent path separator; run-time `sys.path`
- Ruby: CLI option `-I`, OS-dependent path separator or multiple times; env var `RUBYLIB`, OS-dependent path separator; run-time `$LOAD_PATH`
- Node.js: env var `NODE_PATH`, OS-dependent path separator
- Perl: CLI option `-I`, multiple times for multiple paths; env var `PERLLIB` (or `PERL5LIB`), OS-dependent path separator; run-time `@INC`

So far, following the principle of least surprise, it looks like our choices would be:

- CLI `-I` option (doubled in long version as something like `--import-path`), either specified multiple times or with OS-dependent separators
- env var `KSPATH`, following the same logic

The only real problem here is the "system defaults". In theory, there should be some way to configure a binary build (i.e. for Debian, Windows, etc) to specify a platform-dependent location.
JVM imports are done and working. I suggest splitting the remaining discussion points into separate issues and closing this one.
@GreyCat

> Tamas: to me it would also be useful to be able to be allowed specifying several related data structures in a single file. I believe they now all have to live in their own .ksy file, correct?
>
> Having everything in one file is possible since the very first version of KS. You just add all types into types map and that's it. Or you've meant something else?
I have the same situation currently: things don't work, and I'm wondering if it's implemented the way I thought. Could you please have a look at the following example and tell me if it's possible? Reading the comments in `LoadImports.scala` I had the feeling it is, but maybe I misunderstood things. Importing and using only the top-level type of a KSY itself does work.
`fmt_oms_bcd_digits.ksy`:

```yaml
meta:
  id: fmt_oms_bcd_digits
  endian: le
# fmt_oms_bcd_digits:
types:
  bcd_02_digits:
    seq:
      - id: b1
        type: u1
  [...]
  bcd_04_digits:
    seq:
      [...]
```
`fmt_oms_rec.ksy`:

```yaml
meta:
  id: fmt_oms_rec
  endian: le
  imports:
    - fmt_oms_bcd_digits
    - [...]
# fmt_oms_rec:
types:
  fmt_oms_dll_header:
    seq:
      - id: reading_serial
        size: 4
        type: bcd_08_digits
        doc: Reading serial of the device.
```
The problem here is that it results in the following hierarchy:
Thus, if you want to access `bcd_08_digits` from anywhere outside `fmt_oms_bcd_digits` or its subtypes, you need to somehow address it with a full name, i.e. something like `fmt_oms_bcd_digits::bcd_08_digits` or `fmt_oms_bcd_digits.bcd_08_digits`. It would have worked (actually, there is even relevant code to resolve such complex names), but one thing is lacking: there is no way to specify `type: XXX` where `XXX` is more than one name.
So, effectively, yes, right now one can only use top-level types from imported stuff. Partially, this is by design, because I definitely don't want files to become libraries of otherwise unrelated types (more or less embracing Java's principle of 1 class = 1 file).

In your particular case, it's probably even more efficient to transition to parametric types. I'll try to add BCD-related .ksy implementations to `common/` soon.
> The problem here is that it results in the following hierarchy: [...] So, effectively, yes, right now one can only use top-level types from imported stuff.
I have the same problem with an enum, correct? I have quite a few large enums and would like to move them to individual files as well, but in the new KSY the enum is a child of the KSY ID, and the enum resolve order shouldn't know about such things? You said enums are only resolved "upwards" in the hierarchy, so they would need to be their own file, like a type, which clashes with the assumption that KSY files are types. If I'm not missing anything, any chance/idea how that could be changed to get enums working with imported KSY files? Is that worth creating a new issue? One enum per file would be totally OK of course.
> In your particular case, probably it's even more efficient to make a transition to parametric types. I'll try to add BCD-related .ksy implementations to common/ soon.
No reason to rush things, I like my pretty verbose implementations for now. Just didn't want to create too many KSY files in my currently used package structure, but I guess that can be solved by creating some new structure as well.
> I have the same problem with an enum, correct?
Well, yes and no. The overall picture is similar, but enums have more details, namely:

- There's the `enum: XXX` clause, which is handled in a manner similar to `type: XXX`. Indeed, it requires support for complex type designations, like `foo::bar::baz`, as well.
- Enum values are referenced in expressions as `foo::bar`. The big problem here is that this construction is unambiguously treated in the expression language as `(enum name) + "::" + (enum key)`. It is non-trivial to upgrade that to allow even stuff like `type_name::enum_name::enum_key`, and even more complex to do `type_name_foo::type_name_bar::enum_name::enum_key`.

Moreover, you're correct, currently every ksy file is a type. I'm not sure that it's a good idea to break that principle. If we keep it, it leads to every enum effectively having a class-based namespace. If we get rid of it, we'll have to think of a way to address these "homeless" enums (I'm actually not even sure that all our target languages support that).
> No reason to rush things, I like my pretty verbose implementations for now.
I've uploaded a sample bcd.ksy. It's currently lacking support for parsing half-byte digits when the order of digits is different from "msb-to-lsb", but I'm not sure this shouldn't be fixed at the bit-reading level (i.e. support for reading bits msb-to-lsb vs lsb-to-msb).
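For reference, the underlying decoding these BCD types perform is simple: each byte packs two 4-bit digits. A tiny illustration, assuming msb-to-lsb digit order (which is exactly the open question above):

```javascript
// Decode binary-coded decimal: high nibble first, then low nibble.
function decodeBcd(bytes) {
  const digits = [];
  for (const b of bytes) {
    digits.push((b >> 4) & 0x0f, b & 0x0f);
  }
  return digits.join("");
}

console.log(decodeBcd([0x12, 0x34])); // 1234
```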
I think that the ability to declare common types/enums in a separate ksy file could lead to better readability and file re-usability. I'm facing this problem myself because I have to de-serialize a communication protocol that has different messages that share some types/enums.
@garlix Importing common types with `meta/imports` support has been there since v0.7. Enums are somewhat trickier, and they have been under consideration for quite some time. Probably we'll need a separate issue for that to avoid forgetting...
Not sure if it's already been discussed elsewhere, but I stumbled across this right now and would like to mention it: if one externalizes a KSY and imports it later in some other type, how are `_root` and `_parent` supposed to work in the extern type? It seems those do not cross KSY borders currently? Because `_root` in my extern type resolves to the extern type only, not the type that is, e.g., importing my extern type.

The only reason the type using `_root` is extern is to reduce lines of code in the "main KSY". If it's not extern, `_root` works as expected and allows access to the outermost type within the main KSY. Additionally, that extern type never has different `_root`s; it's only used in one place, so in theory KS could provide some "super/outer/..."-`_root` which leaves the current KSY and provides the importing type.

Am I correct that this is a limitation, or am I doing something wrong? Might this be related to #275, to get some `::_root` or such some day? Like in C++, with its absolute and relative namespace paths.
Modules are modular. If they accessed parent types, for example using `_root`, they would not be modular, so they would not be modules. And this was discussed somewhere.
In #275 I meant enums and types, not fields. But if you need to allow types to access the caller module's root, we need another syntax, not `_root`. And there must be some checks and mechanisms in the compiler. IMHO passing something (a reference to the parent's field?) as a param is a better approach.
I believe it was discussed, and the general consensus is more or less what @KOLANICH says: ksys are supposed to be self-sufficient, i.e. we only want to support dependencies which are forward and explicitly stated in `imports`. It would be a huge mess if `_root` relied on this particular ksy being imported from some other ksy, and only worked in that case.
If you want to pass this "mega root" across the ksy boundaries, you can do it now using parameterized types, i.e.:
```yaml
meta:
  id: main_type
  imports:
    - imported_type
seq:
  - id: some_secret_value
    type: u1
  - id: call_included
    type: imported_type(some_secret_value) # <= pass it explicitly
```

```yaml
meta:
  id: imported_type
params:
  - id: my_secret_value
    type: u1
seq:
  - id: my_str
    # size: _root.some_secret_value # <= won't work due to file boundaries
    size: my_secret_value # <= use it here
```
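To see why this works, here is a rough JS analogue of what `imported_type` effectively does with the parameter. The shape of real ksc-generated constructors differs, so treat this as an illustration only:

```javascript
// Illustration only: the param takes the place of the _root access.
// Real generated parsers have a different constructor signature.
function ImportedType(bytes, mySecretValue) {
  this.mySecretValue = mySecretValue;
  // seq: `size: my_secret_value` reads that many bytes into my_str
  this.myStr = bytes.slice(0, mySecretValue);
}

const t = new ImportedType(Buffer.from("hello"), 3);
console.log(t.myStr.toString()); // hel
```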
@GrayCat, can `_io`, `_root` and other service objects (for example the chains proposed in #196) be passed this way (in all runtimes)?
@KOLANICH `_io` and `_root` should be ok. `_io` would need a declaration like `type: io`, and `_root` could be passed with `type: struct` (or, with an explicit circular import, with the full proper type name).
Also please don't bother GrayCat ;)
I tried the approach using parameterized types like you suggested and got the following error message, which is not very clear to me. Could you explain what it actually means, so that I have a clue what I'm doing wrong?
```
fmt_oms_rec: /meta/imports/18: fmt_oms_control_info (of class java.lang.String)
```
Additionally, I thought of other possibilities to split my large file into pieces, like you suggested in the past, e.g. by simply splitting the file at the text level and concatenating it in some kind of build script. Such an approach can be seen as a pre-processor step, and I searched a bit around that keyword to see if YAML already provides mechanisms to do so. I've found some approaches in that direction and wanted to ask if there's any interest in discussing such an approach as well? Not as a replacement for imports etc. of course, but more as an additional layer which might be easier to implement in the compiler than dealing with the current limitations around enums and such. This might save people from reinventing the wheel for use cases in which the main goal is only to reduce lines of code in a KSY file, for which some YAML-level include might already be enough.

So, any interest, or is it a no-go? I'm fine if it's the latter, of course; I was just thinking in some directions.
> Could you explain what that actually means so that I have a clue what I'm doing wrong?
Looks like a bug in the compiler which hides the actual exception. Please try compiling it with `--ksc-exceptions` enabled and post the whole trace?
> I've found some approaches in that direction and wanted to ask if there's any interest in discussing such an approach as well?
I feel like implementing proper stuff like `a::b::c::enum_name` is probably faster than discussing and inventing alternative approaches. It's not like there's some hard blocker there or anything.
I think it's time to close this issue, because imports were implemented a long time ago.
I feel the need for multiple, somewhat coupled features, but I am not exactly sure how to approach the problem, so I'll try to describe them here; then we can discuss them and decide whether we should open distinct issues or not.
**External type / ksy inclusion**

For example, in the Mach-O format there is an ASN.1 structure in one of the code signature blocks. So I'd like to indicate in the `macho.ksy`, at this code signature field, that it requires the ASN.1 "module" and which type should be used from there. This way the ASN.1 module can be automatically imported, so it won't be the responsibility of the end user anymore to import every required module. (For example, the WebIDE could import the required additional .ksy files before compiling the parser code.)

Also, the user could determine what should happen if the module is missing: an error should be thrown, or the parsing should be skipped and the inner contents made available only as a byte array, etc.

But to reference these external modules we need...
**Module names and versions**

If we use unique names (e.g. `io.kaitai.serialization.asn1`) for every module instead of direct references (e.g. `.../serialization/asn1/asn1_der.ksy`), then module dependencies will be weakly coupled and can be resolved dynamically, even searching in multiple repositories (for example an internal repository and a public kaitai.io repository).

As these modules can change independently, we need some kind of versioning too, so we can make sure exactly the same module will be used, and thus the parsing won't suddenly break after an update. Maybe it's worth using the 'major version change breaks compatibility' rule here too, and otherwise trying to keep compatibility.
**Extension points for internal content**

Sometimes we need to invert the importing process. Let's take the example of a TCP packet. You cannot list every possible internal packet in the `tcp.ksy`, so we need some kind of extension / merging support. One of the options is that an included, but external, .ksy could extend the TCP module by adding new enums and switch cases. Another option is to indicate that one of the fields is an 'extension point', and a static or even dynamic loader could search through all the extensions which published that they can extend `io.kaitai.network.tcp_segment.body`, and select the appropriate one (by their filter expression or magic matching pattern, etc).

In the latter case the best would be if we could decouple the *extension points* from the *extensions*. For example, neither the TCP packet should know that it can accept an SSL extension, nor should the SSL extension know that it can be embedded into TCP, but we could use an external mapping file which lists these mappings. Although if we use a magic matching pattern (or at least give hints to the application), then maybe we don't need to couple things together manually at all.

I know it's a lot and chaotic at first, but I think it's worth starting to discuss these possible improvements.
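The "decoupled extension point" idea could be modeled as a registry keyed by extension point name, with each extension publishing a match predicate. Everything in this sketch (the names, the TLS byte range) is invented for illustration:

```javascript
// Hypothetical extension-point registry: extensions register a predicate
// (filter expression / magic match), and a dispatcher picks the first
// extension whose predicate matches the payload.
const registry = { "tcp_segment.body": [] };

function registerExtension(point, name, matches) {
  registry[point].push({ name, matches });
}

function dispatch(point, bytes) {
  const ext = registry[point].find(e => e.matches(bytes));
  return ext ? ext.name : "raw_bytes"; // fall back to an opaque byte array
}

// e.g. TLS records start with a content-type byte in 0x14..0x17
registerExtension("tcp_segment.body", "tls_record",
  b => b[0] >= 0x14 && b[0] <= 0x17);

console.log(dispatch("tcp_segment.body", Buffer.from([0x16, 0x03, 0x01]))); // tls_record
console.log(dispatch("tcp_segment.body", Buffer.from([0x00])));             // raw_bytes
```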