kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby
https://kaitai.io

Support of packaging of generated modules for target languages #1006

Open KOLANICH opened 1 year ago

KOLANICH commented 1 year ago

An issue somewhat related to #339, but a bit different.

When we transpile a spec that uses other specs via imports, source code is generated for all of them.

a -> c, d;
b -> c, e;

The specs a and b both depend on the shared spec c; a also needs the spec d, and b also needs the spec e.
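To make the example concrete, the dependency lists above can be written down as a plain mapping and queried for the specs that more than one root spec pulls in. This is only a minimal Python sketch; the `imports` dict and the `shared_specs` helper are illustrative names and not part of KSC.

```python
from itertools import combinations

# Direct imports of each spec, as in the a/b example above.
imports = {
    "a": {"c", "d"},
    "b": {"c", "e"},
}

def shared_specs(imports):
    """Return the specs imported by more than one spec."""
    shared = set()
    for left, right in combinations(sorted(imports), 2):
        shared |= imports[left] & imports[right]
    return shared

print(sorted(shared_specs(imports)))  # ['c']
```

Only c is shared here, which is why it is the one spec that would end up duplicated if it were vendored into both packages.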

Let's imagine that the code generated from the specs is integrated into software, and that a and b are each to be put into their own package.

If we vendorize c into each of them, it will cause a number of issues:

So we need a package for c.

Let's also imagine that d and e are specs introduced just for convenience and tightly coupled with a and b respectively, and so should go into the same packages as a and b.

When I transpile specs into Python, I usually place the results into a dedicated subdir of my package and pass the arguments that make the transpiled code use relative imports (imports from that dir, from . import module_name).

So we need:

  1. a mechanism to define a mapping spec <-> target language package;
  2. a mechanism to define which target language packages are generated in this compilation;
  3. the documents specifying the structure and conventions of target language packages containing code generated from KS specs;
  4. machinery within KSC generating the right imports;
  5. automation of generation of the packages;
  6. storage of pretranspiled packages.

Ideas on the impl:

  1. let's define a KS specs index as a repo like https://github.com/kaitai-io/kaitai_struct_formats
  2. each KS specs index can contain a file in its root specifying the mapping. The mapping can be in any JSON-like serialization format KSC supports for specs (currently only YAML). Since in this model each KS spec can belong to only 1 target language package, and each target language package can contain multiple specs, the keys are the names of target language packages and the values are ids of specs. KSC should have a flag for scanning the specs index, adding the specs missing from this file and removing the entries for specs that are missing from the index. The file with the serialization is the source of truth. KSC can also create a binary database file next to it, with the same filename stem but a different extension, in order to look up specs faster than by parsing and converting the mapping file on each KSC run.
  3. by default the legacy behavior is kept. The new behavior is enabled by a short and simple CLI argument. In this mode the compiler can accept either the name of a package or a path to a ksy. For the purposes of this issue, the specs a user has directly asked the compiler to generate code from are called root specs. When generating source code from a ksy, KSC
     a) looks up the id of the root spec in the mapping. If the id is not in the mapping, it assumes that a package with a name matching the spec id is being generated, and checks whether that name already exists in the mapping as the name of a target language package. If it does, this is an error.
     b) doesn't generate code for any spec that is not a root one.

If a user asks for a package to be generated, KSC reads the mapping file, determines the specs to be transpiled, and transpiles them.
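The lookup rules above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not KSC behavior: the mapping is shown as an already parsed dict (package name -> spec ids), the package names `ks_a`, `ks_b`, `ks_c` are made up, and all function names are hypothetical.

```python
# Hypothetical content of the mapping file after parsing:
# target language package name -> ids of the specs it contains.
MAPPING = {
    "ks_a": ["a", "d"],
    "ks_b": ["b", "e"],
    "ks_c": ["c"],
}

def invert(mapping):
    """Build spec id -> package name; a spec may belong to only one package."""
    spec_to_package = {}
    for package, spec_ids in mapping.items():
        for spec_id in spec_ids:
            if spec_id in spec_to_package:
                raise ValueError(
                    f"spec {spec_id!r} is listed in both "
                    f"{spec_to_package[spec_id]!r} and {package!r}"
                )
            spec_to_package[spec_id] = package
    return spec_to_package

def package_for_root(spec_id, mapping):
    """Resolve the package of a root spec, following step a)."""
    spec_to_package = invert(mapping)
    if spec_id in spec_to_package:
        return spec_to_package[spec_id]
    # Unmapped spec: the package is assumed to be named after the spec id,
    # which must not collide with a package name already in the mapping.
    if spec_id in mapping:
        raise ValueError(f"{spec_id!r} is already the name of a package")
    return spec_id

def specs_for_package(package, mapping):
    """Specs to transpile when the user asks for a whole package."""
    return list(mapping[package])

print(package_for_root("d", MAPPING))    # ks_a
print(specs_for_package("ks_b", MAPPING))  # ['b', 'e']
```

Keeping the package -> specs form as the stored format and deriving the reverse lookup on load matches the idea that the serialized file is the source of truth.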

  1. In general, each target language source package
     a) uses the language-specific build tools to be built;
     b) is a repo in a VCS/SCM;
     c) contains the KSC-generated recipe files;
     d) has a layout that allows importing its subspecs as target_language_package_name.KS_spec_id;
     e) can contain unit tests, which don't go into the binary package;
     f) follows a naming convention: the package name should contain a ks- (or maybe kaitai-, or kaitai-struct-) prefix (or suffix), maybe in the style appropriate to the target language (camelCase, snake_case and so on);
     g) produces binary packages that are built from the recipes, use dynamic linkage and are properly installed;
     h) inherits most of its metadata from the root spec;
     i) has recipe files that contain proper information about dependencies.

  2. When code is generated, for each spec KSC determines a package it belongs to. All the specs that are from the current package are imported using relative import syntax. All the specs from other packages are imported using global import syntax.

  3. a user specifies an output dir in which KSC creates dirs that can be finalized into binary packages by the proper tooling. It also creates a file with a topologically sorted graph of package names, each prefixed with an action. At the user's choice (configurable via a switch), it can be in a graph format or a simple topologically sorted list of actions. There are 2 actions, `b` standing for `build` and `d` standing for `dep`, separated from the package name by a space. If there are no spaces in a line, the action is `build`. Then a tool acting according to the following algorithm can be run against the dir:
     a) a line is read from the file;
     b) if it is a `dep`, it is installed (or it is verified that it is installed in the system), if the target needs it. If the target doesn't need "headers", the command can simply be ignored;
     c) if it is a `build`, a tool creating a binary package is run against the dir, and the resulting package is then installed, if the target needs that. Tests can be run here;
     d) the next line is read.

The simple list format is simple enough to be read by a bash script. The graph-based format allows parallelization.
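A possible shape for the simple list format, sketched in Python (3.9+ for `graphlib`). Everything here is an assumption made for illustration: the package names, the function names, and the exact line layout (`b <package>` / `d <package>`, with a bare package name meaning build) follow the description above but are not an existing KSC feature.

```python
from graphlib import TopologicalSorter

def ordered_actions(package_deps, to_build):
    """Return action lines, dependencies first.

    package_deps maps a package to the packages it depends on;
    packages in to_build get 'b', all others are treated as deps ('d').
    """
    order = TopologicalSorter(package_deps).static_order()
    return [f"{'b' if pkg in to_build else 'd'} {pkg}" for pkg in order]

def parse_line(line):
    """Split a line into (action, package); no space means build."""
    line = line.strip()
    if " " not in line:
        return "build", line
    code, package = line.split(" ", 1)
    return {"b": "build", "d": "dep"}[code], package

def run(lines, build, install_dep):
    """Walk the list in order and dispatch each line to a callback."""
    for line in lines:
        action, package = parse_line(line)
        if action == "build":
            build(package)
        else:
            install_dep(package)

# Hypothetical packages: ks_a and ks_b are built, ks_c is an existing dep.
lines = ordered_actions(
    {"ks_a": {"ks_c"}, "ks_b": {"ks_c"}},
    to_build={"ks_a", "ks_b"},
)
run(lines, build=lambda p: print("build", p), install_dep=lambda p: print("dep", p))
```

The relative order of ks_a and ks_b is not fixed by the graph, which is the freedom a graph-based format could exploit for parallel builds.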

The info in these ordering files is redundant: the same info is contained in the packages' metadata. But having it combined in one place means the build scripts don't have to scan the dirs for packages or parse their metadata.
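Going back to the import rule (relative imports for specs in the same package, global imports for specs from other packages), the decision for the Python target could look roughly like this. It is a sketch only: `import_statement` and the spec -> package table are hypothetical, and the emitted forms assume the layout target_language_package_name.KS_spec_id described above.

```python
# Hypothetical spec id -> package name table derived from the mapping file.
SPEC_TO_PACKAGE = {
    "a": "ks_a", "d": "ks_a",
    "b": "ks_b", "e": "ks_b",
    "c": "ks_c",
}

def import_statement(current_spec, imported_spec, spec_to_package):
    """Pick a relative or a global import for one imported spec."""
    current_pkg = spec_to_package[current_spec]
    imported_pkg = spec_to_package[imported_spec]
    if current_pkg == imported_pkg:
        # Same package: relative import syntax.
        return f"from . import {imported_spec}"
    # Different package: global import through the package name.
    return f"from {imported_pkg} import {imported_spec}"

print(import_statement("a", "d", SPEC_TO_PACKAGE))  # from . import d
print(import_statement("a", "c", SPEC_TO_PACKAGE))  # from ks_c import c
```

Other target languages would need their own equivalents of the two branches, but the package comparison itself stays the same.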

  1. a dedicated GitHub organization can be created, with one repo per target language package. The target language of a package can be distinguished by the repo name suffix, for example .py for Python packages and .cpp for C++ ones. KSF CI can automatically update them based on the changes in the KSF repo and KSC.

Note on C++: for C++ it is proposed to use CMake CPack as the recipe format.