kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby
https://kaitai.io

"Big formats" model #165

Open GreyCat opened 7 years ago

GreyCat commented 7 years ago

We have been discussing this from time to time for probably half a year already, so I guess it's time to get this rolling.

The idea is simple: some formats (for example, the Adobe Photoshop .psd format, Java .class, Corel Draw .cdr, Adobe Flash .swf, Microsoft Office formats, etc.) are pretty complex, and developing .ksy files for them takes a lot of time and effort. It is pretty uncomfortable to develop such a format as a single file in a fork of our standard kaitai_struct_formats repo. It's a much better idea to have a one-repo-per-"big"-format development model.

This would potentially allow having a distinct space to store:

Having distinct repositories also helps a lot with collaboration, as you can just give out write access to anyone you want, skipping the longer pull request procedure of the main KSF repo.

Thus, I propose to discuss the overall "recommended" layout of such a repository: what it should and should not contain, how to name these repos, etc.

Cc @davidhicks @koczkatamas @LogicAndTrick @KOLANICH

GreyCat commented 7 years ago

To make this talk more substantial, here's my initial proposal and proof-of-concept: https://github.com/kaitai-io/ksy_java_bytecode

Points I'd like to discuss and make "standard":

davidhicks commented 7 years ago

I agree with the need to either bundle sample files or have an automated test mechanism which can download sample files from external websites.

For testing a .ksy specification against sample files, one approach could be to convert an input binary file into an XML document, YAML document or similar, which can then be compared against previously known test results with a common tool such as diff. Do you have any thoughts on whether this would be a suitable test approach? This kind of approach is commonly used to test relational database server software and/or schemas in use by various applications.
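
For illustration, a minimal Python sketch of this snapshot idea (assumes PyYAML; the gif module, its Gif class and the snapshot file name are hypothetical, and lazily-parsed instances are skipped for brevity):

import enum
import yaml
from kaitaistruct import KaitaiStruct
from gif import Gif  # hypothetical parser module generated from gif.ksy

def dump(obj):
    # recursively convert a parsed object tree into plain data structures
    if isinstance(obj, KaitaiStruct):
        return {k: dump(v) for k, v in sorted(vars(obj).items())
                if not k.startswith("_")}  # skip _io, _parent, _root, ...
    if isinstance(obj, list):
        return [dump(v) for v in obj]
    if isinstance(obj, bytes):
        return obj.hex()
    if isinstance(obj, enum.Enum):
        return obj.name
    return obj

actual = yaml.safe_dump(dump(Gif.from_file("sample.gif")), sort_keys=True)
expected = open("sample.gif.expected.yaml").read()
assert actual == expected, "dump diverged from the known-good snapshot"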

GreyCat commented 7 years ago

Yeah, that should be a more or less suitable approach. The Web IDE can already dump the whole structure recursively into a JSON file. In fact, building a recursive dumper in any language that has basic reflection capabilities should be pretty quick and easy. A few things to discuss here:

KOLANICH commented 7 years ago

I think a bit differently.

ksy_$NAME repo name, where $NAME is a main .ksy name

maybe $NAME.ksy (analogously to *.js for JS libraries, used even if a library has multiple files)?

/$NAME.ksy is the location of main .ksy file in this repository

The whole folder should contain ONLY the set of .ksy files specific to the format, plus an optional markdown doc. Unspecific files must go into the main kaitai_struct_formats repo.

Also, we could have a folder for tests there, but I'm not sure if we need this. Most Kaitai-generated parsers need some postprocessing, so I guess that the tests should go into the libraries' repos.

$TITLE spec for Kaitai Struct is GitHub repo name

what is $TITLE?

and, most important: 1. the main repo should incorporate standalone repos as submodules; 2. in standalone repos, import paths should assume the main repo as the global root, to be able to use the formats published there.

Postprocessing libraries

In order to build a postprocessing library, the library's repo has the main ks_formats repo as a submodule. It updates the submodules, and that way it gets all the repos connected to it, including the one used by the library. This has a drawback: you have to download the whole repo and all its subrepos, which can consume a lot of space and time. The better solution is to download only the .ksy files needed. Fortunately, this can be done within this approach with the help of a simple Python (or any other) script doing graph traversal and HTTP queries. The optimal way to fetch the repo is chosen by the postprocessing lib dev by passing a parameter to the fetching and compiling script. This script should be integrated with the language's package manager; for example, I have started development of a setuptools plugin for Python.
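
For illustration, a rough Python sketch of such a graph traversal, fetching each spec and its imports over HTTP (the raw-GitHub base URL and the starting spec are just examples; relative "../" imports are not handled here):

import urllib.request
import yaml

BASE = "https://raw.githubusercontent.com/kaitai-io/kaitai_struct_formats/master/"

def fetch_spec(path, seen=None):
    # download `path` (e.g. "image/gif") plus everything it imports,
    # walking the import graph instead of cloning the whole repo
    seen = set() if seen is None else seen
    if path in seen:
        return seen
    seen.add(path)
    with urllib.request.urlopen(BASE + path + ".ksy") as resp:
        spec = yaml.safe_load(resp.read())
    for imp in spec.get("meta", {}).get("imports", []):
        fetch_spec(imp.lstrip("/"), seen)  # leading "/" means "from KSF root"
    return seen

print(fetch_spec("image/gif"))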

GreyCat commented 7 years ago

maybe $NAME.ksy

Hmm, I think I like it — clear and concise.

The whole folder should have ONLY the set of .ksy files specific to the format and optional markdown doc.

Sorry, I don't quite understand. Probably we're talking about the same thing: the main .ksy should be placed in the root directory of a repository, and additional .ksy files should be placed there too (if needed).

Unspecific files must go into the main kaitai_struct_formats repo.

What exactly are "unspecific files"?

Also, we could have a folder for tests there, but I'm not sure if we need this.

We do, that's the idea.

Most Kaitai-generated parsers need some postprocessing, so I guess that the tests should go into the libraries' repos.

Postprocessing? Generally, most of the .ksy files available in the kaitai_struct_formats repo are ready to be used as is.

1. the main repo should incorporate standalone repos as submodules

I'm not sure that's a good idea. Generally, end users are not that interested in all the gory details of .ksy development; they just need a finished product. So, for now, I'm proposing to just copy the "finished product", i.e. the .ksy file, to the KSF repo from time to time.

2. in standalone repos, import paths should assume the main repo as the global root, to be able to use the formats published there.

All formats can universally assume that; it's not a matter of a "standalone repo" or just some random .ksy file in a random path.

KOLANICH commented 7 years ago

the main .ksy should be placed in the root directory of a repository, and additional .ksy files should be placed there too (if needed).

I suggest having no main .ksy. Which one is "main" is a decision made by the library developer. The layout of the repo is whatever its owner decides: it can contain all the files in a single folder, or it can contain a hierarchy.

What exactly are "unspecific files"?

The files which are likely to be useful for formats other than the ones in this repo. Building blocks. Examples can now be found in the common folder of kaitai_struct_formats.

Postprocessing? Generally, most of the .ksy files available in the kaitai_struct_formats repo are ready to be used as is.

Yes, they are, but in most cases it is useful to have a library doing the rest of the work. For example, in most cases we want spectrum data to be loaded as a numpy array, not a Python list. So we need some code checking for the presence of numpy and using it if it is available. In the specpr format, we need to glue some adjacent records together to get continuous spectrum data. The postprocessing library does this. Since we have to test that library anyway, and since the library will malfunction if the .ksy is wrong, I think the tests in the library repo will be enough.
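
To make the numpy part concrete, a tiny sketch of such optional postprocessing (the record and field names are made up, not the real specpr structures):

try:
    import numpy as np
except ImportError:
    np = None  # numpy is optional; fall back to plain Python lists

def glue_spectrum(records):
    # concatenate the data of adjacent records into one continuous spectrum
    values = []
    for rec in records:
        values.extend(rec.data)  # hypothetical per-record chunk of values
    return np.asarray(values, dtype=np.float64) if np else values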

Generally, end users are not that interested in all the gory details of .ksy development; they just need a finished product. So, for now, I'm proposing to just copy the "finished product", i.e. the .ksy file, to the KSF repo from time to time

By convention, the repo's master branch should be a finished product. A bleeding-edge, but finished one. For unfinished ones, we can have separate branches.

All formats can universally assume that; it's not a matter of a "standalone repo" or just some random .ksy file in a random path.

It is the convention.

KOLANICH commented 7 years ago

Do you have any thoughts on whether this would be a suitable test approach? This kind of approach is commonly used to test relational database server software and/or schemas in use by various applications.

Yes, I have thought about this (but with BSON instead of a text-based format, since we are dealing with binaries and since it allows some extension).

davidhicks commented 7 years ago

How will this model work with https://ide.kaitai.io ?

If someone wants to enhance or use an existing file format specification, will the import statements be able to load the correct .ksy files? And if the user wants to add extra .ksy files for new components of the file format specification, will these be picked up and work with the import statements correctly?

koczkatamas commented 7 years ago

That part of the WebIDE will be completely rewritten. I don't know yet when, but on the devel branch there is already an implementation where you can directly load .ksy files from your custom GitHub repo, and you will be able to commit the changes.

We will solve the import question too; probably there will be a search order of data sources, and the first one where the file is found will be used.
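
For illustration, first-match resolution over an ordered list of data sources could look roughly like this (the sources and file names here are made up):

sources = [
    {"gif.ksy": "local working copy"},       # the user's own project
    {"gif.ksy": "copy from a custom repo"},  # custom GitHub repo
    {"common/bcd.ksy": "from kaitai_struct_formats"},
]

def resolve_import(name):
    for source in sources:       # try data sources in priority order
        if name in source:
            return source[name]  # the first source that has the file wins
    raise FileNotFoundError(name)

print(resolve_import("gif.ksy"))  # -> "local working copy"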

GreyCat commented 7 years ago

I've hacked up a quick recursive dumper that allows dumping to YAML, JSON and XML (more or less standard pretty-printer calls of the standard libraries). It is available now as ksdump, inside the ksv visualizer repo (it reuses some of the visualizer code to run the compiler & load data).

Some results for you to compare:

GreyCat commented 7 years ago

A few things I want to note:

KOLANICH commented 7 years ago

On 4 May 2017 at 19:33:18 GMT+03:00, Mikhail Yakshin notifications@github.com wrote:

A few things I want to note:

  • This dumper fails badly with a stack overflow on infinitely recursive structures (for example, iso9660.ksy)
  • File sizes:
    • 16677 — wheel.yaml
    • 27075 — wheel.json (~1.6x to YAML)
    • 46994 — wheel.xml (~2.8x to YAML)
  • YAML wraps long lines with large raw hex dumps; XML and JSON don't do that
  • XML is very verbose and noisy (e.g. there is no way to specify a data type except by adding an extra "type"-like attribute), and it also mangles our lower_underscore_case into lower-minus-case
  • JSON has a diffing problem: adding another attribute will likely replace a trailing "}" with "},", thus making an extra diff line; but, arguably, it is the most legible of these three choices
  • Default libyaml output is somewhat ugly :(
  • There is no simple way to control the order of output, besides digging deep into the YAML/XML/JSON library and making our own low-level serialization routines. My current attempt is supposed to be alphabetical (to keep it stable), but even that is actually not guaranteed :(

There is no simple way to control the order of output

Arrays are guaranteed to be ordered.

BTW, how about BSON?

GreyCat commented 7 years ago

Arrays are guaranteed to be ordered.

Of course.

BTW, how about BSON?

BSON would probably be worse than these three: you're more or less unable to view BSON without special tools, there are virtually no BSON diffing utilities (and even if you came up with one, GitHub wouldn't know a thing about diffing BSONs), it's even less widespread and understood than XML/YAML/JSON, etc.

GreyCat commented 7 years ago

After playing with YAML dumping for some time, I've realized that YAML is not immune to diff problems either. Say, something like this:

- bar: 1
  foo: 2

will get transformed into

- aaa: 0
  bar: 1
  foo: 2

if we're adding an aaa key (and it gets to be first due to ordering). That's a 2-line diff, while actually only 1 attribute was added.

milahu commented 1 year ago

For testing a .ksy specification against sample files, one approach could be to convert an input binary file into an XML document, YAML document or similar, which can then be compared against previously known test results with a common tool such as diff.

aka snapshot testing

see also gron - make json greppable (and diffable)

the main .ksy should be placed in the root directory of a repository, and additional .ksy files should be placed there too (if needed).

I suggest having no main .ksy. Which one is "main" is a decision made by the library developer. The layout of the repo is whatever its owner decides: it can contain all the files in a single folder, or it can contain a hierarchy.

add a /manifest.json for kaitai-struct-compiler?

{
  "type": "kaitai_struct_format",
  "name": "java_bytecode",
  "version": "0.1",
  "main": "src/java_bytecode.ksy",
  "import_paths": [
    "https://github.com/kaitai-io/kaitai_struct_formats"
  ],
  "scripts": {
    "test": "./tests/run.sh"
  }
}

... or keep the .ksy files in https://github.com/kaitai-io/kaitai_struct_formats and only move the test files out, to keep the main repo small

meta:
  id: some_format
  tests: https://github.com/kaitai-io/kaitai_struct_formats_some_format_tests

KOLANICH commented 1 year ago

keep the .ksy files in https://github.com/kaitai-io/kaitai_struct_formats

I currently do this. IMHO it is most convenient to just fork KSF and develop formats in it, and then regularly rebase over master. This way my specs get the updates from all the dependencies. Of course, even better is to merge the specs into the upstream.

https://github.com/kaitai-io/kaitai_struct_formats_some_format_tests

There exists https://github.com/kaitai-io/kaitai_struct_samples , but yes, if we put all the files there, it will soon get pretty big. So if we centralize samples in a single repo, I guess Git LFS is a necessity. But LFS is a repo feature whose usage is limited. Though I guess Git LFS is a "free" feature on HuggingFace.co, I'm not sure whether it wouldn't be abuse to use that website for anything other than machine learning models.

add a /manifest.json for kaitai-struct-compiler?

I guess not JSON, but YAML, since we already use YAML.

"name": "java_bytecode",

Nope, each ksy spec contains a meta.id.

"import_paths" "main": "src/java_bytecode.ksy",

Surely needed. My kaitaiStructCompile.py parses analogues of all of this from its own sections in pyproject.toml.

"scripts": {
  "test": "./tests/run.sh"
}

I guess, no. Maybe

samples: "dir/with/tests/relative/to/repo/root"

Samples should contain subdirs with names matching the spec ids. Each subdir should contain a test suite in FileTestSuite format, mapping binary files to their serializations.

or

samples:
  repo: "https://github.com/kaitai-io/kaitai_struct_samples"
  refspec: "master"
  path: "dir/with/tests/relative/to/repo/root"

if samples live in a foreign repo.

milahu commented 1 year ago

"scripts": {
  "test": "./tests/run.sh"
}

my idea was that some binary files can be generated, for example: an sqlite database, an ext2 filesystem, a png image - also useful for fuzz testing, or for generating extremely large files (for testing limits and performance)
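
For example, a minimal sqlite test file can be generated with just the Python standard library (the file name is arbitrary):

import sqlite3

con = sqlite3.connect("minimal.sqlite")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")
con.executemany("INSERT INTO t (name) VALUES (?)", [("a",), ("b",)])
con.commit()
con.close()
# the result is a valid sqlite3 database file a .ksy spec can be tested against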

risk: malicious code hidden in a complex test script

so yeah, it's easier to have a test corpus of static files and expected results, and store test scripts in a separate location

KOLANICH commented 1 year ago

my idea was that some binary files can be generated

Good idea, really useful for fuzzing. But I guess in this case the script should not be called by the test runner... instead, the test runner should be called by the script. Something like:

# paths in the metadata point to non-existent files
./generateTestFiles ./ks-repo.yaml # generates the test files at the expected paths; after this, the paths in the metadata point to existing files
kaitai-test-runner ./ks-repo.yaml --junitxml=result.xml

For most of the big test files, I guess it is possible to use sparse files: files taking less space on disk than their apparent size, with the holes filled with zeros. Some legacy file systems don't support them, but NTFS (I guess except for some old Windows versions, though I cannot exclude that they implemented them the same way symlink support was implemented), APFS, ext4, btrfs and zfs do. Git supports them and automatically sparsifies files that were not explicitly sparsified with fallocate.
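
A quick Python sketch of creating such a sparse file: seek past the hole and write a single byte, so the file has a ~1 GiB apparent size but occupies almost no disk blocks on filesystems with sparse-file support:

with open("huge_sample.bin", "wb") as f:
    f.seek(2**30 - 1)  # jump ~1 GiB forward without writing anything
    f.write(b"\0")     # one real byte at the end; the rest is a hole
# `ls -l` reports ~1 GiB, while `du -h` reports a few KiB on ext4/btrfs/zfs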

generalmimon commented 8 months ago

@KOLANICH:

There exists https://github.com/kaitai-io/kaitai_struct_samples , but yes, if we put all the files there, it soon will be pretty big. So if we centralize samples in a single repo, I guess Git LFS is a necessity.

My idea (maybe naive and impractical, but whatever) was that in 99% of formats, you don't need huge binaries to test all the structures defined in a .ksy file, i.e. headers, different chunk types, etc. Usually, several compact files of a few KiB each per format are enough for that (and Git can absolutely handle a lot of these out of the box, without the need for Git LFS or anything).

In most cases, a big file doesn't bring any benefits in terms of "code coverage" (or format structure coverage) over a small one. Usually, it's only big because either:

That's the theory. In practice, unfortunately, trying to get these "small" high-quality files may be difficult, if not impossible. Many samples we can take from the internet will not meet these criteria, and generating/crafting our own files requires a lot of knowledge and time. So I understand the simplicity of adding whatever file we can find into the collection, regardless of how big it is, but yeah, the Git repo may become unusable over time after adding a bunch of larger files.

KOLANICH commented 8 months ago

Usually, several compact files of a few KiB each per format are enough for that (and Git can absolutely handle a lot of these out of the box, without the need for Git LFS or anything).

The idea was that Git LFS stores hashes within the repo, and the files themselves are stored separately. It allows the Git repo itself to be small, and the files to be stored and populated separately. But thinking more about this, sparse checkout should also do the trick. However, sparse checkouts and partial clones have terrible UX, making them unusable anywhere the commands have to be typed manually.

generalmimon commented 8 months ago

@KOLANICH:

But LFS is a repo feature whose usage is limited.

It looks like it - and quite severely, at least on GitHub. From https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-storage-and-bandwidth-usage I understand that if you have a 10 MiB file as a Git LFS object and 103 downloads of this file within a month, you've run out of the monthly free bandwidth (103 × 10 MiB = 1030 MiB, just over the 1 GiB quota), and Git LFS stops working until the next month.

generalmimon commented 8 months ago

So using Git LFS on GitHub Free doesn't seem to bring any advantages - it's just more limited. Normally, you can have files up to 100 MiB:

GitHub blocks files larger than 100 MiB.

and it's strongly recommended to keep the repository below 5 GiB:

We recommend repositories remain small, ideally less than 1 GB, and less than 5 GB is strongly recommended.

Although Git LFS technically gives you the ability to store files larger than 100 MiB, in practice you can't do this anyway, because one such file means that the repository can only be cloned 10 times per month.