kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby
https://kaitai.io

Parsing from multiple files #125

Open mnakamura1337 opened 7 years ago

mnakamura1337 commented 7 years ago

I'm working on reverse engineering of a container that essentially consists of 2 files:

The only way to access data in the second file is to use the first file. However, if I'm not mistaken, Kaitai Struct does not allow accessing extra files while parsing. So, I want to propose something like:

seq:
  - {id: filename, size: 16, type: str}
  - {id: offset, type: u4}
  - {id: len, type: u4}
instances:
  file_body:
    filename: '"body.dat"' # expression language for flexibility
    pos: offset
    size: len

In fact, I've encountered quite a few such multi-file formats in the last few months or so. I believe it would be a very useful addition to Kaitai Struct.

GreyCat commented 7 years ago

The first major problem would be that "file" concept does not exist on some platforms, i.e. JavaScript. What do we do with that?

KOLANICH commented 7 years ago

Parsing from multiple files

I have thought about the same for a .cab format parser. I guess we don't need this built in; we shouldn't put everything in the world into KS. We need some way to extend KS with our own plugins. That discussion is worth a separate issue.

mnakamura1337 commented 7 years ago

The first major problem would be that "file" concept does not exist on some platforms, i.e. JavaScript

Node.js includes tons of file reading functions. We can just skip implementing browser-compatible code for now. You're not even testing it in browsers, I believe.

GreyCat commented 7 years ago

Generally, it all boils down to creating a new KaitaiStream object for an instance (akin to io: XXX) that will open and start reading from a new file. I'm not really sure about filename: ... though. I'm pondering the idea of some factory method to create new streams from files, so that we could reuse it everywhere, e.g. io: local_file("body.dat") (name and syntax chosen totally randomly, of course).

jfenal-zz commented 5 years ago

I would second Nakamura-san's request: Adabas databases rely on multiple files, some being text, describing the fields, others binary (index and data). I still need to have a local instance of kaitai to work for me and give it a try without, but those files seem to be quite related.

savagesteel commented 4 years ago

I'm also interested in being able to parse multiple files in Kaitai Struct for the following formats:

https://github.com/savagesteel/d1-file-formats

burner1024 commented 1 year ago

Is this still the case? No way to define and parse linked/connected file formats together?

generalmimon commented 1 year ago

@burner1024:

No way to define and parse linked/connected file formats together?

Actually, there is a way. The main idea is to pass the streams of related files via top-level type: io parameters and then you typically use instances with the io key to parse anything you want from that stream (see Absolute positioning in the User Guide; here it's demonstrated on the typical use case of parsing from the root stream, but it'll work with an arbitrary stream just as well).

For reference, see https://github.com/kaitai-io/coreldraw_cdr.ksy, which uses this trick. As explained in the README there, a .ksy spec with top-level type: io params unfortunately cannot be used in visualizers, because they don't provide any way to pass the streams. To work around that in coreldraw_cdr.ksy, I wrote a simple Bash script, bin/cdr-unpk, which dumps the contents of all required files in a custom "archive" format described in cdr_unpk.ksy, which becomes the new entry point. This enables full use of coreldraw_cdr.ksy in visualizers even though it depends on external files.

However, when using the generated parser in an application, specifying values for the top-level parameters is no problem, and it is of course easier than going through an auxiliary dump format like cdr_unpk.ksy. So it's better to use the variant of coreldraw_cdr.ksy that has the top-level type: io[] parameter (and avoid cdr_unpk.ksy altogether); see "Standalone use of coreldraw_cdr.ksy from your application".
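To make the idea concrete, here is a rough, hand-written Python sketch of what "pass the second file's stream in as a parameter, then read from it at a given position" amounts to. It is not code generated by Kaitai Struct and does not use the Kaitai runtime; it only uses the standard library. The entry layout (16-byte name, u4 offset, u4 length) is borrowed from the proposal at the top of this thread, and the little-endian byte order, the ASCII name encoding, and the names Entry, parse_index and body_io are assumptions made purely for illustration.

```python
import io
import struct
from dataclasses import dataclass
from typing import BinaryIO, List

# Assumed entry layout, taken from the proposal earlier in the thread:
# 16-byte name, u4 offset, u4 length. Little-endian is an assumption.
ENTRY = struct.Struct("<16sII")


@dataclass
class Entry:
    filename: str
    offset: int
    length: int
    body_io: BinaryIO  # stream of the second file, supplied by the caller

    @property
    def file_body(self) -> bytes:
        # Similar in spirit to an instance with `io`, `pos` and `size`:
        # seek in the *other* stream, read, then restore its position.
        saved = self.body_io.tell()
        try:
            self.body_io.seek(self.offset)
            return self.body_io.read(self.length)
        finally:
            self.body_io.seek(saved)


def parse_index(index_io: BinaryIO, body_io: BinaryIO) -> List[Entry]:
    """Read fixed-size entries from index_io; a trailing partial entry is ignored."""
    entries = []
    while True:
        raw = index_io.read(ENTRY.size)
        if len(raw) < ENTRY.size:
            break
        name, offset, length = ENTRY.unpack(raw)
        entries.append(
            Entry(name.rstrip(b"\x00").decode("ascii"), offset, length, body_io)
        )
    return entries


# In-memory streams stand in for the two files, so no file system is needed.
body = io.BytesIO(b"HELLOWORLD")
index = io.BytesIO(ENTRY.pack(b"a.txt", 0, 5) + ENTRY.pack(b"b.txt", 5, 5))
entries = parse_index(index, body)
```

Because the caller hands over already-opened streams, the parsing code never needs to know about file names or a file system, which is also why this approach sidesteps the earlier concern about platforms without a "file" concept.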

burner1024 commented 1 year ago

Thanks! I'll try that.