kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby
https://kaitai.io

Container special type #340

Open KOLANICH opened 6 years ago

KOLANICH commented 6 years ago

Some formats are containers. The simplest example is an archive, which can contain files of any type. We may want to declare that a chunk of memory holds data in any format, which should be detected and parsed. This would be useful for applications such as a Kaitai-powered binwalk that parses everything it finds in a file.

So I propose to add a special built-in type whose instantiation passes control to an algorithm that tries to match the blob against all the signatures (see #225) in the library. If a signature matches, it tries to parse the blob; if parsing succeeds (including passing all the checks, #81), the format is assumed to be guessed correctly.
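The match-then-parse-then-validate flow described above can be sketched roughly as follows. This is only an illustration, not how Kaitai Struct works today: `ValidationError`, `parse_png`, `parse_gzip`, `SIGNATURES` and `autodetect` are hypothetical names, the two parsers are stand-ins for generated parser classes, and signatures are assumed to sit at offset 0 of the blob.

```python
from typing import Optional


class ValidationError(Exception):
    """Raised by a (hypothetical) parser when a format check fails."""


def parse_png(blob: bytes) -> dict:
    # Stand-in for a generated parser: reject blobs too short to hold
    # the 8-byte signature plus the first chunk header.
    if len(blob) < 16:
        raise ValidationError("truncated PNG header")
    return {"format": "png", "size": len(blob)}


def parse_gzip(blob: bytes) -> dict:
    # Stand-in for a generated parser: the third byte is the compression
    # method, and 8 (deflate) is the only value in common use.
    if len(blob) < 3 or blob[2] != 8:
        raise ValidationError("unsupported compression method")
    return {"format": "gzip", "size": len(blob)}


# Hypothetical signature library: (magic bytes at offset 0, parser).
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", parse_png),
    (b"\x1f\x8b", parse_gzip),
]


def autodetect(blob: bytes) -> Optional[dict]:
    """Return the first successful parse, or None if nothing fits."""
    for magic, parser in SIGNATURES:
        if not blob.startswith(magic):
            continue  # signature does not match, try the next format
        try:
            return parser(blob)  # signature matched, attempt a full parse
        except ValidationError:
            continue  # signature matched but checks failed: wrong guess
    return None  # unknown format: the caller keeps the raw blob
```

A blob that matches a signature but fails validation falls through to the next candidate, so a short magic number alone is never enough to declare a match.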

Obviously it will require RTTI in C++.

I don't know a good name for this kind of type. Maybe container or signature_matcher, but I don't really like either of them.

GreyCat commented 6 years ago

An interesting idea! My proposals for the type name would be type: autodetect or type: auto.

A practical implementation is probably pretty far away, though. Not only are "signature only" checks and a general format validation framework needed, but you'll also need to build and maintain some sort of repository of "autodetectable file formats", I guess, and include them all in any parser that uses this feature. Also, there should be some way to actually disable this "deep auto-detect" parsing, as the vast majority of people who research container formats are not interested in parsing the JPEG/PNG/MP3/whatever files inside; they're perfectly OK with exporting them as is (and using them later with standalone software). Last but not least, it's probably worth implementing full lazy parsing first: #133.

KOLANICH commented 6 years ago

The repository of autodetectable formats is kaitai_struct_formats (or any other of the user's choice). If this feature is enabled, the compiler scans the repository, finds all the signatures, builds a signature -> ksy file mapping, generates a finite automaton for matching signatures, and emits the code into a separate file. This behavior should be disabled by default, and of course there should be a hook to replace it with something that scans a directory for modules at runtime. If the feature is disabled, containers are just blobs.
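As a rough sketch of the compiler step described above, the signature mapping can be turned into a byte trie, which is the simplest kind of finite automaton for prefix matching. Everything here is an assumption for illustration: the names `build_trie`, `match`, `detect`, `SIGNATURE_MAP` and `AUTODETECT_ENABLED` are hypothetical, the .ksy file names are examples only, and signatures are again assumed to start at offset 0.

```python
def build_trie(signature_map):
    """Build a byte trie from a {magic bytes: ksy file name} mapping."""
    root = {"next": {}, "ksy": None}
    for magic, ksy in signature_map.items():
        node = root
        for byte in magic:
            # Create the transition for this byte if it does not exist yet.
            node = node["next"].setdefault(byte, {"next": {}, "ksy": None})
        node["ksy"] = ksy  # accepting state: a full signature ends here
    return root


def match(trie, blob):
    """Return the ksy names of all signatures prefixing blob, longest first."""
    node = trie
    hits = []
    for byte in blob:
        node = node["next"].get(byte)
        if node is None:
            break  # no transition for this byte: stop scanning
        if node["ksy"] is not None:
            hits.append(node["ksy"])
    return hits[::-1]


# Illustrative mapping; a real one would be collected from the repository.
SIGNATURE_MAP = {
    b"\x89PNG\r\n\x1a\n": "png.ksy",
    b"PK\x03\x04": "zip.ksy",
    b"\x1f\x8b": "gzip.ksy",
}
TRIE = build_trie(SIGNATURE_MAP)

# Disabled by default, as proposed: containers then stay plain blobs.
AUTODETECT_ENABLED = False


def detect(blob, enabled=AUTODETECT_ENABLED):
    """Return the best-matching ksy name, or None when disabled or unknown."""
    if not enabled:
        return None
    hits = match(TRIE, blob)
    return hits[0] if hits else None
```

Calling `detect(blob, enabled=True)` walks the blob once, however many signatures are registered. The runtime hook mentioned above could replace `SIGNATURE_MAP` with a mapping discovered by scanning a directory.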

To disable autodetection of some formats we can use #339.

I have thought about auto ... but it is a keyword in C++ (I know about the difference in case) and IMHO is too short.

GreyCat commented 6 years ago

> but it is a keyword in C++

So, that's perfect :) We don't need to generate anything named "auto" for that.

> and IMHO is too short.

That's totally ok too :) Keywords and predefined types shouldn't be too long.