kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby
https://kaitai.io

IDA support #31

Open KOLANICH opened 7 years ago

KOLANICH commented 7 years ago

IDA allows you to mark parts of binaries as data, code, structs, enums, etc. Some binaries contain structs or tables, for example the PnP Expansion Header or the PE format, which can be described in the Kaitai Struct language. It'd be nice to generate IDC/IDAPython scripts from a KS description to parse those tables automatically.

GreyCat commented 7 years ago

Could you show some examples of what should be generated? idapython probably won't be hard to implement.

athre0z commented 7 years ago

IMHO that doesn't really fit together. IDA has a type system allowing users to interactively build C structs and apply them to data in the binary, but the Kaitai format features a more complex grammar (e.g. context-sensitive expressions), and those can't be translated to reusable IDA structs. It'd be possible to apply Kaitai to a specific region of data in the binary, generating IDA structs on the fly, resolving complex expressions to constants, and applying those structs to the disassembly, but that'd flood the IDB with redundant (and incompatible, e.g. when using the HR decompiler) struct and pointer types, essentially making this useless.

It might be interesting for generating file loader plugins, however those usually require some meta info (like what data is code, segments, ...), so it'd probably be easier to use the regular C++ / Python Kaitai compiler and manually build a loader based on those. It might be an idea to simply provide C++ / Python IO implementations that operate on the IDA SDK loader IO functions to support file loader development.

KOLANICH commented 7 years ago

but that'd flood the IDB with redundant (and incompatible, e.g. when using the HR decompiler) struct and pointer types, essentially making this useless.

Why do you think so? It is possible to reuse types. KS generates code which applies the struct at a fixed offset or searches the file for a piece satisfying the constraints. The code then knows the offsets of the substructs and applies each substruct at its offset. An ordinary breadth-first search. All the structs are typed, so it is possible to create corresponding structs in IDA and reuse them.
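The breadth-first walk described here can be sketched in plain Python. This is only an illustration under stated assumptions: the `TYPES` table, the type names in it, and the `walk` helper are hypothetical, every field has a fixed size, and nothing here calls IDA. It shows only how each named type would be visited once per occurrence together with the offset at which it would be applied.

```python
from collections import deque

# Sizes of the primitive types used in this sketch.
SIZES = {"u1": 1, "u2": 2, "u4": 4}

# Hypothetical user-defined types: name -> ordered list of (field id, type).
TYPES = {
    "header": [("magic", "u4"), ("version", "u2")],
    "file": [("hdr", "header"), ("flags", "u1")],
}


def size_of(type_name):
    """Total size in bytes of a primitive or fixed-size user type."""
    if type_name in SIZES:
        return SIZES[type_name]
    return sum(size_of(field_type) for _, field_type in TYPES[type_name])


def walk(root, base=0):
    """Breadth-first list of (offset, type, path) for everything under root."""
    visited = []
    queue = deque([(base, root, root)])
    while queue:
        offset, type_name, path = queue.popleft()
        visited.append((offset, type_name, path))
        if type_name in TYPES:
            cursor = offset
            for field_id, field_type in TYPES[type_name]:
                queue.append((cursor, field_type, path + "." + field_id))
                cursor += size_of(field_type)
    return visited


layout = walk("file")
```

In a real generator each visited entry would become one "apply this struct or data type at this address" call, and the same IDA struct would be reused for every occurrence of a type.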

it'd probably be easier to use the regular C++ / Python Kaitai compiler and manually build a loader based on those.

Bad idea: we are dealing not with data but with metadata; we need to mark up the binary, not just read it. So we have to use IDA functions, which implies we cannot use C++ structs (in fact we can: IDA can parse C++ structs into its own structs, but this does not work very well, or at least did not; I haven't used the latest version).

GreyCat commented 7 years ago

All the structs are typed so it is possible to create corresponding structs in IDA and reuse them.

As far as I know, IDA structure definitions are close to C structs, so they support much less than KS structs can deliver. For example, I believe that it's impossible to create a C struct and an IDA type for something like this:

seq:
  - id: my_len
    type: u4
  - id: my_str
    type: str
    size: my_len
    encoding: UTF-8
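Why this does not map onto a single C struct can be sketched in Python: the size of `my_str` is only known after `my_len` has been read, so every instance has a different total size. The helper name `parse_len_prefixed` is hypothetical, and little-endian byte order is an assumption, since the snippet above does not state an endianness.

```python
import struct


def parse_len_prefixed(buf, offset=0):
    """Read a u4 length (assumed little-endian) and that many UTF-8 bytes.

    Returns (my_len, my_str, offset just past the string).
    """
    (my_len,) = struct.unpack_from("<I", buf, offset)
    start = offset + 4
    raw = buf[start:start + my_len]
    if len(raw) != my_len:
        raise ValueError("buffer ends before the string does")
    return my_len, raw.decode("utf-8"), start + my_len


# Two instances of the same type with different total sizes (6 and 9 bytes).
short = struct.pack("<I", 2) + b"hi"
longer = struct.pack("<I", 5) + b"hello"
short_result = parse_len_prefixed(short)
longer_result = parse_len_prefixed(longer)
```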

r2 support is a whole other world. We've already had a big discussion with r2 team members, and some support code will (hopefully) be coming soon.

KOLANICH commented 7 years ago

As far as I know, IDA structure definitions are close to C structs, so they support much less than KS structs can deliver. For example, I believe that it's impossible to create C struct and IDA type for something like that

If I remember right, Pascal strings are what you mean. IDA has a Pascal string type.

And if you can't express something, you can just create the underlying data types in those places and give them meaningful names.

GreyCat commented 7 years ago

I've found that IDA has indeed introduced some support for custom data types; maybe it would be of some help:

Anyway, all this talk is pointless without some solid examples to illustrate the idea. @KOLANICH, you've probably seen https://github.com/kaitai-io/kaitai_struct/wiki/Adding-support-for-new-target-language; could you show how you envision hello_world.ksy being represented in an IDA target?

athre0z commented 7 years ago

These custom data types aren't well supported in IDA — they only work with the old "Structures" view (that hardly anybody uses nowadays when there is the "Local Types" C-style type system) and are translated to char[] in the "Local Types" view. Also, custom types can only be applied to the last field of a struct, forcing you to split up structs each time you encounter a type that requires logic that exceeds what C structs provide. They most probably won't work with the decompiler either (I don't have an IDA installation with decompiler at hand to test right now).

One could do what @KOLANICH suggested; to generate reusable structs for everything that can be translated to POD C structs and just rename stuff you can't for an invocation on a specific piece of data. Sure, that works, but it isn't especially useful unless you're coercing IDA into being a hex editor / data viewer (and IDA sucks at being a hex-editor, you can't even expand/collapse nested structs nicely).

When you use IDA, you're usually primarily interested in the code and the way the code interacts with data. Unless it's .rdata, such data structures are never accessed directly, but relatively, from the base offset of a more complex structure, often with offsets added up in control structures like loops, so these names won't help you at all. What you need are the offsets to apply to operands to make the disassembly readable, which is possible when IDA knows the C struct. If it's .rdata, compiler- and architecture-specific alignment applies, so you would first have to teach KS about integer types whose size and alignment scale with the target, which isn't something that makes a lot of sense for regular file formats and is usually only relevant for in-memory data.
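The alignment point can be illustrated with Python's `struct` module, which can compute a layout either with the native C alignment of the machine it runs on or with no padding at all. The two-field layout below (one byte followed by a 32-bit integer) is a made-up example, not taken from any format discussed here.

```python
import struct

# "<" means standard sizes with no padding, like a file format described in KS.
packed_size = struct.calcsize("<BI")

# "@" means native sizes and native alignment, like an in-memory C struct.
# On common platforms the 32-bit field is aligned to 4 bytes, so padding
# is inserted after the first byte and the result is typically 8.
native_size = struct.calcsize("@BI")

print("packed:", packed_size, "native:", native_size)
```

The packed size is always 5; the native size depends on the compiler ABI, which is exactly the information a file-format description does not carry.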

I think just providing a mechanism to generate these reusable POD structs in IDA would be the best solution if you guys really want this IDA support. There are multiple ways to do so, the easiest being to just generate C structs into a big Python / IDC string, importing them using idaapi.parse_decls.
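A minimal sketch of that last approach, assuming a generator that only handles fixed-size integer fields: it builds the C declaration text in Python and refuses anything that is not POD. The `KS_TO_C` mapping and the `pod_struct_decl` helper are hypothetical names for illustration; the call to `idaapi.parse_decls` is only mentioned in a comment, because it needs a running IDA.

```python
# Hypothetical mapping from Kaitai integer types to C type names.
KS_TO_C = {
    "u1": "unsigned char",
    "u2": "unsigned short",
    "u4": "unsigned int",
    "u8": "unsigned long long",
    "s1": "signed char",
    "s2": "short",
    "s4": "int",
    "s8": "long long",
}


def pod_struct_decl(name, fields):
    """Build a C struct declaration from a list of (field id, KS type).

    Raises ValueError for any field that has no fixed-size C equivalent.
    """
    lines = ["struct %s {" % name]
    for field_id, ks_type in fields:
        if ks_type not in KS_TO_C:
            raise ValueError("%s.%s: %r is not a POD type" % (name, field_id, ks_type))
        lines.append("    %s %s;" % (KS_TO_C[ks_type], field_id))
    lines.append("};")
    return "\n".join(lines)


decl = pod_struct_decl("hello_world", [("one", "u1")])
# Inside IDA, this text would then be handed to idaapi.parse_decls.
print(decl)
```

Rejecting non-POD fields up front keeps the generated types reusable; anything context-dependent would have to be handled separately, per instance.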

KOLANICH commented 7 years ago

could you show how you envision that hello_world.ksy should be represented in IDA target?

Something like

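// Create the reusable struct type once (untested; old-style IDC API names).
static ksDef_hello_world() {
    extern hello_world_counter;
    auto hndl;
    hello_world_counter = 0;
    hndl = GetStrucIdByName("hello_world");
    if (hndl == -1)
        hndl = AddStrucEx(-1, "hello_world", 0);  // the struct does not exist yet
    AddStrucMember(hndl, "one", 0, FF_BYTE | FF_DATA, -1, 1);
}

// Apply the struct at the given address and give the instance a unique name.
static ksMk_hello_world(offset) {
    extern hello_world_counter;
    auto res;
    MakeNameEx(offset + 0, "hello_world_" + ltoa(hello_world_counter, 16), SN_CHECK | SN_PUBLIC | SN_AUTO);
    res = MakeStructEx(offset + 0, -1, "hello_world");
    hello_world_counter++;
    return res;
}
//ksDef_hello_world();
//ksMk_hello_world(0);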

but it isn't especially useful unless you're coercing IDA into being a hex editor

Maybe. I don't have deep knowledge of IDA; maybe it can be scripted to add support for custom binary formats (the ones we see when selecting the architecture and format: PE, ELF, COFF, etc.).

for

seq:
  - id: my_len
    type: u4
  - id: my_str
    type: str
    size: my_len
    encoding: UTF-8

it should generate something like this (but with error checks):

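// Mark one my_len/my_str pair at the given address (untested; old-style IDC API names).
static ksMk_my_string(offset) {
    extern my_string_counter;  // shared counter; set it to 0 once before the first call
    auto res;
    MakeNameEx(offset + 0, "my_string_" + ltoa(my_string_counter, 16) + "_my_len", SN_CHECK | SN_PUBLIC | SN_AUTO);
    // my_len is a u4, so mark it as a 4-byte dword
    MakeDword(offset + 0);
    MakeNameEx(offset + 4, "my_string_" + ltoa(my_string_counter, 16) + "_my_str", SN_CHECK | SN_PUBLIC | SN_AUTO);
    //MakeArray(offset+4, Dword(offset+0));
    // read the 4-byte length with Dword() and mark the string up to its end address
    res = MakeStr(offset + 4, offset + 4 + Dword(offset + 0));
    //check here that the string is in utf-8
    my_string_counter++;
    return res;
}
//ksMk_my_string(0);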

Note that I haven't tested this code, so it can be incorrect.

In the first case we used a struct. In the second case there was only a single variable-length item, so it isn't worth creating a struct and we marked the data directly. Each piece of code that creates a data structure is wrapped in a function for easier reuse and better code structure.

Why not use IDAPython and the features mentioned above? It'd be cool, but I have no experience with it at all, so I don't think I can write a good example right now; maybe someone in this thread can.

GreyCat commented 7 years ago

@KOLANICH Thanks, now we have something to work with :)

The next big question we need to discuss: any ideas how we would test that? I believe the ideal test would be to run the actual IDA, load our sample binaries, run these scripts, and then export the IDA database into some text-like form to compare with the expected result. But that, at the very least, is hindered by the fact that we can't easily run IDA on Travis. There is a demo available, but it probably can't run these scripts anyway, and "you will not be able to save your work" probably means that you can't export anything either.

KOLANICH commented 7 years ago

But that, at the very least, is hindered by the fact that we can't run IDA at Travis easily.

I guess we can install Wine, but it seems it is not possible to install IDA Pro there legally. And it isn't worth reimplementing an IDC interpreter, because IDC has been replaced by IDAPython almost entirely. But I guess it would not be that hard to create a mock of the IDAPython functions we use, built on the Python KS runtime. So KS should be translated not to IDC but to IDAPython. Then we parse specially crafted files with both the Python-generated code and the IDA-generated code and verify that they return the same results. The problem is that IDA is meant to do a different thing than KS and can behave differently from the Python KS runtime, and the mock would have to reproduce that behavior.
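A rough sketch of what such a mock could look like, in plain Python. Everything here is hypothetical: `MockIda` and its method names are invented for illustration and are not the real IDAPython API, and `mark_my_string` stands in for code the compiler would generate for the `my_len`/`my_str` example. The mock records each marking call as a line of text, which gives a text dump to compare against an expected result without running IDA.

```python
import struct


class MockIda:
    """Hypothetical stand-in that records marking calls instead of editing an IDB."""

    def __init__(self, data):
        self.data = data
        self.calls = []

    def read_u4(self, ea):
        # Little-endian is an assumption of this sketch.
        return struct.unpack_from("<I", self.data, ea)[0]

    def set_name(self, ea, name):
        self.calls.append("name 0x%x %s" % (ea, name))

    def make_dword(self, ea):
        self.calls.append("dword 0x%x" % ea)

    def make_string(self, ea, end_ea):
        self.calls.append("string 0x%x-0x%x" % (ea, end_ea))


def mark_my_string(ida, offset, index):
    """What generated code for the my_len/my_str example might do."""
    prefix = "my_string_%x" % index
    ida.set_name(offset, prefix + "_my_len")
    ida.make_dword(offset)
    length = ida.read_u4(offset)
    ida.set_name(offset + 4, prefix + "_my_str")
    ida.make_string(offset + 4, offset + 4 + length)


sample = struct.pack("<I", 3) + b"abc"
ida = MockIda(sample)
mark_my_string(ida, 0, 0)
dump = "\n".join(ida.calls)
print(dump)
```

A test would then compare `dump` with a stored expected file; how closely the mock has to imitate real IDA behavior is the open question raised above.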

export IDA database somehow into some text-like form

It is possible; IDA definitely has an IDC API for that.