New built-in type: opcode

KOLANICH commented 7 years ago

An opcode is just a byte of machine code, of any architecture, intended to be executed directly by the underlying hardware. How to treat it is implementation-defined. Virtual machines' opcodes are not covered by this and should be treated separately. We need this type in order to support disassemblers' and decompilers' APIs as targets.

GreyCat commented 7 years ago

Opcodes are just sequences of bytes. Why would you need a special built-in data type for them? I've already done several opcode disassemblers in the current paradigm without any major problems.

KOLANICH commented 7 years ago

Because we need to distinguish between data bytes (which shouldn't be marked as code, because they aren't code and disassembling them produces garbage) and code bytes (which in most cases should be passed to a disassembler, or called, rather than parsed).

For example, suppose we are reverse-engineering a binary of a known format. What does any binary have? Entry points. And we want a decompiler to be called automatically on the code those entry points reference.

GreyCat commented 7 years ago

The general pattern for disassembling opcodes is something along the lines of:

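meta:
  id: opcode_stream  # hypothetical id, added so the spec compiles on its own
  endian: le         # assumption: u4 needs a default endianness, le picked arbitrarily
seq: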
  - id: opcodes
    type: opcode
    repeat: eos
# Simplest case, 1 header byte defines instruction
enums:
  code:
    0x02: add
    0x03: sub
    0x08: jmp
types:
  opcode:
    seq:
      - id: code
        type: u1
        enum: code
      - id: args
        type:
          switch-on: code
          cases:
            add: two_addresses
            sub: two_addresses
            jmp: one_address
  one_address:
    seq:
      - id: addr
        type: u4
  two_addresses:
    seq:
      - id: addr1
        type: u4
      - id: addr2
        type: u4

What's wrong with it, and what do you propose in its place?
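For reference, here is a hand-written Python sketch of roughly what a parser for this spec does. It is only an illustration, not actual KS compiler output: the names `Code`, `ARG_COUNT` and `parse_opcodes` are made up, and little-endian addresses are an assumption.

```python
import struct
from enum import IntEnum


class Code(IntEnum):
    ADD = 0x02
    SUB = 0x03
    JMP = 0x08


# Number of u4 address arguments following each opcode byte.
ARG_COUNT = {Code.ADD: 2, Code.SUB: 2, Code.JMP: 1}


def parse_opcodes(data: bytes):
    """Parse the whole buffer (like repeat: eos) into (code, args) tuples."""
    pos = 0
    result = []
    while pos < len(data):
        code = Code(data[pos])  # raises ValueError on an unknown opcode byte
        pos += 1
        n = ARG_COUNT[code]
        # "<" means little-endian (an assumption); each address is a u4.
        args = struct.unpack_from(f"<{n}I", data, pos)
        pos += 4 * n
        result.append((code, args))
    return result


# One "add" with two addresses, followed by one "jmp" with one address.
sample = (
    bytes([0x02]) + struct.pack("<II", 0x10, 0x20)
    + bytes([0x08]) + struct.pack("<I", 0x100)
)
parsed = parse_opcodes(sample)
```

Note that this decodes every byte of the stream as code, which is exactly the behaviour being questioned in this thread.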

KOLANICH commented 7 years ago

What's wrong is that it doesn't make IDA disassemble it. I mean that when we open an image in IDA, we don't want to disassemble the image ourselves; we want the disassembler to do the job. The opcode type is needed just to tell a recursive descent disassembler to start disassembly from that offset.

GreyCat commented 7 years ago

So, basically, you're proposing some reserved type name that would be valid only for IDA output? I guess that could be done. Actually, you can specify any non-existing type name right now and that will generate a call to a constructor of that type, assuming that it exists. Then you can either plug in a KS-generated class with that name, or write your own (one that calls IDA's disassembly or whatever). Would that work for you?
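A minimal sketch of the "write your own" option, under stated assumptions: `Opcode`, `start_disassembly` and `code_regions` are hypothetical names, the hook only records the region instead of calling a real disassembler API, and a plain `io.BytesIO` stands in for the stream wrapper that KS-generated code passes around.

```python
import io

# Code regions reported so far; a real hook would call the
# disassembler's scripting API instead (not shown here).
code_regions = []


def start_disassembly(offset: int, size: int) -> None:
    """Hypothetical hook standing in for "call IDA disassembly or whatever"."""
    code_regions.append((offset, size))


class Opcode:
    """Hand-written stand-in for the type name referenced in the spec.

    It takes the stream in its constructor but decodes nothing itself:
    it only reports where the code starts and how long it is.
    """

    def __init__(self, stream: io.BytesIO):
        self.offset = stream.tell()
        rest = stream.read()  # consume everything up to the end of the stream
        self.size = len(rest)
        start_disassembly(self.offset, self.size)


stream = io.BytesIO(b"\x7fHDR" + b"\x90" * 12)
stream.seek(4)  # pretend ordinary parsing already consumed a 4-byte header
region = Opcode(stream)
```

The point of the design is that the spec stays the same and only the class behind the name changes per tool.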

KOLANICH commented 7 years ago

> you're proposing some reserved type name that would be valid only for IDA output

For any scriptable-enough disassembler.

> Actually, you can specify any non-existing type name right now and that will generate a call to a constructor of that type, assuming that it exists.

We can use a custom type name for that, but it becomes a mess when different decompilers use different type names: we would have to write a separate ksy file for every decompiler. So we need to reserve a type name for this purpose and every backend generating disassembler-specific scripts should treat that name as code. Once a type name is reserved, there is no reason not to make that type built-in.

GreyCat commented 7 years ago

> For any scriptable-enough disassembler.

Well, so far none of them are listed as targets, and even IDA support is pretty vague.

> So we need to reserve a type name for this purpose and every backend generating disassembler-specific scripts should treat that name as code.

It's pointless to reserve anything right now, until we know more details about it. Invoking a disassembler properly might need lots of extra parameters. At the very least, I imagine one might want to pass CPU / architecture / endianness config. Sometimes it's useful to pass symbols, debugging information, or something like that to the disassembler, etc. If you want some standard here, it should probably support all that stuff.

KOLANICH commented 7 years ago

> Invoking a disassembler properly might need lots of extra parameters.

Yes, but the KS compiler has nothing to do with them. It's the disassembler's liability to disassemble the binary, not KS's. In that context, KS-generated code is only used to partition the binary into zones, so that the disassembler and decompiler can use the already-known structure to produce code that is more meaningful to a human. In fact, you can even apply the KS-compiled script first, check whether it has matched (signature + constraints) some piece of the binary and, if it has, read some parsed value, deduce the architecture from that value and ask the disassembler to disassemble. This works because we don't need to know all the architecture parameters to parse the file; we only need to know the endianness and the structure description from the docs.
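A rough sketch of that "parse first, then ask the disassembler" flow. Everything specific here is an assumption made up for illustration: the `FWIM` magic, the header layout, the machine ids, and the `disassemble` callback, which stands in for whatever the disassembler's scripting API offers.

```python
import struct

# Hypothetical header: 4-byte magic, u2 machine id, u4 entry offset,
# all little-endian. None of this comes from a real format.
MAGIC = b"FWIM"
MACHINES = {0x0001: "arm", 0x0002: "mips"}


def parse_header(data: bytes):
    """Return (arch, entry) if the header matches, else None."""
    if len(data) < 10 or data[:4] != MAGIC:
        return None  # signature check failed
    machine, entry = struct.unpack_from("<HI", data, 4)
    if machine not in MACHINES or entry >= len(data):
        return None  # constraint check failed
    return MACHINES[machine], entry


def import_image(data: bytes, disassemble) -> bool:
    """Parse first, then hand the deduced parameters to the disassembler."""
    header = parse_header(data)
    if header is None:
        return False
    arch, entry = header
    disassemble(arch, entry)  # hypothetical disassembler callback
    return True


calls = []
image = MAGIC + struct.pack("<HI", 0x0002, 10) + b"\x00" * 8
ok = import_image(image, lambda arch, entry: calls.append((arch, entry)))
```

The parser only needs the endianness and the documented layout; the architecture falls out of a parsed field.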

For example, suppose one is reverse engineering firmware dumped from a memory chip. It doesn't have a format recognized by most disassemblers: it is neither COFF, nor ELF, nor Mach-O, nor any other format the disassembler developers have implemented an import script for. Instead, it is in some documented format described in a datasheet or specification. Every executable format has one or more entry points from which the disassembler must start disassembly. These entry points are usually organized into a structure. Examples are interrupt tables, boot sectors, option ROM and PnP headers. There can be other documented tables in the binary, for example the ones controlling peripherals. What do I want? In brief, I want an import script to be generated from a ksy. I want the disassembler to parse the tables, visit every entry point it finds, and disassemble it and every piece of code reachable from it.
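To make the idea concrete, here is a toy sketch of "parse the table, then visit everything reachable". It assumes a made-up image layout (a table of u4 little-endian entry offsets at the start of the image) and reuses the three-instruction toy opcode set from the spec earlier in this thread; `read_vector_table` and `reachable_code` are hypothetical names, and a real disassembler would do this part itself.

```python
import struct

# Toy instruction set: opcode byte -> (mnemonic, number of u4 LE operands).
ISA = {0x02: ("add", 2), 0x03: ("sub", 2), 0x08: ("jmp", 1)}


def read_vector_table(image: bytes, count: int):
    """Entry points: `count` u4 offsets at the start of the image (assumed layout)."""
    return list(struct.unpack_from(f"<{count}I", image, 0))


def reachable_code(image: bytes, entry_points):
    """Recursive descent: follow control flow from every entry point.

    Returns the sorted offsets of all instructions reached.
    """
    seen = set()
    work = list(entry_points)
    while work:
        pc = work.pop()
        while pc < len(image) and pc not in seen:
            op = image[pc]
            if op not in ISA:
                break  # not a known instruction: stop this path
            mnemonic, nargs = ISA[op]
            if pc + 1 + 4 * nargs > len(image):
                break  # truncated instruction
            seen.add(pc)
            args = struct.unpack_from(f"<{nargs}I", image, pc + 1)
            if mnemonic == "jmp":
                work.append(args[0])  # continue at the jump target only
                break
            pc += 1 + 4 * nargs  # add/sub fall through to the next instruction
    return sorted(seen)


table = struct.pack("<2I", 8, 17)
code = (
    bytes([0x02]) + struct.pack("<2I", 0, 4)    # offset 8: add
    + bytes([0x08]) + struct.pack("<I", 26)     # offset 17: jmp 26
    + b"\x02\x02\x02\x02"                       # offset 22: data that looks like code
    + bytes([0x03]) + struct.pack("<2I", 0, 4)  # offset 26: sub
)
image = table + code
entries = read_vector_table(image, 2)
visited = reachable_code(image, entries)
```

The four data bytes at offset 22 happen to look like `add` opcodes, but they are never decoded because no control flow reaches them. That is the data-versus-code distinction I mean.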

GreyCat commented 7 years ago

More often than not, an executable binary format (like ELF, PE, Mach-O, etc.) stores not only the partitioning / section / segment table, but also CPU, endianness and tons of other stuff to be used for disassembly. If you say that it's "disassembler liability" to derive these from thin air, then you might just as well say that it's the disassembler's liability to parse the "partitioning". It's pretty strange to parse only the partitioning and ignore everything else.

> Instead, it is in some documented format described in a datasheet or specification.

I'm by no means an IDA (or any other disassembler) expert, but in 100% of the cases I've seen so far, this has almost nothing to do with file formats. The current KS format is not really suited to describing a memory map. A memory map usually carries much more semantics: you may have some crazy addressing modes (like bank switching), volatile memory cells (which are connected internally or virtually to some ports), interrupt vectors (which you'd expect to declare as an array of pointers, and KS has no concept of a pointer so far), etc.

Besides, we're probably talking about yet another compile target here: I doubt that the code which sets up the segment / section table for IDA would be the same as the code that just marks up some primitive type regions.

Once again, if I were you, I'd start with some proof-of-concept code of the kind you want to get, and it will show whether that's a viable idea or not. Probably something like this bflt example is a good place to start.

kaitai-io/kaitai_struct, issue #66