(Info) Instruction operand types and determining their semantics

Hiya. I'm the developer of the ida-minsc plugin and just heard about your project. I'm glad to see that people are coming to a realization about how much IDAPython sucks. Anyways, I just wanted to point out some things about IDA's instruction operands since they appear next in your todo for the operand module, and they're super-undocumented because I believe they're each specific to the processor module that's used for disassembling.

Grabbing the operand semantics is generally pretty straightforward on the RISC architectures as they're in one of the attributes of the op_t. These indexes (such as in op_t.reg) reference the list in idaapi.ph_get_regnames(), or whatever wrapper you prefer using. For numerical registers (such as ST(4), etc), the value in op_t.reg typically represents just the numerical part of the register.

Intel

idaapi.o_phrase|idaapi.o_displ: op_t.specflag1 essentially contains an enumeration, and op_t.specflag2 contains masks for your different components. In AT&T syntax, your phrases/displ look like offset(base, index, scale). I have the values for specflag1 listed at https://github.com/arizvisa/ida-minsc/blob/master/base/instruction.py#L1428. But for identifying the different components, specflag2 & 7 will contain the base register, and specflag2 & 0x38 is for the index register. The 2 bits in specflag2 & 0xc0 encode the scale (1, 2, 4, 8). op_t.addr is then simply the offset.
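
A minimal sketch of the bit layout described above, written in plain Python so it runs outside of IDA. The name decode_sib is hypothetical, and its argument just stands in for the value you would read from op_t.specflag2.

```python
def decode_sib(specflag2):
    """Split a SIB-style byte into (base, index, scale)."""
    base = specflag2 & 0x07                  # low three bits: base register number
    index = (specflag2 & 0x38) >> 3          # middle three bits: index register number
    scale = 1 << ((specflag2 & 0xC0) >> 6)   # top two bits select 1, 2, 4 or 8
    return base, index, scale


# 0x88 is 0b10_001_000: scale field 2, index register 1, base register 0.
base, index, scale = decode_sib(0x88)
print(base, index, scale)
```

The register numbers that come out still have to be looked up in the register index, and special encodings (such as "no index register") are not handled here.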

AArch

idaapi.o_phrase : Rn is in op_t.reg, op_t.addr contains the offset.
idaapi.o_idpspec0 (trap) : op_t.value is simply your index.
idaapi.o_idpspec1 (list) : op_t.specval is essentially a bitmask of flags where each bit corresponds to whether a register is included in the list or not. Each bit position of the integer maps to an index into the register names.
idaapi.o_idpspec4 (extlist) : op_t.value contains an enumeration that specifies D8, or D8-D9, etc.
idaapi.o_idpspec5+1 (condition): It seems that op.value, op.reg, and op.n are relevant, but I haven't fully done this one yet.
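
For the register list case above, expanding the bitmask could look roughly like this. It is only a sketch: reglist_names is a hypothetical helper, and ARM_CORE is an assumed register ordering used for illustration, where in IDA you would pass the real list from the processor module instead.

```python
def reglist_names(specval, regnames):
    """Return the names whose bit position is set in the specval bitmask."""
    return [name for bit, name in enumerate(regnames) if specval & (1 << bit)]


# Assumed ordering for illustration only: R0-R12, then SP, LR, PC.
ARM_CORE = ["R%d" % i for i in range(13)] + ["SP", "LR", "PC"]

# Bits 0, 4 and 15 are set in 0x8011.
print(reglist_names(0x8011, ARM_CORE))
```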

If you discover any others, I'd be interested in hearing about them and I'm sure the Sark author will as well.

Hello!

Thank you for getting in touch, greatly appreciated :)

Yes, the operand is indeed in my todo-list, it is actually one of the first things I did and it clearly needs to be remade. One of the problems I have with this is that, right now, there is nothing in Bip which is dependent on the processor/architecture. I have not yet decided on how to implement that and in any case it will take some time. If you have some input on how to do that, I would be glad to hear it.

For the usage of the op_t members/optype_t enum, here is what I think I understand.

Intel

For Intel, from what I gather, specflag1 indicates whether the Scale-Index-Base (SIB) byte is used, while specflag2 is directly the value of the SIB byte (see Intel Manual Vol. 2 Chapter 2 "Instruction Format" for the encoding). This seems to match what you are describing.

Here is a dump of some code from the Intel sdk include/intel.hpp which seems to match:

#define segrg           specval_shorts.high
#define SEGREG_IMM      0xFFFF          // this value of segrg means that
                                        // segment selector value is in
                                        // "segsel":
#define segsel          specval_shorts.low
#define hasSIB          specflag1
#define sib             specflag2
#define rex             insnpref        // REX byte for 64-bit mode, or bits from the VEX byte if vexpr()

// Op6 is used for opmask registers in EVEX.
// specflags from Op6 are used to extend insn_t.
#define evex_flags      Op6.specflag2   // bits from the EVEX byte if evexpr()

#define cr_suff         specflag1       // o_crreg: D suffix for cr registers (used for CR8D)

// [...]

inline int sib_base(const insn_t &insn, const op_t &x)                    // get extended sib base
{
  int base = x.sib & 7;
#ifdef __EA64__
  if ( insn.rex & REX_B )
    base |= 8;
#else
  qnotused(insn);
#endif
  return base;
}

inline regnum_t sib_index(const insn_t &insn, const op_t &x)                   // get extended sib index
{
  regnum_t index = regnum_t((x.sib >> 3) & 7);
#ifdef __EA64__
  if ( (insn.rex & REX_X) != 0 )
    index |= 8;
#endif
  if ( is_vsib(insn) )
  {
    if ( (insn.evex_flags & EVEX_V) != 0 )
      index |= 16;
    index = vsib_index_fixreg(insn, index);
  }
  return index;
}

inline int sib_scale(const op_t &x)
{
  int scale = (x.sib >> 6) & 3;
  return scale;
}

// [...] This is followed by other helpers
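
For use from Python, the three helpers above could be ported roughly as follows. This is a sketch under assumptions: the sib and rex arguments stand in for op_t.specflag2 and insn_t.insnpref, the REX_B and REX_X values are assumed to follow the architectural REX byte layout (the SDK defines its own constants), and the VSIB/EVEX handling is left out entirely.

```python
REX_B = 0x1  # assumed value: extends the base register number
REX_X = 0x2  # assumed value: extends the index register number


def sib_base(sib, rex=0):
    """Base register number, extended to 4 bits when REX.B is set."""
    base = sib & 7
    if rex & REX_B:
        base |= 8
    return base


def sib_index(sib, rex=0):
    """Index register number, extended to 4 bits when REX.X is set."""
    index = (sib >> 3) & 7
    if rex & REX_X:
        index |= 8
    return index


def sib_scale(sib):
    """Raw 2-bit scale field, as in the header; the multiplier is 1 << result."""
    return (sib >> 6) & 3


print(sib_base(0x88, REX_B), sib_index(0x88, REX_X), 1 << sib_scale(0x88))
```

Note that, like the header, sib_scale returns the raw field (0 to 3) and not the multiplier itself.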

AArch64

For ARM64, I am sadly not as familiar with the architecture, but in the idasdk the file module/arm/arm.hpp seems to match what you describe with some more precision. Here is the code I think you will find relevant (I am not sure if it is the real implementation):

// Operand types:
#define o_shreg         o_idpspec0         // Shifted register
                                           //  op.reg    - register
#define shtype          specflag2          //  op.shtype - shift type
#define shreg(x)        uchar(x.specflag1) //  op.shreg  - shift register
#define shcnt           value              //  op.shcnt  - shift counter

#define ishtype         specflag2          // o_imm - shift type
#define ishcnt          specval            // o_imm - shift counter

#define secreg(x)       uchar(x.specflag1) // o_phrase: the second register is here
#define ralign          specflag3          // o_phrase, o_displ: NEON alignment (power-of-two bytes, i.e. 8*(1<<a))
                                           // minimal alignment is 16 (a==1)

#define simd_sz         specflag1          // o_reg: SIMD vector element size
                                           // 0=scalar, 1=8 bits, 2=16 bits, 3=32 bits, 4=64 bits, 5=128 bits)
                                           // number of lanes is derived from the vector size (dtype)
#define simd_idx        specflag3          // o_reg: SIMD scalar index plus 1 (Vn.H[i])

// o_phrase: the second register is held in secreg (specflag1)
//           the shift type is in shtype (specflag2)
//           the shift counter is in shcnt (value)

#define o_reglist       o_idpspec1         // Register list (for LDM/STM)
#define reglist         specval            // The list is in op.specval
#define uforce          specflag1          // PSR & force user bit (^ suffix)

#define o_creglist      o_idpspec2         // Coprocessor register list (for CDP)
#define CRd             reg                //
#define CRn             specflag1          //
#define CRm             specflag2          //

#define o_creg          o_idpspec3         // Coprocessor register (for LDC/STC)

#define o_fpreglist     o_idpspec4         // Floating point register list
#define fpregstart      reg                // First register
#define fpregcnt        value              // number of registers; 0: single register (NEON scalar)
#define fpregstep       specflag2          // register spacing (0: {Dd, Dd+1,... }, 1: {Dd, Dd+2, ...} etc)
#define fpregindex      specflag3          // NEON scalar index plus 1 (Dd[x])
#define NOINDEX         (char)254          // no index - all lanes (Dd[])

#define o_text          o_idpspec5         // Arbitrary text stored in the operand
                                           // structure starting at the 'value' field
                                           // up to 16 bytes (with terminating zero)
#define o_cond          o_idpspec5+1       // ARM condition as an operand
                                           // condition is stored in 'value' field

// The processor number of coprocessor instructions is held in cmd.Op1.specflag1:
#define procnum         specflag1

// bits stored in specflag1 for APSR register
#define APSR_nzcv       0x01
#define APSR_q          0x02
#define APSR_g          0x04
// for SPSR/CPSR
#define CPSR_c          0x01
#define CPSR_x          0x02
#define CPSR_s          0x04
#define CPSR_f          0x08
// for banked registers (R8-R12, SP, LR/ELR, SPSR), this flag is set
#define BANKED_MODE     0x80 // the mode is in low 5 bits (arm_mode_t)
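
Going only by the comments on fpregstart, fpregcnt and fpregstep in the header above, expanding an o_fpreglist operand could be sketched like this. The function name is hypothetical, and the arguments are plain integers standing in for op_t.reg, op_t.value and op_t.specflag2; I have not checked this against the real module.

```python
def fpreglist_registers(fpregstart, fpregcnt, fpregstep):
    """Return the register numbers covered by a floating point register list."""
    count = fpregcnt or 1      # a count of 0 means a single register
    stride = fpregstep + 1     # step 0: consecutive, step 1: every second register
    return [fpregstart + i * stride for i in range(count)]


print(fpreglist_registers(8, 2, 0))   # two consecutive registers starting at 8
print(fpreglist_registers(8, 3, 1))   # three registers, skipping every other one
```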

The module folder from the idasdk also contains some information for other processors.

Ahh. Okay. Awesome. Looks like I have a couple more operand types I can now properly implement as I really didn't know about these details since I first had to figure it out during the 6.x series. ;-)

(apologies for the verbosity, especially if there's a language difference)

On-demand versus Hooking

In terms of ways to implement them, you might've already thought of these methods but I'll let you know why I chose my methodology in particular. So the two main ways to distinguish registers and operands are by using hooks (IDP_Hooks.ev_newprc) to detect the processor, or by checking it on-demand during construction of your object.

I specifically chose to use hooking because I wanted to avoid users having to think about objects and documentation. To accomplish this, I wanted to expose registers as a "namespace" where each register was an immutable object (or singleton) that you could use to do explicit comparisons. In this way when the processor module changes, the registers that you now have available (and are used during decoding) will change to what your processor supports. Also since the registers are constants/singletons, you can compare them with is. I inherit each available register from a type, register_t, specifically so that you can test the type with isinstance() when trying to interpret the parts of a decoded operand.
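
To illustrate the idea (this is a simplified sketch and not the actual ida-minsc code), a register namespace with singleton register objects could look like the following. RegisterNamespace, reload and by_index are hypothetical names, and immutability is not enforced here. Inside IDA, something like an IDP_Hooks.ev_newprc handler would be what calls reload with the new processor's register names.

```python
class register_t(object):
    """One register; a single instance exists per name in a namespace."""

    def __init__(self, name, index):
        self.name = name
        self.index = index

    def __repr__(self):
        return "<register %s>" % self.name


class RegisterNamespace(object):
    """Exposes the current processor's registers as attributes."""

    def __init__(self, names=()):
        self._registers = []
        self._by_name = {}
        self.reload(names)

    def reload(self, names):
        # Called whenever the processor changes: rebuild every register object.
        self._registers = [register_t(n, i) for i, n in enumerate(names)]
        self._by_name = {r.name: r for r in self._registers}

    def by_index(self, index):
        # What a decoder would use to map op_t.reg to a register object.
        return self._registers[index]

    def __getattr__(self, name):
        by_name = self.__dict__.get("_by_name", {})
        if name in by_name:
            return by_name[name]
        raise AttributeError(name)


reg = RegisterNamespace(["ax", "cx", "dx"])
print(reg.cx is reg.by_index(1), isinstance(reg.cx, register_t))
```

Since decoding and the namespace hand out the same objects, a user can write `operand_register is reg.cx` without caring about names or indexes.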

As both your plugin, bip, and sark are object-oriented, I believe it'd be easier to implement the register and operand decoding components during construction so that you can avoid having to hook anything and limit the user when the processor for the database changes. This way when a user does something with your instruction type to get a specific operand, you can then construct it when requested, and use it wherever.

Register abstractions

So depending on your preference, you might need to write some abstraction around the extended versions of registers (rcx, ecx) and regular versions of registers (cx) for Intel. This is because IDA only seems to list the 16-bit versions of the registers in its register index. Thus when decoding an operand and using op_t.reg to reference some item in the register index, it'll be up to you to determine whether the item is a 64-bit, 32-bit, or 16-bit version (using get_dtype_size, but maybe there's a proper way). Because of this, it's likely significantly simpler to treat registers as strings, since this way users can use standard string operations for matching register types (which is how people typically interact with operands anyways, with regexes).
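
If you go with strings, turning the 16-bit name from the register index into the sized name could be sketched like this. sized_register_name is a hypothetical helper, the size is in bytes (what you would get from get_dtype_size on the operand's dtype), and only the eight legacy general-purpose registers are handled; byte-sized registers and everything else are left out of scope.

```python
LEGACY_16BIT = ("ax", "cx", "dx", "bx", "sp", "bp", "si", "di")


def sized_register_name(name16, size):
    """Map a 16-bit register name plus an operand size in bytes to its sized name."""
    if name16 not in LEGACY_16BIT:
        return name16              # not one of the registers handled by this sketch
    if size == 2:
        return name16
    if size == 4:
        return "e" + name16
    if size == 8:
        return "r" + name16
    raise ValueError("unsupported operand size: %r" % (size,))


print(sized_register_name("cx", 4), sized_register_name("cx", 8))
```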

Another reason to consider doing registers on-demand is because some architectures (AArch64) don't actually list all of the available registers in the index. So if you do end up defining the registers, keep in mind that they're not always guaranteed to be in the register index and you'll need to specify them yourself.

With operands, I chose to use a named-tuple composed of a register_t and just integers. Some of its members can be a set() such as for reglist operands, and everything that's a non-integer inherits from symbol_t so that users can distinguish parts of the operand. In this, each operand decoder is determined by the processor in use (idaapi.PLFM_386), and the operand type (idaapi.o_displ). A negative side to doing it like this meant that I needed to expose the operand type to the user ('phrase', 'memory', 'immediate', etc). Originally, I didn't think it necessary for a user to know the operand type, but at one point I wanted to filter all calls to all of my functions where one of its parameters came from a particular operand type. So definitely something to consider.
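
A rough sketch of that layout, simplified and not the actual ida-minsc code: decoders are registered per (processor, operand type) pair, return a named tuple, and the decode call also hands back the operand type label. The numeric constants are placeholders for the idaapi ones, and the fields dict is a stand-in for an already-interpreted idaapi.op_t.

```python
import collections

# Placeholders for illustration; inside IDA use idaapi.PLFM_386, idaapi.o_displ, idaapi.o_imm.
PLFM_386, O_DISPL, O_IMM = 0, 4, 5


class symbol_t(object):
    """Base type for every non-integer part of an operand."""


class register_t(symbol_t):
    def __init__(self, name):
        self.name = name


Phrase = collections.namedtuple("Phrase", ["offset", "base", "index", "scale"])
Immediate = collections.namedtuple("Immediate", ["value"])

DECODERS = {}


def decoder(processor, optype, label):
    """Register a decoder for one (processor, operand type) combination."""
    def register(fn):
        DECODERS[processor, optype] = (label, fn)
        return fn
    return register


@decoder(PLFM_386, O_DISPL, "phrase")
def decode_displ(fields):
    return Phrase(fields["addr"], register_t(fields["base"]),
                  register_t(fields["index"]), fields["scale"])


@decoder(PLFM_386, O_IMM, "immediate")
def decode_imm(fields):
    return Immediate(fields["value"])


def decode(processor, optype, fields):
    label, fn = DECODERS[processor, optype]
    return label, fn(fields)


label, op = decode(PLFM_386, O_DISPL,
                   {"addr": 0x10, "base": "bx", "index": "si", "scale": 1})
print(label, [part.name for part in op if isinstance(part, symbol_t)])
```

Returning the label alongside the tuple is what makes it possible to filter by operand type later on.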

Operand decoding abstractions

In an object-oriented approach for decoding operands, I don't think there's any way to work around having to implement a parent-class, and then having a bunch of child-classes inheriting from it which will individually interpret an instruction's idaapi.op_t. Really though, I definitely think this is the best way and not worth trying to work around unless you have a pretty cool reason.

But this way, each child class can then expose the individual properties of the operand and you can implement methods/properties that allow users to identify all the registers that are used by the operand, such as exposing the operand's size, etc. Some other properties you can also expose are whether the operand is being read from vs. written to. If you use strings for your registers, then you also won't need to maintain a relationship with your "register abstraction" as the differing register sizes can be determined by simply checking the operand's size such as when decoding an Intel operand.
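
In an object-oriented library, the parent/child layout could be sketched like this. All class names are hypothetical, registers are plain strings as suggested above, and the raw dict stands in for the fields you would pull out of an idaapi.op_t when the operand is requested.

```python
class Operand(object):
    """Parent class: holds the raw fields and the properties common to every operand."""

    def __init__(self, raw, size, is_read=True, is_written=False):
        self.raw = raw
        self.size = size              # operand size in bytes
        self.is_read = is_read
        self.is_written = is_written

    @property
    def registers(self):
        raise NotImplementedError


class RegisterOperand(Operand):
    @property
    def registers(self):
        return [self.raw["reg"]]


class PhraseOperand(Operand):
    @property
    def registers(self):
        parts = (self.raw.get("base"), self.raw.get("index"))
        return [r for r in parts if r is not None]


class ImmediateOperand(Operand):
    @property
    def registers(self):
        return []


OPERAND_CLASSES = {"reg": RegisterOperand, "phrase": PhraseOperand,
                   "imm": ImmediateOperand}


def make_operand(kind, raw, size, **flags):
    """Construct the right child class on demand for a decoded operand."""
    return OPERAND_CLASSES[kind](raw, size, **flags)


op = make_operand("phrase", {"base": "bx", "index": None}, 4, is_written=True)
print(op.registers, op.size, op.is_written)
```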

Notes

In my opinion the mips, and arm architectures will definitely be a lot more straightforward than Intel's. The only one in arm that might be strange is (idaapi.PLFM_ARM, idaapi.o_mem) as the meaning of the operand is implied and so you'll need to dereference op_t.addr to get the value that the user actually cares about.

Sorry again for the length, hopefully this helps. :-)

(edited to label the topics)

Once you have the core operand type decoding implemented, you might have difficulty with one of the operand representations if you end up deciding to write an abstraction around them. I really had a hell of a time dealing with the structure+offset representation (with 't') for an operand in IDAPython. Function frames and the structure API are pretty much the same, and use a slightly different API when following their xrefs.

So my main issue was having to correlate the offset I received from decoding an instruction operand with the result of idaapi.get_stroff_path(). Once having calculated that offset, then I'd have to recursively descend into the structure to figure out the exact structure members that're being pointed to so that I could return the path (that was traversed) to the user. Then this way, the user would know which structure members, or member of an array are being used by a particular operand.
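
The recursive descent itself, separated from IDA's structure API, could be sketched like this. The layout format is made up for the example: each structure is a list of (name, offset, size, child) tuples where child is either None or a nested structure, and arrays are left out to keep it short.

```python
def member_path(struct, offset):
    """Return (path of member names, remaining offset) for an offset into struct."""
    for name, start, size, child in struct:
        if start <= offset < start + size:
            if child is None:
                return [name], offset - start
            path, rest = member_path(child, offset - start)
            return [name] + path, rest
    raise LookupError("offset %#x is outside of the structure" % offset)


# Hypothetical layouts used only for this example.
POINT = [("x", 0, 4, None), ("y", 4, 4, None)]
RECT = [("flags", 0, 4, None),
        ("topleft", 4, 8, POINT),
        ("bottomright", 12, 8, POINT)]

print(member_path(RECT, 16))
print(member_path(RECT, 6))
```

The leftover offset in the result is what tells the user that the operand points into the middle of a member instead of at its start.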

Essentially, not being able to return an object (and having to return native python types like lists) made it very tedious for me to make it useful for the user, heh. :-/

Netnodes

In some cases, I've also had to look at IDA's supvals/altvals to get the exact information that I wanted from an operand type as in those cases I couldn't find the correct API call. In case you didn't know this, every ID type in IDA is referencing a netnode in your database. So although netnodes can have names like $ funcs, or $ original user (which you end up converting to an ID), the values in idaapi.struc_t.id, idaapi.member_t.id, etc. are really netnode identifiers.

An address in your database is also a netnode identifier, and so both identifiers and addresses in the database occupy the exact same space. You can literally think of an identifier as really being just an address, and every address (with some attributes like a comment, or patched bytes) has a netnode associated with it. Xrefs are then used to link these identifiers/addresses together.

IDA distinguishes identifiers from addresses by setting the top 8-bits to 1s. So, say you have a segment at 0x00000007ffffe000, all identifiers will still be 0xFFxxxxxxxxxxxxxx (64-bit). If you find yourself needing to do "tricky things" in IDA, it's definitely worth it to write some wrappers around IDA's netnode api because it really sucks in its current state and makes it hard to explore.
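
Going by the description above, a check for whether a value is an identifier or an address could be sketched as follows. is_identifier is a hypothetical helper, and it assumes that exactly the top 8 bits being set is what marks an identifier.

```python
def is_identifier(value, bits=64):
    """True if the top 8 bits of a `bits`-wide value are all set."""
    top = 0xFF << (bits - 8)
    return (value & top) == top


print(is_identifier(0xFF00000000000123))   # looks like an identifier
print(is_identifier(0x00000007FFFFE000))   # an ordinary address
```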

This is amazing, thank you!

I will definitely be working on implementing both a register representation and remaking the operand representation in Bip when I have some time.

I knew about the Netnodes API because I indirectly used it for implementing the xrefs, however I had no real reason to look at it in more detail until now, so that is definitely also on the todo list.

Closing this issue as the relevant information has been exchanged. ;-)
