Closed arizvisa closed 4 years ago
Hello!
Thank you for getting in touch, greatly appreciated :)
Yes, the operand is indeed on my todo list; it is actually one of the first things I did, and it clearly needs to be remade. One of the problems I have with this is that, right now, nothing in Bip depends on the processor/architecture. I have not yet decided how to implement that, and in any case it will take some time. If you have any input on how to do that, I would be glad to hear it.
For the usage of the `op_t` members/`optype_t` enum, here is what I think I understand.
For Intel, from what I gather, `specflag1` indicates whether the Scale-Index-Base (SIB) byte is used, while `specflag2` is directly the value of the SIB byte (see Intel Manual Vol. 2, Chapter 2, "Instruction Format", for the encoding). This seems to match what you are describing.
Here is a dump of some code from the Intel SDK's `include/intel.hpp` which seems to match:
```cpp
#define segrg      specval_shorts.high
#define SEGREG_IMM 0xFFFF              // this value of segrg means that
                                       // segment selector value is in
                                       // "segsel":
#define segsel     specval_shorts.low
#define hasSIB     specflag1
#define sib        specflag2
#define rex        insnpref            // REX byte for 64-bit mode, or bits from the VEX byte if vexpr()

// Op6 is used for opmask registers in EVEX.
// specflags from Op6 are used to extend insn_t.
#define evex_flags Op6.specflag2       // bits from the EVEX byte if evexpr()

#define cr_suff    specflag1           // o_crreg: D suffix for cr registers (used for CR8D)

// [...]

inline int sib_base(const insn_t &insn, const op_t &x) // get extended sib base
{
  int base = x.sib & 7;
#ifdef __EA64__
  if ( insn.rex & REX_B )
    base |= 8;
#else
  qnotused(insn);
#endif
  return base;
}

inline regnum_t sib_index(const insn_t &insn, const op_t &x) // get extended sib index
{
  regnum_t index = regnum_t((x.sib >> 3) & 7);
#ifdef __EA64__
  if ( (insn.rex & REX_X) != 0 )
    index |= 8;
#endif
  if ( is_vsib(insn) )
  {
    if ( (insn.evex_flags & EVEX_V) != 0 )
      index |= 16;
    index = vsib_index_fixreg(insn, index);
  }
  return index;
}

inline int sib_scale(const op_t &x)
{
  int scale = (x.sib >> 6) & 3;
  return scale;
}

// [...] This is followed by other helpers
```
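As a rough Python counterpart to the helpers above, the sketch below splits a raw SIB byte into its three fields. The `Sib` tuple and `decode_sib` name are made up for illustration, the `REX_B`/`REX_X` values follow the Intel REX prefix bit layout (assumed here to match the SDK constants), and the VSIB/EVEX handling from `sib_index` is left out.

```python
from collections import namedtuple

# Hypothetical container for the decoded fields; not part of IDA's API.
Sib = namedtuple("Sib", ["base", "index", "scale"])

REX_B = 0x01  # extends the base register (Intel REX prefix, bit 0)
REX_X = 0x02  # extends the index register (Intel REX prefix, bit 1)

def decode_sib(sib, rex=0):
    """Split a raw SIB byte into (base, index, scale multiplier)."""
    base = sib & 7                   # low 3 bits
    index = (sib >> 3) & 7           # middle 3 bits
    scale = 1 << ((sib >> 6) & 3)    # top 2 bits encode 1, 2, 4 or 8
    if rex & REX_B:
        base |= 8
    if rex & REX_X:
        index |= 8
    return Sib(base, index, scale)
```

Unlike `sib_scale`, which returns the raw 2-bit field, this sketch returns the multiplier directly since that is usually what a user wants to see.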
For ARM64, I am sadly not as familiar with the architecture, but in the idasdk the file `module/arm/arm.hpp` seems to match what you describe with some more precision. Here is the code I think you will find relevant (I am not sure if it is the real implementation):
```cpp
// Operand types:
#define o_shreg     o_idpspec0           // Shifted register
                                         //  op.reg    - register
#define shtype      specflag2            //  op.shtype - shift type
#define shreg(x)    uchar(x.specflag1)   //  op.shreg  - shift register
#define shcnt       value                //  op.shcnt  - shift counter

#define ishtype     specflag2            // o_imm - shift type
#define ishcnt      specval              // o_imm - shift counter

#define secreg(x)   uchar(x.specflag1)   // o_phrase: the second register is here
#define ralign      specflag3            // o_phrase, o_displ: NEON alignment (power-of-two bytes, i.e. 8*(1<<a))
                                         // minimal alignment is 16 (a==1)

#define simd_sz     specflag1            // o_reg: SIMD vector element size
                                         // 0=scalar, 1=8 bits, 2=16 bits, 3=32 bits, 4=64 bits, 5=128 bits)
                                         // number of lanes is derived from the vector size (dtype)
#define simd_idx    specflag3            // o_reg: SIMD scalar index plus 1 (Vn.H[i])

// o_phrase: the second register is held in secreg (specflag1)
//           the shift type is in shtype (specflag2)
//           the shift counter is in shcnt (value)

#define o_reglist   o_idpspec1           // Register list (for LDM/STM)
#define reglist     specval              // The list is in op.specval
#define uforce      specflag1            // PSR & force user bit (^ suffix)

#define o_creglist  o_idpspec2           // Coprocessor register list (for CDP)
#define CRd         reg                  //
#define CRn         specflag1            //
#define CRm         specflag2            //

#define o_creg      o_idpspec3           // Coprocessor register (for LDC/STC)

#define o_fpreglist o_idpspec4           // Floating point register list
#define fpregstart  reg                  // First register
#define fpregcnt    value                // number of registers; 0: single register (NEON scalar)
#define fpregstep   specflag2            // register spacing (0: {Dd, Dd+1,... }, 1: {Dd, Dd+2, ...} etc)
#define fpregindex  specflag3            // NEON scalar index plus 1 (Dd[x])
#define NOINDEX     (char)254            // no index - all lanes (Dd[])

#define o_text      o_idpspec5           // Arbitrary text stored in the operand
                                         // structure starting at the 'value' field
                                         // up to 16 bytes (with terminating zero)
#define o_cond      o_idpspec5+1         // ARM condition as an operand
                                         // condition is stored in 'value' field

// The processor number of coprocessor instructions is held in cmd.Op1.specflag1:
#define procnum     specflag1

// bits stored in specflag1 for APSR register
#define APSR_nzcv   0x01
#define APSR_q      0x02
#define APSR_g      0x04

// for SPSR/CPSR
#define CPSR_c      0x01
#define CPSR_x      0x02
#define CPSR_s      0x04
#define CPSR_f      0x08

// for banked registers (R8-R12, SP, LR/ELR, SPSR), this flag is set
#define BANKED_MODE 0x80                 // the mode is in low 5 bits (arm_mode_t)
```
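To make the `o_fpreglist` fields a bit more concrete, here is a small Python sketch that expands them into register numbers. It is based only on the comments in the header above (a count of 0 meaning a single register, and a step of 0 meaning consecutive registers); the function name is hypothetical, and how the numbers map to actual register names is not covered.

```python
def expand_fpreglist(fpregstart, fpregcnt, fpregstep):
    """Return the register numbers covered by an o_fpreglist-style operand."""
    count = max(fpregcnt, 1)   # per the header, 0 means a single register
    stride = fpregstep + 1     # step 0 -> Dd, Dd+1, ...; step 1 -> Dd, Dd+2, ...
    return [fpregstart + i * stride for i in range(count)]
```

In IDAPython the three arguments would presumably come from `op.reg`, `op.value`, and `op.specflag2`, following the defines above.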
The `module` folder from the idasdk also contains some information for other processors.
Ahh. Okay. Awesome. Looks like I have a couple more operand types I can now properly implement as I really didn't know about these details since I first had to figure it out during the 6.x series. ;-)
(apologies for the verbosity, especially if there's a language difference)
In terms of ways to implement them, you might've already thought of these methods but I'll let you know why I chose my methodology in particular. So the two main ways to distinguish registers and operands are by using hooks (`IDP_Hooks.ev_newprc`) to detect the processor, or by checking it on-demand during construction of your object.
I specifically chose to use hooking because I wanted to avoid users having to think about objects and documentation. To accomplish this, I wanted to expose registers as a "namespace" where each register was an immutable object (or singleton) that you could use to do explicit comparisons. In this way when the processor module changes, the registers that you now have available (and are used during decoding) will change to what your processor supports. Also since the registers are constants/singletons, you can compare them with `is`. I inherit each available register from a type, `register_t`, specifically so that you can test the type with `isinstance()` when trying to interpret the parts of a decoded operand.
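A minimal sketch of that idea, assuming nothing about how ida-minsc actually implements it: `RegisterNamespace` and `on_processor_change` are hypothetical names, and in a real plugin the latter would be called from a hook on `IDP_Hooks.ev_newprc` with the register list of the new processor.

```python
class register_t(object):
    """Base type for a register; each instance is created once and reused."""
    def __init__(self, name, bits):
        self.name, self.bits = name, bits

    def __repr__(self):
        return "<register {:s}>".format(self.name)

class RegisterNamespace(object):
    """Hypothetical namespace whose attributes are singleton registers."""
    def __init__(self, definitions):
        self._load(definitions)

    def _load(self, definitions):
        # One object per register name, so identity comparison is meaningful.
        self._registers = {name: register_t(name, bits) for name, bits in definitions}

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        try:
            return self.__dict__["_registers"][name]
        except KeyError:
            raise AttributeError(name)

    def on_processor_change(self, definitions):
        """Swap in the registers of a newly selected processor."""
        self._load(definitions)

reg = RegisterNamespace([("eax", 32), ("ecx", 32)])
```

With this, `reg.eax is reg.eax` holds and `isinstance(reg.ecx, register_t)` can be used to pick the register parts out of a decoded operand.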
As both your plugin, bip, and sark are object-oriented, I believe it'd be easier to implement the register and operand decoding components during construction so that you can avoid having to hook anything and limit the user when the processor for the database changes. This way when a user does something with your instruction type to get a specific operand, you can then construct it when requested, and use it wherever.
So depending on your preference, you might need to write some abstraction around the extended versions of registers (`rcx`, `ecx`) and regular versions of registers (`cx`) for Intel. This is because IDA only seems to list the 16-bit versions of the registers in its register index. Thus when decoding an operand and using `op_t.reg` to reference some item in the register index, it'll be up to you to determine whether the item is a 64-bit, 32-bit, or 16-bit version (using `get_dtype_size`, but maybe there's a proper way). Because of this, it's likely significantly simpler to treat registers as strings; this way users can use standard string operations for matching register types (which is how people typically interact with operands anyway, with regexes).
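As a sketch of the string-based approach, the function below widens a 16-bit register name according to the operand size. The `GENERAL_16` list is an assumed subset in Intel encoding order, not something read from IDA, and 8-bit registers and `r8`-`r15` are deliberately ignored.

```python
# Assumed subset of the 16-bit names as a register index might list them.
GENERAL_16 = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]

def register_name(index, size):
    """Return the register name for `index` at an operand `size` in bytes."""
    name = GENERAL_16[index]
    if size == 2:
        return name          # 16-bit: the name as listed
    if size == 4:
        return "e" + name    # 32-bit: ecx, edx, ...
    if size == 8:
        return "r" + name    # 64-bit: rcx, rdx, ...
    raise ValueError("unsupported operand size: {:d}".format(size))
```

In IDAPython, `index` would be `op.reg` and `size` would be the result of calling `get_dtype_size` on the operand's dtype, as mentioned above.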
Another reason to consider doing registers on-demand is because some architectures (`AArch64`) don't actually list all of the available registers in the index. So if you do end up defining the registers, keep in mind that they're not always guaranteed to be in the register index and you'll need to specify them yourself.
With operands, I chose to use a named-tuple composed of a `register_t` and just integers. Some of its members can be a `set()` such as for reglist operands, and everything that's a non-integer inherits from `symbol_t` so that users can distinguish parts of the operand. In this, each operand decoder is determined by the processor in use (idaapi.PLFM_386), and the operand type (idaapi.o_displ). A negative side to doing it like this meant that I needed to expose the operand type to the user ('phrase', 'memory', 'immediate', etc). Originally, I didn't think it necessary for a user to know the operand type, but at one point I wanted to filter all calls to all of my functions where one of its parameters came from a particular operand type. So definitely something to consider.
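The dispatch described here could look roughly like the sketch below. Everything in it is illustrative: the constants are stand-ins for `idaapi.PLFM_386`, `idaapi.o_phrase`, and `idaapi.o_displ`, a plain dict stands in for `op_t`, the `phrase` tuple and `decoder` registry are hypothetical, and the decoder assumes a SIB byte is present.

```python
from collections import namedtuple

# Stand-in constants; in IDAPython these would be taken from idaapi.
PLFM_386, O_PHRASE, O_DISPL = 0, 3, 4

# Hypothetical operand layout: offset(base, index, scale)
phrase = namedtuple("phrase", ["offset", "base", "index", "scale"])

DECODERS = {}

def decoder(processor, optype):
    """Register a decoding function for a (processor, operand type) pair."""
    def register(function):
        DECODERS[processor, optype] = function
        return function
    return register

@decoder(PLFM_386, O_PHRASE)
@decoder(PLFM_386, O_DISPL)
def intel_phrase(op):
    # Assumes the SIB byte is present and stored in specflag2.
    sib = op["specflag2"]
    return phrase(op["addr"], sib & 7, (sib >> 3) & 7, 1 << ((sib >> 6) & 3))

def decode(processor, op):
    """Look up the decoder for this processor and operand type, then run it."""
    return DECODERS[processor, op["type"]](op)
```

Since the tuple type itself says which kind of operand was decoded, exposing names like 'phrase' to the user falls out of this design naturally.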
In an object-oriented approach for decoding operands, I don't think there's any way to work around having to implement a parent-class, and then having a bunch of child-classes inheriting from it which will individually interpret an instruction's `idaapi.op_t`. Really though, I definitely think this is the best way and not worth trying to work around unless you have a pretty cool reason.
But this way, each child class can then expose the individual properties of the operand and you can implement methods/properties that allow users to identify all the registers that are used by the operand, such as exposing the operand's size, etc. Some other properties you can also expose are whether the operand is being read from vs. written to. If you use strings for your registers, then you also won't need to maintain a relationship with your "register abstraction" as the differing register sizes can be determined by simply checking the operand's size such as when decoding an Intel operand.
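A bare-bones version of such a hierarchy might look like this. The class names and constructor arguments are invented for the sketch, registers are plain strings as suggested above, and the read/write properties are omitted.

```python
class Operand(object):
    """Parent class; each child interprets one kind of operand."""
    def __init__(self, size):
        self.size = size  # operand size in bytes

    def registers(self):
        """Names of every register the operand uses."""
        return []

class RegisterOperand(Operand):
    def __init__(self, name, size):
        super(RegisterOperand, self).__init__(size)
        self.name = name

    def registers(self):
        return [self.name]

class PhraseOperand(Operand):
    def __init__(self, base, index, scale, offset, size):
        super(PhraseOperand, self).__init__(size)
        self.base, self.index = base, index
        self.scale, self.offset = scale, offset

    def registers(self):
        # Either part may be absent, so skip anything set to None.
        return [r for r in (self.base, self.index) if r is not None]
```

A caller can then ask any operand for `registers()` or `size` without caring which child class it received.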
In my opinion the mips, and arm architectures will definitely be a lot more straightforward than Intel's. The only one in arm that might be strange is `(idaapi.PLFM_ARM, idaapi.o_mem)` as the meaning of the operand is implied and so you'll need to dereference `op_t.addr` to get the value that the user actually cares about.
Sorry again for the length, hopefully this helps. :-)
(edited to label the topics)
Once you have the core operand type decoding implemented, you might have difficulty with one of the operand representations if you end up deciding to write an abstraction around them. I really had a hell of a time dealing with the structure+offset representation (with 't') for an operand in IDAPython. Function frames and the structure API are pretty much the same, and use a slightly different API when following their xrefs.
So my main issue was having to correlate the offset I received from decoding an instruction operand with the result of `idaapi.get_stroff_path()`. Once having calculated that offset, then I'd have to recursively descend into the structure to figure out the exact structure members that're being pointed to so that I could return the path (that was traversed) to the user. Then this way, the user would know which structure members, or member of an array are being used by a particular operand.
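The recursive descent can be illustrated without IDA at all. In the sketch below, a structure is a hypothetical list of `(offset, size, name, child)` tuples standing in for IDA's structure API, and `member_path` returns the member names traversed to reach a given offset.

```python
def member_path(layout, offset):
    """Return the member names traversed to reach `offset` in a nested layout."""
    for start, size, name, child in layout:
        if start <= offset < start + size:
            if child is None:
                return [name]  # reached a leaf member
            # Descend with the offset made relative to the nested structure.
            return [name] + member_path(child, offset - start)
    raise LookupError("offset {:#x} is not inside any member".format(offset))

# Example layouts: a rectangle made of two points.
POINT = [(0, 4, "x", None), (4, 4, "y", None)]
RECT = [(0, 8, "topleft", POINT), (8, 8, "bottomright", POINT)]
```

For instance, `member_path(RECT, 12)` walks into `bottomright` and then `y`. Arrays and unions would need extra handling that this sketch skips.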
Essentially, not being able to return an object (and having to return native python types like lists) made it very tedious for me to make it useful for the user, heh. :-/
In some cases, I've also had to look at IDA's supvals/altvals to get the exact information that I wanted from an operand type as in those cases I couldn't find the correct API call. In case you didn't know this, every ID type in IDA is referencing a netnode in your database. So although netnodes can have names like `$ funcs`, or `$ original user` (which you end up converting to an ID), the values in `idaapi.struc_t.id`, `idaapi.member_t.id`, etc. are really netnode identifiers.
An address in your database is also a netnode identifier, and so both identifiers and addresses in the database occupy the exact same space. You can literally think of an identifier as really being just an address, and every address (with some attributes like a comment, or patched bytes) has a netnode associated with it. Xrefs are then used to link these identifiers/addresses together.
IDA distinguishes identifiers from addresses by setting the top 8-bits to 1s. So, say you have a segment at 0x00000007ffffe000, all identifiers will still be 0xFFxxxxxxxxxxxxxx (64-bit). If you find yourself needing to do "tricky things" in IDA, it's definitely worth it to write some wrappers around IDA's netnode api because it really sucks in its current state and makes it hard to explore.
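Going only by the description above, telling an identifier from an address is a single mask test. The helper name is made up, and the `bits=32` case is an assumption that the same rule applies to 32-bit databases.

```python
def is_identifier(value, bits=64):
    """True if the top 8 bits of a `bits`-wide value are all set."""
    mask = 0xFF << (bits - 8)
    return (value & mask) == mask
```

With this, the segment address `0x00000007ffffe000` from the example is classified as an address, while anything of the form `0xFFxxxxxxxxxxxxxx` is classified as an identifier.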
This is amazing, thank you!
I will definitely be working on implementing both a register representation and remaking the operand representation in Bip when I have some time.
I knew about the Netnodes API because I indirectly used it for implementing the xrefs; however, I had no real reason to look at it in more detail until now, so that is definitely also on the todo list.
Closing this issue as the relevant information has been exchanged. ;-)
Hiya. I'm the developer of the ida-minsc plugin and just heard about your project. I'm glad to see that people are coming to a realization about how much IDAPython sucks. Anyways, just wanted to point out some things about IDA's instruction operands, since they appear next in your todo for the operand module and they're super-undocumented, I believe because they're each specific to the processor module that's used for disassembling.
Grabbing the operand semantics is generally pretty straightforward on the RISC architectures, as they're in one of the attributes of the `op_t`. These indexes (such as in `op_t.reg`) reference the list in `idaapi.ph_get_regnames()`, or whatever wrapper you prefer using. For numerical registers (such as ST(4), etc.), the value in `op_t.reg` typically represents just the numerical part of the register.

Intel

- `idaapi.o_phrase` | `idaapi.o_displ`: `op_t.specflag1` essentially contains an enumeration, and `op_t.specflag2` contains masks for your different components. In AT&T syntax, your phrases/displ look like `offset(base, index, scale)`. I have the values for specflag1 listed at https://github.com/arizvisa/ida-minsc/blob/master/base/instruction.py#L1428. But for identifying the different components, `specflag2 & 7` will contain the base register, and `specflag2 & 0x38` is for the index register. The 2 bits for `specflag2 & 0xc0` represent the scale (1, 2, 4, 8). `op_t.addr` is then simply the offset.

AArch

- `idaapi.o_phrase`: `Rn` is in `op_t.reg`, and `op_t.addr` contains the offset.
- `idaapi.o_idpspec0` (trap): `op_t.value` is simply your index.
- `idaapi.o_idpspec1` (list): `op_t.specval` is essentially a bitmask of flags where each bit indicates whether a register is included in the list or not. Each bit index of the integer maps to an entry in the register names.
- `idaapi.o_idpspec4` (extlist): `op_t.value` contains an enumeration that specifies D8, or D8-D9, etc.
- `idaapi.o_idpspec5+1` (condition): It seems that `op.value`, `op.reg`, and `op.n` are relevant, but I haven't fully done this one yet.

If you discover any others, I'd be interested in hearing about them and I'm sure the Sark author will as well.
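For the register list case, the bitmask in `op_t.specval` can be turned into names with a few lines of Python. This is only a sketch: `decode_reglist` is a hypothetical helper, and `regnames` is assumed to be the list you would get from `idaapi.ph_get_regnames()` (a made-up list is used below so the example runs outside IDA).

```python
def decode_reglist(specval, regnames):
    """Return the names whose bit is set in the `specval` bitmask."""
    return [name for bit, name in enumerate(regnames) if specval & (1 << bit)]

# Stand-in register names; in IDAPython these would come from the processor module.
names = ["R{:d}".format(i) for i in range(16)]
```

Here `decode_reglist(0x4011, names)` picks out bits 0, 4, and 14 of the mask.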