Change "opcode" output on pdj and add "operands" field.

jerc33 commented 3 years ago

Is your feature request related to a problem? Please describe.

When using the pdj command, the "disasm" and "opcode" elements are almost always the same, like so:

  {
    "offset": 94321603838124,
    "esil": "rsi,rbp,=",
    "refptr": false,
    "fcn_addr": 94321603838112,
    "fcn_last": 94321603844853,
    "size": 3,
    "opcode": "mov rbp, rsi",
    "disasm": "mov rbp, rsi",
    "bytes": "4889f5",
    "family": "cpu",
    "type": "mov",
    "reloc": false,
    "type_num": 9,
    "type2_num": 0
  },

...there are only a few exceptions, like this one:

  {
    "offset": 94321603838144,
    "esil": "rax,0x38,rsp,+,=[8]",
    "refptr": true,
    "fcn_addr": 94321603838112,
    "fcn_last": 94321603844851,
    "size": 5,
    "opcode": "mov qword [rsp + 0x38], rax",
    "disasm": "mov qword [var_38h], rax",
    "bytes": "4889442438",
    "family": "cpu",
    "type": "mov",
    "reloc": false,
    "type_num": 9,
    "type2_num": 0
  },

Also, parsing the "opcode" field itself gets a bit troublesome if, as in my use-case, I write a Python script with rz-pipe that needs to know each and every operand of an instruction, not just the opcode or the complete instruction.

Describe the solution you'd like

What is proposed in this issue is a change to the pdj output, with the addition of a new JSON element. The "opcode" field would give only the instruction opcode, like mov, and the operands would be separated and given in the new element "operands" as a list of strings, like so:

    "opcode": "mov",
    "operands": ["qword", "[rsp + 0x38]", "rax"],
    "disasm": "mov qword [var_38h], rax",

Note that the qword formatting element is included in the operands list. While not being necessarily an operand, it is a very important element in an instruction and the special words describing the length of the data are few and easily managed by the user if they don't need them.
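Until something like this exists in pdj, the proposed split can be approximated on the client side. The following is only a rough sketch: `split_opcode` and `SIZE_WORDS` are hypothetical names, not part of rizin or rz-pipe, and the code assumes Intel-syntax text in which commas appear only between operands.

```python
# Hypothetical client-side helper; not part of rizin or rz-pipe.
# Assumption: Intel syntax, with commas used only to separate operands.
SIZE_WORDS = {"byte", "word", "dword", "qword", "xmmword", "ymmword"}


def split_opcode(text):
    """Split an instruction string into (mnemonic, flat list of operands)."""
    mnemonic, _, rest = text.strip().partition(" ")
    operands = []
    for part in rest.split(","):
        part = part.strip()
        if not part:
            continue
        # Peel a leading size keyword ("qword [rsp + 0x38]") into its own item,
        # matching the layout proposed in this issue.
        head, _, tail = part.partition(" ")
        if head in SIZE_WORDS and tail:
            operands.extend([head, tail.strip()])
        else:
            operands.append(part)
    return mnemonic, operands


print(split_opcode("mov qword [rsp + 0x38], rax"))
```

This kind of string splitting is tied to one syntax flavour and would need a different rule set for AT&T syntax or other architectures.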

Describe alternatives you've considered

A neat alternative to this is to separate the operands into two lists, the ones before the instruction's comma and the ones after it, displaying them as a list of lists, or an array of arrays in JSON parlance, like so:

    "opcode": "mov",
    "operands": [["qword", "[rsp + 0x38]"], ["rax"]],
    "disasm": "mov qword [var_38h], rax",

This is not necessary for my specific use-case right now but might be a valuable feature in the future for me or someone else.

As for using the output of "opcode" or "disasm", I believe that the latter should not be changed because it is the direct output of pd, as you can see here:

0x55c8f5b100b7      64488b042528.  mov rax, qword fs:[0x28]
0x55c8f5b100c0      4889442438     mov qword [var_38h], rax
0x55c8f5b100c5      31c0           xor eax, eax

So, imho, pdj should have a field with the exact same output as pd.

karliss commented 3 years ago

The suggested examples of how operand printing could look are, in my opinion, too close to Intel x86 syntax. This has some problems:

Consuming it would still require quite a bit of Intel x86 specific code, and not even x86 in general, since with AT&T syntax it would be completely different. If the consumer needs to do so much architecture-specific work, it might as well do the splitting itself.
Printing such a result would require quite a bit of disassembly parsing to do the splitting precisely, and printing it that way would then throw away half of the information.
It wouldn't work well for different architectures.

Some alternative approaches:

Use ESIL (or another language for describing full instruction semantics), which includes not only more precise information about all the arguments but also implicit arguments and the semantics of the instruction itself. It isn't affected by differences between Intel and AT&T syntax or by differences between architectures.
Expose src and dst fields. I don't think they are currently available using rzpipe and commands. See the declarations of RzAnalysisOp and RzAnalysisValue. This, in my opinion, would be the closest to the approach suggested above, but without the details of a specific architecture's disassembly syntax. And rizin already has this information.
Use RzAnnotatedCode or something similar, which attaches structured meaning to ranges of the printed disassembly text. This approach would be more suitable for interactive tools building a UI on top of rizin, such as visual mode or Cutter. It would be a less hacky way of replacing names within the disassembly (mov qword [rsp + 0x38], rax -> mov qword [var_38h], rax) and doing other pretty printing.
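As a rough illustration of the first alternative, the ESIL strings that pdj already emits can be tokenized without any knowledge of the disassembly syntax. This is a minimal sketch, assuming ESIL is a comma-separated postfix expression as in the samples in this thread. `esil_tokens` is a hypothetical helper, and its operand/operator classification is a simplification that treats anything that is not a plain name or number as an operator.

```python
import re

# Plain register-like names and numeric literals count as operands here;
# everything else ("+", "=", "=[8]", "$z", ...) is treated as an operator.
# This classification is an assumption made for illustration only.
_OPERAND = re.compile(r"^(?:0x[0-9a-fA-F]+|\d+|[A-Za-z_]\w*)$")


def esil_tokens(esil):
    """Split an ESIL string into (operands, operators), preserving order."""
    tokens = [t for t in esil.split(",") if t]
    operands = [t for t in tokens if _OPERAND.match(t)]
    operators = [t for t in tokens if not _OPERAND.match(t)]
    return operands, operators


print(esil_tokens("rax,0x38,rsp,+,=[8]"))
```

An empty ESIL string yields two empty lists, so a caller can detect instructions for which no ESIL is available.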

karliss commented 3 years ago

If more stuff gets added to pdj, it might be worth considering adding an option for selecting what information the caller needs, or having multiple commands that print different subsets of information.
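No such selection option is shown in this thread, but a caller can approximate it by filtering the parsed pdj output. The sketch below is purely client-side, and `select_fields` is a hypothetical helper name, not a rizin or rz-pipe function.

```python
def select_fields(entries, fields):
    """Keep only the requested keys from each pdj-style entry (a dict)."""
    return [{k: e[k] for k in fields if k in e} for e in entries]


# Sample entry abbreviated from the pdj output quoted earlier in this issue.
sample = [
    {"offset": 94321603838124, "esil": "rsi,rbp,=", "size": 3,
     "opcode": "mov rbp, rsi", "disasm": "mov rbp, rsi", "bytes": "4889f5"},
]
print(select_fields(sample, ["offset", "opcode", "bytes"]))
```

Filtering after the fact does not reduce the work rizin does to produce the output, which is the part a built-in option could actually save.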

jerc33 commented 3 years ago

@karliss

Expose src and dst fields.

I forgot to mention this. This proposed change was under the assumption that rizin already had this information internally and that it wouldn't take much effort to make it available to the user. That approach you mention sounds rather interesting.

But I don't know how the src and dst fields are formatted, specifically in cases where the instruction has 3 or 4 operands. Here's an example:

  {
    "offset": 140025470670393,
    "esil": "",
    "refptr": false,
    "fcn_addr": 0,
    "fcn_last": 0,
    "size": 8,
    "opcode": "vpsrld xmm15, xmm9, xmmword [rax + 0x1000000]",
    "disasm": "vpsrld xmm15, xmm9, xmmword [rax + 0x1000000]",
    "bytes": "c531d2b800000001",
    "family": "cpu",
    "type": "null",
    "reloc": false,
    "type_num": 0,
    "type2_num": 0
  },

Of course, this is in the mmx range of instructions, so it is not really useful for Function Analysis. But for my use-case, which is very much Data-Flow Analysis at the machine-language level, it is an important piece of information.

Also, notice that there's no ESIL output, making it unreliable for this use case. (Side question, does ESIL support floating point instructions?)

As for RzAnnotatedCode, is there a way to see the information it contains? This sounds interesting as well, especially if it helps Visual Mode and Cutter, both of which I use and would gladly see grow.

And finally, thanks for your reply karliss, it's very much appreciated.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. Considering a lot has probably changed since its creation, we kindly ask you to check again if the issue you reported is still relevant in the current version of rizin. If it is, update this issue with a comment, otherwise it will be automatically closed if no further activity occurs. Thank you for your contributions.

ret2libc commented 3 years ago

Hi! I think right now there is no work happening in this regard. However, there is something called opex, which may be useful in your case.

[0x00006b64]> aoj~{}
[
  {
    "opcode": "xor ebp, ebp",
    "disasm": "xor ebp, ebp",
    "pseudo": "ebp = 0",
    "description": "logical exclusive or",
    "mnemonic": "xor",
    "mask": "ffff",
    "esil": "ebp,rbp,^,0xffffffff,&,rbp,=,$z,zf,:=,$p,pf,:=,31,$s,sf,:=,0,cf,:=,0,of,:=",
    "sign": false,
    "prefix": 0,
    "id": 334,
    "opex": {
      "operands": [
        {
          "size": 4,
          "rw": 3,
          "type": "reg",
          "value": "ebp"
        },
        {
          "size": 4,
          "rw": 1,
          "type": "reg",
          "value": "ebp"
        }
      ],
      "modrm": true
    },
    "addr": 27492,
    "bytes": "31ed",
    "size": 2,
    "type": "xor",
    "esilcost": 0,
    "scale": 0,
    "refptr": 0,
    "cycles": 1,
    "failcycles": 0,
    "delay": 0,
    "stackptr": 0,
    "family": "cpu"
  }
]
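For a script that consumes this output, the operands can be read straight from the "opex" object. The following is a sketch only: `opex_operands` is a hypothetical helper, and the reading of "rw" as a bit mask (1 = read, 2 = write) is an assumption that fits the sample above but is not confirmed in this thread.

```python
def opex_operands(aoj_entry):
    """Return a simplified list of operands from one aoj entry (a dict)."""
    result = []
    for op in aoj_entry.get("opex", {}).get("operands", []):
        rw = op.get("rw", 0)
        result.append({
            "value": op.get("value"),  # may be None for non-register operands
            "type": op.get("type"),
            "size": op.get("size"),
            "read": bool(rw & 1),      # assumption: bit 0 means "read"
            "write": bool(rw & 2),     # assumption: bit 1 means "write"
        })
    return result


# Entry abbreviated from the aoj output above.
entry = {
    "opcode": "xor ebp, ebp",
    "mnemonic": "xor",
    "opex": {"operands": [
        {"size": 4, "rw": 3, "type": "reg", "value": "ebp"},
        {"size": 4, "rw": 1, "type": "reg", "value": "ebp"},
    ]},
}
for operand in opex_operands(entry):
    print(operand)
```

With rz-pipe, the entries would presumably come from parsing the JSON output of the aoj command instead of a hard-coded dict.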

jerc33 commented 2 years ago

Hi @ret2libc, that looks very much like what I need. I'm terribly sorry for letting this issue become stale.

I've been playing around with aoj and opex, and as far as I can tell it is a perfect fit for what I need, and more. For my part, this issue/feature request can be marked as closed.

Thank you very much for all your help.

ret2libc commented 2 years ago

No problem! Happy to help

rizinorg / rizin

Change "opcode" output on pdj and add "operands" field. #284