Strange injection of an extra byte in sub operations with python binding

Hi,

I'm experiencing a strange issue when assembling SUB operations. If the immediate operand contains any byte of 0x80 or over, keystone appears to insert an extra 0xc2 before that byte in the bytecode. I've set up a quick test here:

# Set up the test
from binascii import hexlify
from capstone import *
from keystone import *
cs = Cs(CS_ARCH_X86, CS_MODE_32)
ks = Ks(KS_ARCH_X86, KS_MODE_32)

# Test1 - works fine.
test1_ok = "".join(map(lambda x: chr(x), ks.asm("sub eax, 0x7f7f7f7f")[0])).encode()
print("Assembled: {}".format(hexlify(test1_ok)))
print("Disassembled:")
for i in cs.disasm(test1_ok, 0x1000):
    print("0x{:02X}: {} {}".format(i.address, i.mnemonic, i.op_str))

# Test2 - injects 0xc2 before each byte. This is the same from 0x80 to 0xff.
test2_nok = "".join(map(lambda x: chr(x), ks.asm("sub eax, 0x80808080")[0])).encode()
print("Assembled: {}".format(hexlify(test2_nok)))
print("Disassembled:")
for i in cs.disasm(test2_nok, 0x1000):
    print("0x{:02X}: {} {}".format(i.address, i.mnemonic, i.op_str))

Assembled: b'2d7f7f7f7f'
Disassembled:
0x1000: sub eax, 0x7f7f7f7f
Assembled: b'2dc280c280c280c280'
Disassembled:
0x1000: sub eax, 0x80c280c2
0x1005: ret 0xc280

I've also noticed that if I try to sub ebx, 0x7f7f7f7f, the bytecode also has a 0xc2 in it. I believe this is because the correct bytecode would be 81 eb 7f 7f 7f 7f, which itself contains bytes of 0x80 or over (0x81 and 0xeb). I've checked the expected encodings against a few other assemblers.

The version of keystone engine I have installed is:

keystone-engine (0.9.1-3)                    - Keystone assembler engine
  INSTALLED: 0.9.1.post3
  LATEST:    0.9.1-3

Hope this helps! And kudos for an awesome library.

ADDENDUM:

Looking into this a bit more, it seems that the source of the issue is on the Python side rather than in the assembler itself. This is the output from kstool:

$ kstool x32 "sub eax, 0x80808080"
sub eax, 0x80808080 = [ 2d 80 80 80 80 ]

keystone is working fine. The tests in the OP are simply incorrect in how they construct bytestrings:

test2_nok = "".join(map(lambda x: chr(x), ks.asm("sub eax, 0x80808080")[0])).encode()

Here, ks.asm correctly returns the encoding [0x2d, 0x80, 0x80, 0x80, 0x80] as the first element of its result. Using chr and "".join(), the snippet then interprets this as a list of Unicode code points, i.e. it constructs the string '\u002d\u0080\u0080\u0080\u0080', and encodes it as UTF-8. Every Unicode code point above 0x7f becomes a multibyte sequence in UTF-8, and code points 0x80 to 0xbf in particular are encoded as 0xc2 followed by the original byte value; this is where the 0xc2 bytes come from.

>>> '\u002d\u0080\u0080\u0080\u0080'.encode('utf-8')
b'-\xc2\x80\xc2\x80\xc2\x80\xc2\x80'

The correct way to convert a list of integers into a bytes object (on Python 3) is to use the bytes constructor:

>>> bytes(ks.asm("sub eax, 0x80808080")[0])
b'-\x80\x80\x80\x80'
