Open boborjan2 opened 4 days ago
What’s your benchmark code and did you try dev branch?
Hi, I use this code: https://github.com/qemu/qemu/blob/master/tests/tcg/i386/test-i386.c in a loop of 100000 iterations, embedded in a 32-bit Windows exe that is loaded into Unicorn. It is compiled with -O0, with the printfs omitted. I guess a simpler example would be more welcome? Yes, I use today's tip of the dev branch. Btw I profiled it with gprof; this is the top of the flat profile (with all the PRs mentioned in the original post applied):
% time  cumulative seconds  self seconds  name
29.34   0.49                0.49          helper_lookup_tb_ptr_x86_64
17.37   0.78                0.29          qht_lookup_custom
12.57   0.99                0.21          tb_htable_lookup_x86_64
 9.58   1.15                0.16          cpu_exec_x86_64
 8.38   1.29                0.14          tb_lookup_cmp
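One way to read the gprof output above: four of the five listed functions (helper_lookup_tb_ptr, qht_lookup_custom, tb_htable_lookup, tb_lookup_cmp) belong to QEMU's translation-block lookup path. The following is a minimal sketch that adds up their shares. The struct, the array and the function names are made up for illustration; only the numbers come from the profile.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row of a gprof flat profile (hypothetical type, for illustration). */
struct profile_row {
    const char *name;
    double percent; /* "% time" column */
};

/* Rows copied from the gprof output quoted above. */
const struct profile_row gprof_rows[] = {
    {"helper_lookup_tb_ptr_x86_64", 29.34},
    {"qht_lookup_custom", 17.37},
    {"tb_htable_lookup_x86_64", 12.57},
    {"cpu_exec_x86_64", 9.58},
    {"tb_lookup_cmp", 8.38},
};
const size_t gprof_row_count = sizeof(gprof_rows) / sizeof(gprof_rows[0]);

/* Sum the "% time" of every row whose name contains `needle`. */
double share_matching(const struct profile_row *r, size_t n, const char *needle)
{
    double total = 0.0;
    assert(r != NULL && needle != NULL);
    for (size_t i = 0; i < n; i++) {
        if (strstr(r[i].name, needle) != NULL) {
            total += r[i].percent;
        }
    }
    return total;
}
```

Calling share_matching(gprof_rows, gprof_row_count, "lookup") gives about 67.7, so roughly two thirds of the sampled time in this profile is spent finding translation blocks.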
Unicorn doesn't have system emulation, how do you deal with syscalls?
The syscalls that are needed for these simple executables are implemented using int3 hooks. No hooks are active during the benchmark, btw. Printfs are macroed out.
I will try to extract the test code loop into a stand-alone .c file to make it easier to reproduce.
I don’t have any specific clue without the concrete benchmark code. Maybe you can profile the slowest version and look for the bottlenecks.
I extracted a subset of the test suite and injected it into the shellcode sample. I reduced the test to this simple case:
static inline void test_bsx(void) __attribute__((always_inline));
static inline void test_bsx(void)
{
    TEST_BSX(bsrw, "w", 0);
    TEST_BSX(bsrw, "w", 0x12340128);
    TEST_BSX(bsfw, "w", 0);
    TEST_BSX(bsfw, "w", 0x12340128);
    TEST_BSX(bsrl, "k", 0);
    TEST_BSX(bsrl, "k", 0x00340128);
    TEST_BSX(bsfl, "k", 0);
    TEST_BSX(bsfl, "k", 0x00340128);
}
void test2(void)
{
    for (int i = 0; i < 20000000; i++) {
        test_bsx();
    }
}
I compiled it with -O0, extracted the machine code of test2 into a C array, and loaded it into the shellcode sample:
#include <stdio.h>
#include <string.h>
#include <unicorn/unicorn.h>
const uint8_t test_code[276] = {
0x55, 0x89, 0xE5, 0x83, 0xEC, 0x70, 0xC7, 0x45, 0xFC, 0x00, 0x00, 0x00, 0x00, 0xE9, 0xF1, 0x00,
0x00, 0x00, 0xC7, 0x45, 0xF8, 0x00, 0x00, 0x00, 0x00, 0x8B, 0x4D, 0xF8, 0x31, 0xC0, 0xBA, 0x78,
0x56, 0x34, 0x12, 0x66, 0x0F, 0xBD, 0xD1, 0x0F, 0x94, 0xC0, 0x89, 0x55, 0xF4, 0x89, 0x45, 0xF0,
0xC7, 0x45, 0xEC, 0x28, 0x01, 0x34, 0x12, 0x8B, 0x4D, 0xEC, 0x31, 0xC0, 0xBA, 0x78, 0x56, 0x34,
0x12, 0x66, 0x0F, 0xBD, 0xD1, 0x0F, 0x94, 0xC0, 0x89, 0x55, 0xE8, 0x89, 0x45, 0xE4, 0xC7, 0x45,
0xE0, 0x00, 0x00, 0x00, 0x00, 0x8B, 0x4D, 0xE0, 0x31, 0xC0, 0xBA, 0x78, 0x56, 0x34, 0x12, 0x66,
0x0F, 0xBC, 0xD1, 0x0F, 0x94, 0xC0, 0x89, 0x55, 0xDC, 0x89, 0x45, 0xD8, 0xC7, 0x45, 0xD4, 0x28,
0x01, 0x34, 0x12, 0x8B, 0x4D, 0xD4, 0x31, 0xC0, 0xBA, 0x78, 0x56, 0x34, 0x12, 0x66, 0x0F, 0xBC,
0xD1, 0x0F, 0x94, 0xC0, 0x89, 0x55, 0xD0, 0x89, 0x45, 0xCC, 0xC7, 0x45, 0xC8, 0x00, 0x00, 0x00,
0x00, 0x8B, 0x4D, 0xC8, 0x31, 0xC0, 0xBA, 0x78, 0x56, 0x34, 0x12, 0x0F, 0xBD, 0xD1, 0x0F, 0x94,
0xC0, 0x89, 0x55, 0xC4, 0x89, 0x45, 0xC0, 0xC7, 0x45, 0xBC, 0x28, 0x01, 0x34, 0x00, 0x8B, 0x4D,
0xBC, 0x31, 0xC0, 0xBA, 0x78, 0x56, 0x34, 0x12, 0x0F, 0xBD, 0xD1, 0x0F, 0x94, 0xC0, 0x89, 0x55,
0xB8, 0x89, 0x45, 0xB4, 0xC7, 0x45, 0xB0, 0x00, 0x00, 0x00, 0x00, 0x8B, 0x4D, 0xB0, 0x31, 0xC0,
0xBA, 0x78, 0x56, 0x34, 0x12, 0x0F, 0xBC, 0xD1, 0x0F, 0x94, 0xC0, 0x89, 0x55, 0xAC, 0x89, 0x45,
0xA8, 0xC7, 0x45, 0xA4, 0x28, 0x01, 0x34, 0x00, 0x8B, 0x4D, 0xA4, 0x31, 0xC0, 0xBA, 0x78, 0x56,
0x34, 0x12, 0x0F, 0xBC, 0xD1, 0x0F, 0x94, 0xC0, 0x89, 0x55, 0xA0, 0x89, 0x45, 0x9C, 0x90, 0x83,
0x45, 0xFC, 0x01, 0x81, 0x7D, 0xFC, 0xFF, 0x2C, 0x31, 0x01, 0x0F, 0x8E, 0x02, 0xFF, 0xFF, 0xFF,
0x90, 0x90, 0xC9, 0xC3,
};
// memory address where emulation starts
#define ADDRESS 0x1000000
#define MIN(a, b) ((a) < (b) ? (a) : (b))
static void test_i386(void)
{
    uc_engine *uc;
    uc_err err;
    int r_esp = ADDRESS + 0x200000; // ESP register: top of the mapped region
    printf("Emulate i386 code\n");
    // Initialize emulator in X86-32bit mode
    err = uc_open(UC_ARCH_X86, UC_MODE_32, &uc);
    if (err) {
        printf("Failed on uc_open() with error returned: %u\n", err);
        return;
    }
    // map 2MB memory for this emulation
    uc_mem_map(uc, ADDRESS, 2 * 1024 * 1024, UC_PROT_ALL);
    // write machine code to be emulated to memory
    // (test_code is a byte array, not a string, so there is no trailing NUL to skip)
    if (uc_mem_write(uc, ADDRESS, test_code, sizeof(test_code))) {
        printf("Failed to write emulation code to memory, quit!\n");
        return;
    }
    // initialize machine registers
    uc_reg_write(uc, UC_X86_REG_ESP, &r_esp);
    // emulate with no timeout and no instruction limit; stop at the `leave`
    // two bytes before the end of test_code, so the final `leave; ret`
    // is not executed
    err = uc_emu_start(uc, ADDRESS, ADDRESS + sizeof(test_code) - 2, 0, 0);
    if (err) {
        printf("Failed on uc_emu_start() with error returned %u: %s\n", err,
               uc_strerror(err));
    }
    printf("\n>>> Emulation done.\n");
    uc_close(uc);
}
int main(int argc, char **argv, char **envp)
{
    test_i386();
    return 0;
}
The performance differences are approximately the same as above, or even worse. I benchmark it with "time ./shellcode". Unicorn2 is compiled with "cmake -DUNICORN_ARCH=x86 -DCMAKE_C_FLAGS="-march=native -O3" ." as above.
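"time ./shellcode" also counts process start-up, uc_open() and memory mapping. If only the emulation itself is of interest, the call can be timed from inside the program. Here is a minimal sketch assuming a POSIX system with clock_gettime(); elapsed_seconds and time_call are hypothetical helper names, not part of Unicorn.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L /* for clock_gettime() under strict -std= modes */
#endif
#include <assert.h>
#include <time.h>

/* Difference between two timestamps, in seconds. */
double elapsed_seconds(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec) +
           (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Run fn() once and return the wall-clock time it took, in seconds. */
double time_call(void (*fn)(void))
{
    struct timespec start, end;
    assert(fn != NULL);
    clock_gettime(CLOCK_MONOTONIC, &start);
    fn();
    clock_gettime(CLOCK_MONOTONIC, &end);
    return elapsed_seconds(start, end);
}
```

In the sample above this could be used as printf("%.3f s\n", time_call(test_i386)); in main(). To time only uc_emu_start(), the two clock_gettime() calls would have to be placed directly around that call instead.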
We have been using Unicorn 1 for a while and are in the process of switching to Unicorn 2 because of some bugs that are already fixed there, among other reasons. I have performed a simple benchmark (using QEMU's test-i386.c without the printfs, run a few thousand times in a loop). Unicorn2 (branch dev) is compiled with 'cmake -DUNICORN_ARCH=x86 -DCMAKE_C_FLAGS="-march=native -O3" .' to enable all the optimization we can get.
Interestingly, Unicorn 1 is more than an order of magnitude faster: ~3.6s vs ~99s on my setup. I checked the milestones: including PR #1838 brings it to 49.15s, and also including #1839 (I am not sure if it's going to be merged) brings it to 4.8s.
This last result is comparable to v1. I assume the difference is caused by using QEMU 5 vs 2.x. What features does v5 have that justify this? (Btw I also tried uc_ctl_tlb_mode(uc, UC_TLB_VIRTUAL), but that just makes execution slower(?))
The benchmark uses UC_ARCH_X86, UC_MODE_32.
Any comment is welcome, Thanks, Viktor
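To put the timings quoted above side by side, here is a small sketch that expresses each Unicorn 2 run as a multiple of the Unicorn 1 time. The macro and function names are made up for illustration; the numbers are the approximate ones from the post.

```c
#include <assert.h>

/* Timings quoted in the post above, in seconds (approximate). */
#define T_UNICORN1     3.6   /* Unicorn 1 */
#define T_UNICORN2_DEV 99.0  /* Unicorn 2, dev branch */
#define T_WITH_PR1838  49.15 /* dev branch with PR #1838 */
#define T_WITH_PR1839  4.8   /* dev branch with #1839 included as well */

/* How many times slower `seconds` is than `baseline_seconds`. */
double slowdown(double seconds, double baseline_seconds)
{
    assert(baseline_seconds > 0.0);
    return seconds / baseline_seconds;
}
```

With these numbers, slowdown(T_UNICORN2_DEV, T_UNICORN1) is 27.5, slowdown(T_WITH_PR1838, T_UNICORN1) is about 13.7, and slowdown(T_WITH_PR1839, T_UNICORN1) is about 1.3.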