Performance difference of Unicorn 1 and 2

boborjan2 commented 4 days ago

We have been using Unicorn 1 for a while and are in the process of switching to Unicorn 2, partly because of bugs that are already fixed there. I ran a simple benchmark: QEMU's test-i386.c with the printfs removed, run a few thousand times in a loop. Unicorn 2 (dev branch) is built with 'cmake -DUNICORN_ARCH=x86 -DCMAKE_C_FLAGS="-march=native -O3" .' to enable every optimization we can get.

Interestingly, Unicorn 1 is more than an order of magnitude faster: ~3.6s vs ~99s on my setup. I checked the milestones and included PR #1838, which brings it down to 49.15s. Also including #1839 (I am not sure whether it is going to be merged) brings it down to 4.8s.
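
To put the timings above in proportion, here is a minimal sketch that turns them into slowdown factors relative to Unicorn 1. The struct, the labels and the function names are illustrative only; the numbers are the ones quoted in this comment.

```c
#include <assert.h>
#include <stdio.h>

/* One measured configuration: a label and its wall-clock time in seconds. */
struct run {
    const char *label;
    double seconds;
};

/* Timings as reported in this comment (labels are illustrative). */
static const struct run runs[] = {
    {"Unicorn 1", 3.6},
    {"Unicorn 2 dev", 99.0},
    {"Unicorn 2 dev + #1838", 49.15},
    {"Unicorn 2 dev + #1838 + #1839", 4.8},
};

/* Slowdown of a run relative to a baseline, as a plain ratio. */
static double slowdown(double seconds, double baseline_seconds)
{
    assert(baseline_seconds > 0.0);
    return seconds / baseline_seconds;
}

/* Print every configuration with its factor relative to the first entry. */
static void print_slowdowns(void)
{
    size_t n = sizeof(runs) / sizeof(runs[0]);
    for (size_t i = 0; i < n; i++) {
        printf("%-32s %7.2fs  %5.1fx\n", runs[i].label, runs[i].seconds,
               slowdown(runs[i].seconds, runs[0].seconds));
    }
}
```

With these inputs the plain dev branch comes out at roughly 27x the Unicorn 1 time, #1838 alone at roughly 14x, and both PRs together at roughly 1.3x.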

That last figure is comparable to v1. I assume the difference is caused by using QEMU 5 vs 2.x. What features does QEMU 5 have that justify this? (Btw, I also tried uc_ctl_tlb_mode(uc, UC_TLB_VIRTUAL), but that just makes execution slower(?))

The benchmark uses UC_ARCH_X86, UC_MODE_32.

Any comment is welcome, Thanks, Viktor

wtdcode commented 4 days ago

What’s your benchmark code and did you try dev branch?

boborjan2 commented 4 days ago

Hi, I use this code: https://github.com/qemu/qemu/blob/master/tests/tcg/i386/test-i386.c in a loop of 100000 iterations, embedded in a 32-bit Windows exe that is loaded into Unicorn. It is compiled with -O0, with the printfs omitted. I guess a simpler example would be more welcome? Yes, I use today's tip of the dev branch. Btw, I profiled it with gprof; this is the top of the flat profile (with all the PRs mentioned above applied):

% time  cumulative s  self s  name
 29.34          0.49    0.49  helper_lookup_tb_ptr_x86_64
 17.37          0.78    0.29  qht_lookup_custom
 12.57          0.99    0.21  tb_htable_lookup_x86_64
  9.58          1.15    0.16  cpu_exec_x86_64
  8.38          1.29    0.14  tb_lookup_cmp
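
Going by their names, the top entries all belong to the translation-block (TB) lookup path. As a rough mental model only, and as far as I understand QEMU's design, a lookup first tries a small direct-mapped cache indexed by guest PC and falls back to a hash table on a miss. The sketch below is toy code, not QEMU's or Unicorn's implementation; every name, size and hash function in it is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JMP_CACHE_SIZE 256u  /* hypothetical; must be a power of two */
#define HTABLE_SIZE 1024u    /* hypothetical; must be a power of two */

struct tb {
    uint32_t pc;  /* guest address the block was translated from */
    int valid;    /* non-zero once the slot holds a block */
};

struct tb_cache {
    struct tb *jmp_cache[JMP_CACHE_SIZE]; /* level 1: direct-mapped by PC */
    struct tb htable[HTABLE_SIZE];        /* level 2: open-addressing table */
    unsigned long fast_hits;
    unsigned long slow_lookups;
};

static size_t jmp_hash(uint32_t pc) { return pc & (JMP_CACHE_SIZE - 1u); }

/* Multiplicative hash; the top 10 bits select one of 1024 buckets. */
static size_t slow_hash(uint32_t pc)
{
    return (uint32_t)(pc * 2654435761u) >> 22;
}

/* Register a block for pc; returns NULL only if the table is full. */
static struct tb *tb_insert(struct tb_cache *c, uint32_t pc)
{
    size_t h = slow_hash(pc);
    for (size_t i = 0; i < HTABLE_SIZE; i++) {
        struct tb *slot = &c->htable[(h + i) & (HTABLE_SIZE - 1u)];
        if (!slot->valid || slot->pc == pc) {
            slot->pc = pc;
            slot->valid = 1;
            return slot;
        }
    }
    return NULL;
}

/* Two-level lookup: cheap direct-mapped probe, then the hash table. */
static struct tb *tb_lookup(struct tb_cache *c, uint32_t pc)
{
    assert(c != NULL);
    struct tb *tb = c->jmp_cache[jmp_hash(pc)];
    if (tb && tb->pc == pc) {
        c->fast_hits++;
        return tb;
    }
    c->slow_lookups++;
    size_t h = slow_hash(pc);
    for (size_t i = 0; i < HTABLE_SIZE; i++) {
        struct tb *slot = &c->htable[(h + i) & (HTABLE_SIZE - 1u)];
        if (!slot->valid)
            return NULL;                      /* no block translated for pc */
        if (slot->pc == pc) {
            c->jmp_cache[jmp_hash(pc)] = slot; /* refill level 1 */
            return slot;
        }
    }
    return NULL;
}
```

In this model, two blocks whose PCs map to the same level-1 slot (for example 0x1000000 and 0x1000100 here) evict each other, so alternating between them sends every lookup to the hash table. Whether something similar explains the profile above is only a guess; a profile with call counts would help confirm or rule it out.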

wtdcode commented 4 days ago

Unicorn doesn't have system emulation, how do you deal with syscalls?

boborjan2 commented 4 days ago

The syscalls that these simple executables need are implemented using int3 hooks. There are no hooks during the benchmark, btw. Printfs are macroed out.

boborjan2 commented 4 days ago

I'll try to extract the test loop into a stand-alone .c file to make it easier to reproduce.

wtdcode commented 4 days ago

I can't point to anything specific without seeing the concrete benchmark code. Maybe you can pprof the slowest version and look for the bottlenecks.

boborjan2 commented 3 days ago

I extracted a subset of the test suite and injected it into the shellcode sample. I reduced the test to this simple case:

static inline void test_bsx(void) __attribute__((always_inline));
static inline void test_bsx(void)
{
    TEST_BSX(bsrw, "w", 0);
    TEST_BSX(bsrw, "w", 0x12340128);
    TEST_BSX(bsfw, "w", 0);
    TEST_BSX(bsfw, "w", 0x12340128);
    TEST_BSX(bsrl, "k", 0);
    TEST_BSX(bsrl, "k", 0x00340128);
    TEST_BSX(bsfl, "k", 0);
    TEST_BSX(bsfl, "k", 0x00340128);
}
void test2(void)
{
    for(int i = 0; i < 20000000; i++) {
        test_bsx();
    }
}
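
TEST_BSX comes from test-i386.c and its definition is not shown in this thread. Going only by the x86 semantics of BSF/BSR, the eight cases above can be modelled in portable C as follows. This is a sketch for checking expectations, not the macro itself; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Result of a bit scan: ZF is set when the (truncated) source is zero,
 * in which case index is not meaningful. */
struct bsx_result {
    int zf;
    unsigned index;
};

/* Keep only the low 16 bits for the "w" forms, all 32 for the "l" forms. */
static uint32_t truncate_operand(uint32_t val, unsigned width)
{
    assert(width == 16 || width == 32);
    return width == 16 ? (val & 0xFFFFu) : val;
}

/* BSF: index of the lowest set bit. */
static struct bsx_result bsf_model(uint32_t val, unsigned width)
{
    struct bsx_result r = {0, 0};
    val = truncate_operand(val, width);
    if (val == 0) {
        r.zf = 1;
        return r;
    }
    while (!(val & 1u)) {
        val >>= 1;
        r.index++;
    }
    return r;
}

/* BSR: index of the highest set bit. */
static struct bsx_result bsr_model(uint32_t val, unsigned width)
{
    struct bsx_result r = {0, 0};
    val = truncate_operand(val, width);
    if (val == 0) {
        r.zf = 1;
        return r;
    }
    while (val >>= 1)
        r.index++;
    return r;
}
```

For the inputs used in test_bsx, the 16-bit forms see only 0x0128, so bsrw yields index 8 and bsfw yields index 3; on 0x00340128 the 32-bit forms yield 21 and 3. The zero inputs set ZF in all four variants.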

I compiled it with -O0, extracted the code of test2 into a C array, and loaded it into the shellcode sample:

#include <stdint.h>
#include <stdio.h>
#include <unicorn/unicorn.h>
const uint8_t test_code[276] = {
    0x55, 0x89, 0xE5, 0x83, 0xEC, 0x70, 0xC7, 0x45, 0xFC, 0x00, 0x00, 0x00, 0x00, 0xE9, 0xF1, 0x00,
    0x00, 0x00, 0xC7, 0x45, 0xF8, 0x00, 0x00, 0x00, 0x00, 0x8B, 0x4D, 0xF8, 0x31, 0xC0, 0xBA, 0x78,
    0x56, 0x34, 0x12, 0x66, 0x0F, 0xBD, 0xD1, 0x0F, 0x94, 0xC0, 0x89, 0x55, 0xF4, 0x89, 0x45, 0xF0,
    0xC7, 0x45, 0xEC, 0x28, 0x01, 0x34, 0x12, 0x8B, 0x4D, 0xEC, 0x31, 0xC0, 0xBA, 0x78, 0x56, 0x34,
    0x12, 0x66, 0x0F, 0xBD, 0xD1, 0x0F, 0x94, 0xC0, 0x89, 0x55, 0xE8, 0x89, 0x45, 0xE4, 0xC7, 0x45,
    0xE0, 0x00, 0x00, 0x00, 0x00, 0x8B, 0x4D, 0xE0, 0x31, 0xC0, 0xBA, 0x78, 0x56, 0x34, 0x12, 0x66,
    0x0F, 0xBC, 0xD1, 0x0F, 0x94, 0xC0, 0x89, 0x55, 0xDC, 0x89, 0x45, 0xD8, 0xC7, 0x45, 0xD4, 0x28,
    0x01, 0x34, 0x12, 0x8B, 0x4D, 0xD4, 0x31, 0xC0, 0xBA, 0x78, 0x56, 0x34, 0x12, 0x66, 0x0F, 0xBC,
    0xD1, 0x0F, 0x94, 0xC0, 0x89, 0x55, 0xD0, 0x89, 0x45, 0xCC, 0xC7, 0x45, 0xC8, 0x00, 0x00, 0x00,
    0x00, 0x8B, 0x4D, 0xC8, 0x31, 0xC0, 0xBA, 0x78, 0x56, 0x34, 0x12, 0x0F, 0xBD, 0xD1, 0x0F, 0x94,
    0xC0, 0x89, 0x55, 0xC4, 0x89, 0x45, 0xC0, 0xC7, 0x45, 0xBC, 0x28, 0x01, 0x34, 0x00, 0x8B, 0x4D,
    0xBC, 0x31, 0xC0, 0xBA, 0x78, 0x56, 0x34, 0x12, 0x0F, 0xBD, 0xD1, 0x0F, 0x94, 0xC0, 0x89, 0x55,
    0xB8, 0x89, 0x45, 0xB4, 0xC7, 0x45, 0xB0, 0x00, 0x00, 0x00, 0x00, 0x8B, 0x4D, 0xB0, 0x31, 0xC0,
    0xBA, 0x78, 0x56, 0x34, 0x12, 0x0F, 0xBC, 0xD1, 0x0F, 0x94, 0xC0, 0x89, 0x55, 0xAC, 0x89, 0x45,
    0xA8, 0xC7, 0x45, 0xA4, 0x28, 0x01, 0x34, 0x00, 0x8B, 0x4D, 0xA4, 0x31, 0xC0, 0xBA, 0x78, 0x56,
    0x34, 0x12, 0x0F, 0xBC, 0xD1, 0x0F, 0x94, 0xC0, 0x89, 0x55, 0xA0, 0x89, 0x45, 0x9C, 0x90, 0x83,
    0x45, 0xFC, 0x01, 0x81, 0x7D, 0xFC, 0xFF, 0x2C, 0x31, 0x01, 0x0F, 0x8E, 0x02, 0xFF, 0xFF, 0xFF,
    0x90, 0x90, 0xC9, 0xC3,
};

// memory address where emulation starts
#define ADDRESS 0x1000000

#define MIN(a, b) ((a) < (b) ? (a) : (b))

static void test_i386(void)
{
    uc_engine *uc;
    uc_err err;

    int r_esp = ADDRESS + 0x200000; // ESP register

    printf("Emulate i386 code\n");

    // Initialize emulator in X86-32bit mode
    err = uc_open(UC_ARCH_X86, UC_MODE_32, &uc);
    if (err) {
        printf("Failed on uc_open() with error returned: %u\n", err);
        return;
    }

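    // map 2MB memory for this emulation
    err = uc_mem_map(uc, ADDRESS, 2 * 1024 * 1024, UC_PROT_ALL);
    if (err) {
        printf("Failed on uc_mem_map() with error returned %u: %s\n", err,
               uc_strerror(err));
        uc_close(uc);
        return;
    }

    // write machine code to be emulated to memory (the whole array; it is
    // not a string literal, so there is no trailing NUL to skip)
    if (uc_mem_write(uc, ADDRESS, test_code, sizeof(test_code))) {
        printf("Failed to write emulation code to memory, quit!\n");
        uc_close(uc);
        return;
    }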

    // initialize machine registers
    uc_reg_write(uc, UC_X86_REG_ESP, &r_esp);

    // emulate with no timeout (0) and no instruction limit (0); the end
    // address is the trailing "leave; ret" (last two bytes), so emulation
    // stops once the loop has finished and before the function returns
    err = uc_emu_start(uc, ADDRESS, ADDRESS + sizeof(test_code) - 2, 0, 0);
    if (err) {
        printf("Failed on uc_emu_start() with error returned %u: %s\n", err,
               uc_strerror(err));
    }

    printf("\n>>> Emulation done.\n");

    uc_close(uc);
}

int main(int argc, char **argv, char **envp)
{
    test_i386();

    return 0;
}

The performance differences are approximately the same as above, or even worse. I benchmark it with "time ./shellcode". Unicorn 2 is compiled with "cmake -DUNICORN_ARCH=x86 -DCMAKE_C_FLAGS="-march=native -O3" ." as above.
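
"time ./shellcode" also counts process start-up, uc_open and the memory mapping. To look at the emulation call in isolation, one option is to time it inside the program. The sketch below uses only standard C clock(), which reports CPU time rather than wall-clock time; cpu_seconds and spin are hypothetical names, and spin is just a stand-in for a callback that would wrap uc_emu_start in the sample above.

```c
#include <assert.h>
#include <time.h>

/* Run fn(arg) once and return the CPU time it consumed, in seconds. */
static double cpu_seconds(void (*fn)(void *), void *arg)
{
    assert(fn != NULL);
    clock_t start = clock();
    fn(arg);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Hypothetical stand-in workload: arg points to an iteration count.
 * In the benchmark this callback would call uc_emu_start() instead. */
static void spin(void *arg)
{
    unsigned long n = *(unsigned long *)arg;
    volatile unsigned long sink = 0; /* volatile keeps the loop from being optimized away */
    for (unsigned long i = 0; i < n; i++)
        sink += i;
}
```

For a single-threaded run like this one, the CPU time should be close to the user+sys figures that time prints, minus the set-up work.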

unicorn-engine / unicorn

Performance difference of Unicorn 1 and 2 #1970