junxzm1990 / x86-sok

Does SOK support inline data? #9

Open thaddywu opened 3 years ago

thaddywu commented 3 years ago

Hi, I'm curious whether SOK can handle inline data.

Although gcc and clang won't place jump tables or constants in .text, real-world projects still contain cases where data and code are interleaved in the .text section. I embedded data into a gap between instructions using inline assembly, and SOK misidentified those inline data bytes (from 0x40055f to 0x4005a7) as instructions. For the attached program, compiled with gcc -O0, SOK even throws an error. The root cause is that SOK treats data bytes as instructions.

For your convenience, I have posted the source code here. The log file and the executable are attached.

#include <stdio.h>
#include <stdlib.h>

int func() {
    int filter;
    asm volatile(
        "  leaq _filter(%%rip), %%rax\n\t"
        "  jmp _out\n\t"
        ".global _filter\n"
        ".type _filter,@object\n"
        "_filter:\n\t"
        ".ascii \""
        "\\040\\000\\000\\000\\000\\000\\000\\000"  // 0. BPF_STMT
        "\\025\\000\\000\\005\\015\\000\\000\\000"  // 1. BPF_JUMP
        "\\040\\000\\000\\000\\020\\000\\000\\000"  // 2. BPF_STMT
        "\\025\\000\\004\\000\\005\\000\\000\\000"  // 3. BPF_JUMP
        "\\025\\000\\003\\000\\012\\000\\000\\000"  // 4. BPF_JUMP
        "\\025\\000\\002\\000\\013\\000\\000\\000"  // 5. BPF_JUMP
        "\\025\\000\\001\\000\\004\\000\\000\\000"  // 6. BPF_JUMP
        "\\006\\000\\000\\000\\000\\000\\377\\177"  // 7. BPF_STMT
        "\\006\\000\\000\\000\\000\\000\\005\\000"  // 8. BPF_STMT
        "\"\n\t"
        "_out:"
        : "=rax"(filter)
        :
        :);
    return filter;
}
int main() {
    printf("%d", func());
    return 0;
}

Even setting the former problem aside, there may be further problems in the handling of overlapping instructions.

Traceback (most recent call last):
  File "./extract_gt/extractBB.py", line 1213, in <module>
    dumpGroundTruth(essInfo, module, outFile, options.binary, options.split)
  File "./extract_gt/extractBB.py", line 804, in dumpGroundTruth
    handleNotIncludedBB(pbModule)
  File "./extract_gt/extractBB.py", line 970, in handleNotIncludedBB
    addedBB2.size = bb.instructions[0].va + bb.instructions[0].size - overlapping_target
ValueError: Value out of range: -5
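
The last frame computes a size as the end of the block's first instruction minus overlapping_target. A result of -5 means the target lay 5 bytes past that end, and the ValueError is presumably raised because the value is assigned to an unsigned field. Below is a minimal sketch of a range check for that computation. The function name split_size and the addresses are hypothetical and are not taken from extractBB.py.

```python
from typing import Optional


def split_size(first_va: int, first_size: int, overlapping_target: int) -> Optional[int]:
    """Return the byte count from overlapping_target to the end of the first
    instruction, or None when the target is not inside that instruction."""
    first_end = first_va + first_size
    if not (first_va <= overlapping_target < first_end):
        # The target lies outside the instruction, so first_end - target would be
        # non-positive or larger than the instruction (the -5 case in the traceback).
        return None
    return first_end - overlapping_target


# Hypothetical addresses: a 2-byte instruction at 0x40055d.
print(split_size(0x40055D, 2, 0x400564))  # target 5 bytes past the end: None
print(split_size(0x40055D, 2, 0x40055E))  # target inside the instruction: 1
```

Returning None leaves it to the caller to decide whether to skip the block or report it, rather than writing a negative size.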

No matter what, thanks so much for your amazing work!

bin2415 commented 3 years ago

Hi, assembly code is a problem for our tools when collecting ground truth, because compilers have no basic block information for it. There are two categories of assembly code: (1) assembly files and (2) assembly code in C files. Our solution is to wrap these regions with specific labels and to perform recursive disassembly along the control flow to identify the code and data within the assembly regions.

For this example, the generated assembly for the inline assembly region is:

        .bbInfo_INLINEB
#APP
# 6 "test.c" 1
          leaq _filter(%rip), %rax
          jmp _out
        .global _filter
.type _filter,@object
_filter:
        .ascii "\040\000\000\000\000\000\000\000\025\000\000\005\015\000\000\000\040\000\000\000\020\000\000\000\025\000\004\000\005\000\000\000\025\000\003\000\012\000\000\000\025\000\002\000\013\000\000\000\025\000\001\000\004\000\000\000\006\000\000\000\000\000\377\177\006\000\000\000\000\000\005\000"
        _out:
# 0 "" 2
#NO_APP
        .bbInfo_INLINEE

We use .bbInfo_INLINEB and .bbInfo_INLINEE to mark the start and end of assembly regions, and we try to identify the code and data regions inside them by recursive disassembly. It seems there is a bug in how this region is handled. Thanks for reporting!
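
To illustrate the idea of recursive disassembly inside a marked region, here is a minimal sketch. It is not the SOK implementation: the decoder is a toy that knows only four encodings (lea disp32(%rip),%rax; jmp rel8; nop; ret), and the region bytes are a shortened, hypothetical stand-in for the example above. Bytes reached by following control flow are code, and the rest is treated as data.

```python
from typing import List, Optional, Set, Tuple


def decode(code: bytes, off: int) -> Optional[Tuple[int, bool, Optional[int]]]:
    """Toy decoder (an assumption, not a real x86 decoder).
    Returns (length, falls_through, jump_target) or None if the byte is unknown."""
    b = code[off]
    if code[off:off + 3] == b"\x48\x8d\x05" and off + 7 <= len(code):
        return 7, True, None                      # lea disp32(%rip), %rax
    if b == 0xEB and off + 2 <= len(code):
        rel = int.from_bytes(code[off + 1:off + 2], "little", signed=True)
        return 2, False, off + 2 + rel            # jmp rel8
    if b == 0x90:
        return 1, True, None                      # nop
    if b == 0xC3:
        return 1, False, None                     # ret
    return None                                   # unknown byte: stop this path


def code_offsets(code: bytes, entry: int = 0) -> Set[int]:
    """Follow control flow from entry and return every byte offset reached as code."""
    seen: Set[int] = set()
    work: List[int] = [entry]
    while work:
        off = work.pop()
        while 0 <= off < len(code) and off not in seen:
            insn = decode(code, off)
            if insn is None:
                break
            length, falls_through, target = insn
            seen.update(range(off, off + length))
            if target is not None:
                work.append(target)
            if not falls_through:
                break
            off += length
    return seen


# lea (7 bytes), jmp over 4 data bytes (2 bytes), 4 data bytes, nop, ret
region = bytes.fromhex("488d0502000000" "eb04" "20000000" "90c3")
data = sorted(set(range(len(region))) - code_offsets(region))
print(data)  # [9, 10, 11, 12]
```

The four bytes skipped by the jump are never reached, so they are classified as data. A linear sweep would instead try to decode them as instructions.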

ZhangZhuoSJTU commented 3 years ago

Hi @bin2415, thanks for your prompt reply. I am curious why we need recursive disassembly to distinguish code from data. Based on my understanding, all the data in the assembly code would have some labels like .ascii or .byte. Would it be easier to use such labels to identify the data and code regions? Please correct me if I am wrong.

I do agree that we need recursive disassembly to get the basic block information, by the way 😆
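
A minimal sketch of this directive-based idea, assuming the compiler's assembly output is available as text: scan the lines between .bbInfo_INLINEB and .bbInfo_INLINEE and collect the data directives. The directive list and the function name data_candidates are assumptions for illustration, and the result is only a set of candidates.

```python
import re
from typing import List, Tuple

# Directives treated as data; this list is an assumption and is not exhaustive.
DATA_DIRECTIVE = re.compile(r"^\s*\.(ascii|asciz|string|byte|word|long|quad)\b")


def data_candidates(asm_text: str) -> List[Tuple[int, str]]:
    """Return (line number, text) for data directives found between
    .bbInfo_INLINEB and .bbInfo_INLINEE."""
    hits: List[Tuple[int, str]] = []
    inside = False
    for lineno, line in enumerate(asm_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped == ".bbInfo_INLINEB":
            inside = True
        elif stripped == ".bbInfo_INLINEE":
            inside = False
        elif inside and DATA_DIRECTIVE.match(line):
            hits.append((lineno, stripped))
    return hits


sample = """\
        .bbInfo_INLINEB
          leaq _filter(%rip), %rax
          jmp _out
_filter:
        .ascii "\\040\\000\\000\\000"
        _out:
        .bbInfo_INLINEE
        .byte 0x90
"""
for lineno, text in data_candidates(sample):
    print(lineno, text)
```

Only the .ascii line inside the markers is reported. The .byte after .bbInfo_INLINEE is ignored because it lies outside the marked region.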

bin2415 commented 3 years ago

Based on my understanding, all the data in the assembly code would have some labels like .ascii or .byte

Hi @ZhangZhuoSJTU, that is a good observation, and most cases follow this rule. But as far as I know, there are some corner cases that do not.

For example, here (link1, link2) are cases where .byte directives encode specific instruction(s). Similar cases also exist in glibc.

ZhangZhuoSJTU commented 3 years ago

I see. I guess it means that if we follow the rule, we would get a sound result for data identification (i.e., without false negatives but with false positives).

So I am wondering whether we can first follow the rule to get a superset of such inline-assembly data (i.e., the regions that follow .byte/.ascii/... and lie between .bbInfo_INLINEB and .bbInfo_INLINEE), and then use linear disassembly to rule out possible instructions (i.e., only a valid basic block occupying the whole data region can be regarded as instructions, and maybe stronger heuristics can be used here, like accepting only padding or ud2).

I prefer linear disassembly to recursive disassembly here. My observation is that the specific instruction(s) represented by .byte should be simple and should not contain control flow transfers (otherwise it would be unreasonable to hardcode them as bytes).

bin2415 commented 3 years ago

I see. I guess it means that if we follow the rule, we would get a sound result for data identification (i.e., without false negatives but with false positives).

I agree with that.

only a valid basic block occupying the whole data region can be regarded as instructions, and maybe stronger heuristics can be used here, like accepting only padding or ud2

This should work. By the way, rep ret is often written as .byte xxxxxxx in some programs.
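
A minimal sketch of the whitelist heuristic discussed above: a region emitted through data directives is reclassified as instructions only if it consists entirely of accepted encodings. The whitelist here (nop as 0x90, ud2 as 0f 0b, rep ret as f3 c3) and the function name is_hardcoded_code are assumptions for illustration. A real check would need a fuller list, for example multi-byte nops.

```python
from typing import List

# Encodings accepted as "instructions hidden in data directives".
# The list is an assumption for illustration: nop, ud2 and rep ret.
ALLOWED: List[bytes] = [b"\x90", b"\x0f\x0b", b"\xf3\xc3"]


def is_hardcoded_code(region: bytes) -> bool:
    """True if the whole region is a back-to-back run of allowed encodings."""
    if not region:
        return False
    off = 0
    while off < len(region):
        for enc in ALLOWED:
            if region.startswith(enc, off):
                off += len(enc)
                break
        else:
            return False  # a byte sequence outside the whitelist: keep it as data
    return True


print(is_hardcoded_code(b"\xf3\xc3"))                        # rep ret: True
print(is_hardcoded_code(b"\x90\x90\x0f\x0b"))                # nop, nop, ud2: True
print(is_hardcoded_code(bytes.fromhex("2000000000000000")))  # first BPF entry: False
```

Requiring the encodings to cover the whole region keeps the check conservative: anything that is only partly decodable stays classified as data.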