Open aeflores opened 1 month ago
Hello, the first two bugs arose from improper handling of duplicate sections.
The __cxx_global_var_init.8 was introduced following the compilation of step-14.cc, and the disassembly results are presented below:
Disassembly of section .text.startup:
0000000000000000 <__cxx_global_var_init.7>:
0: 50 push %rax
1: 80 3d 00 00 00 00 00 cmpb $0x0,0x0(%rip) # 8 <__cxx_global_var_init.7+0x8>
8: 75 2e jne 38 <__cxx_global_var_init.7+0x38>
a: bf 00 00 00 00 mov $0x0,%edi
f: be 01 00 00 00 mov $0x1,%esi
14: e8 00 00 00 00 callq 19 <__cxx_global_var_init.7+0x19>
19: bf 00 00 00 00 mov $0x0,%edi
1e: be 00 00 00 00 mov $0x0,%esi
23: ba 00 00 00 00 mov $0x0,%edx
28: e8 00 00 00 00 callq 2d <__cxx_global_var_init.7+0x2d>
2d: 48 c7 05 00 00 00 00 movq $0x1,0x0(%rip) # 38 <__cxx_global_var_init.7+0x38>
34: 01 00 00 00
38: 58 pop %rax
39: c3 retq
Disassembly of section .text.startup:
0000000000000000 <__cxx_global_var_init.8>:
0: 50 push %rax
1: 80 3d 00 00 00 00 00 cmpb $0x0,0x0(%rip) # 8 <__cxx_global_var_init.8+0x8>
8: 75 4c jne 56 <__cxx_global_var_init.8+0x56>
a: f2 0f 10 05 00 00 00 movsd 0x0(%rip),%xmm0 # 12 <__cxx_global_var_init.8+0x12>
11: 00
12: bf 00 00 00 00 mov $0x0,%edi
17: be 01 00 00 00 mov $0x1,%esi
1c: e8 00 00 00 00 callq 21 <__cxx_global_var_init.8+0x21>
21: 48 c7 05 00 00 00 00 movq $0x0,0x0(%rip) # 2c <__cxx_global_var_init.8+0x2c>
28: 00 00 00 00
2c: 48 c7 05 00 00 00 00 movq $0x0,0x0(%rip) # 37 <__cxx_global_var_init.8+0x37>
33: 00 00 00 00
37: bf 00 00 00 00 mov $0x0,%edi
3c: be 00 00 00 00 mov $0x0,%esi
41: ba 00 00 00 00 mov $0x0,%edx
46: e8 00 00 00 00 callq 4b <__cxx_global_var_init.8+0x4b>
4b: 48 c7 05 00 00 00 00 movq $0x1,0x0(%rip) # 56 <__cxx_global_var_init.8+0x56>
52: 01 00 00 00
56: 58 pop %rax
57: c3 retq
The ground truth is correct after compiling step-14.cc:
BBL# 293 ( 10B) [BBL] - Off:0x0000, Fixups: 2, padding: 0, FallThrough: Y (@Sec .text.startup) [DUP]
BBL# 294 ( 46B) [BBL] - Off:0x000a, Fixups: 7, padding: 0, FallThrough: Y (@Sec .text.startup)
BBL# 295 ( 2B) [FUN&&OBJ] - Off:0x0038, Fixups: 0, padding: 0, FallThrough: N (@Sec .text.startup)
BBL# 296 ( 10B) [BBL] - Off:0x0000, Fixups: 2, padding: 0, FallThrough: Y (@Sec .text.startup) [DUP]
BBL# 297 ( 76B) [BBL] - Off:0x000a, Fixups: 12, padding: 0, FallThrough: Y (@Sec .text.startup) [DUP]
BBL# 298 ( 2B) [FUN&&OBJ] - Off:0x0056, Fixups: 0, padding: 0, FallThrough: N (@Sec .text.startup) [DUP]
BBL# 299 ( 10B) [BBL] - Off:0x0000, Fixups: 2, padding: 0, FallThrough: Y (@Sec .text.startup) [DUP]
BBL# 300 ( 46B) [BBL] - Off:0x000a, Fixups: 7, padding: 0, FallThrough: Y (@Sec .text.startup) [DUP]
BBL# 301 ( 2B) [FUN&&OBJ] - Off:0x0038, Fixups: 0, padding: 0, FallThrough: N (@Sec .text.startup) [DUP]
However, step-14.o contains duplicate sections (.text.startup), and the linker fails to handle them appropriately.
Note that there are four special sections that need to be handled. The compiled dealII.zip after fixing is attached.
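As a rough illustration of the duplicate-section situation, the sketch below groups BBL records from a log like the one above into separate section instances. It assumes (this is a heuristic, not the actual linker fix) that a new instance of a same-named section starts whenever the offset jumps backwards. The names BBL_RE, split_section_instances and SAMPLE_LOG are hypothetical and exist only for this example.

```python
import re

# Matches records such as:
# BBL# 293 ( 10B) [BBL] - Off:0x0000, Fixups: 2, ... (@Sec .text.startup) [DUP]
BBL_RE = re.compile(
    r"BBL#\s*(\d+)\s*\(\s*(\d+)B\).*?Off:0x([0-9a-fA-F]+).*?\(@Sec\s+(\S+)\)"
)


def split_section_instances(log_lines):
    """Group BBL records into runs that plausibly belong to one section instance."""
    instances = []
    prev_name, prev_end = None, None
    for line in log_lines:
        m = BBL_RE.search(line)
        if m is None:
            continue  # not a BBL record
        bbl_id, size = int(m.group(1)), int(m.group(2))
        offset, name = int(m.group(3), 16), m.group(4)
        # Assumption: a section name change, or an offset that falls before the
        # end of the previous block, marks the start of another section instance.
        if name != prev_name or offset < prev_end:
            instances.append({"section": name, "blocks": []})
        instances[-1]["blocks"].append((bbl_id, offset, size))
        prev_name, prev_end = name, offset + size
    return instances


SAMPLE_LOG = [
    "BBL# 293 ( 10B) [BBL] - Off:0x0000, Fixups: 2, padding: 0, FallThrough: Y (@Sec .text.startup) [DUP]",
    "BBL# 294 ( 46B) [BBL] - Off:0x000a, Fixups: 7, padding: 0, FallThrough: Y (@Sec .text.startup)",
    "BBL# 295 ( 2B) [FUN&&OBJ] - Off:0x0038, Fixups: 0, padding: 0, FallThrough: N (@Sec .text.startup)",
    "BBL# 296 ( 10B) [BBL] - Off:0x0000, Fixups: 2, padding: 0, FallThrough: Y (@Sec .text.startup) [DUP]",
    "BBL# 297 ( 76B) [BBL] - Off:0x000a, Fixups: 12, padding: 0, FallThrough: Y (@Sec .text.startup) [DUP]",
    "BBL# 298 ( 2B) [FUN&&OBJ] - Off:0x0056, Fixups: 0, padding: 0, FallThrough: N (@Sec .text.startup) [DUP]",
]

for inst in split_section_instances(SAMPLE_LOG):
    print(inst["section"], [(bbl, hex(off)) for bbl, off, _ in inst["blocks"]])
```

Keeping each instance as its own group means two functions that both start at offset 0x0000 of a .text.startup section are not merged into one address range.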
Hello, for the third case, we identified that the root cause is the incorrect handling of the .inst 0xdeff directive. Instead of generating the instruction udf #255, the GCC compiler emits the .inst directive. Below is an example:
bl bfd_assert
movs r3, #0
ldr r3, [r3, #360]
.bbInfo_BE 0
.inst 0xdeff
.bbInfo_FUNE
We are planning to handle this corner case in the gas assembler.
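To make the encoding concrete: in the Thumb T1 encoding, udf #imm8 is the halfword 0xDE00 | imm8, so .inst 0xdeff produces the same bytes as udf #255. In the ARM A1 encoding, udf #imm16 is 0xE7F000F0 with imm12 in bits 19..8 and imm4 in bits 3..0. The sketch below only illustrates how a raw .inst value could be recognized as an instruction. decode_udf is a hypothetical helper and is not part of gas or of the ground-truth toolchain.

```python
def decode_udf(word, thumb):
    """Return the immediate of a UDF instruction, or None if `word` is not UDF.

    `word` is the raw value given to .inst: a 16-bit halfword for Thumb,
    a 32-bit word for ARM.
    """
    if thumb:
        # Thumb T1: 1101 1110 imm8
        if (word & 0xFF00) == 0xDE00:
            return word & 0xFF
        return None
    # ARM A1: 1110 0111 1111 imm12 1111 imm4, immediate is imm12:imm4
    if (word & 0xFFF000F0) == 0xE7F000F0:
        return ((word >> 4) & 0xFFF0) | (word & 0xF)
    return None


print(decode_udf(0xDEFF, thumb=True))       # the value from the snippet above
print(decode_udf(0xE7F000F0, thumb=False))  # ARM-mode udf #0
```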
Thanks a lot for the quick response! I guess that means some of the binaries need to be rebuilt. Right? Do you have by any chance scripts for rebuilding the complete datasets (https://zenodo.org/record/6566082/)? I've looked around but all the scripts seem to assume the binaries have been prebuilt.
For completeness (for other people using this dataset), some ARM (non-Thumb) binaries also have problems with udf, in that case with udf #0 instead of udf #255. This affects most versions of dwp and one version of ls.gold.
For each binary, I provide a snippet of assembly where "GOOD" instructions are instructions present in the ground truth, and "BAD" instructions are absent. In all of these, "GOOD" instructions must fall through to "BAD" instructions, which tells me something wrong is going on with the ground truth.
GOOD 0x2736b8 mov r3, #0
GOOD 0x2736bc str r3, [sp, #4]
GOOD 0x2736c0 ldr r3, [r3, #8]
BAD 0x2736c4 udf #0
GOOD 0x233578 mov r3, #0
GOOD 0x23357c str r3, [sp, #4]
GOOD 0x233580 ldr r3, [r3, #8]
BAD 0x233584 udf #0
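The GOOD/BAD check described above can be sketched roughly as follows. This assumes a sorted linear disassembly given as (address, size, mnemonic) tuples and a set of ground-truth instruction addresses. NO_FALLTHROUGH is a deliberately small, illustrative set of mnemonics and find_fallthrough_gaps is a hypothetical name. A real check would need a proper per-architecture control-flow classification.

```python
# Illustrative only: mnemonics after which execution does not continue
# to the next address.
NO_FALLTHROUGH = frozenset({"ret", "retq", "jmp", "b", "bx", "udf", "hlt"})


def find_fallthrough_gaps(instructions, ground_truth):
    """Return addresses missing from the ground truth although the preceding
    ground-truth instruction falls through into them."""
    gaps = []
    for (addr, size, mnemonic), (next_addr, _, _) in zip(instructions, instructions[1:]):
        contiguous = addr + size == next_addr
        falls_through = mnemonic not in NO_FALLTHROUGH
        if (contiguous and falls_through
                and addr in ground_truth and next_addr not in ground_truth):
            gaps.append(next_addr)
    return gaps


# First snippet above (ARM mode, 4-byte instructions).
listing = [
    (0x2736B8, 4, "mov"),
    (0x2736BC, 4, "str"),
    (0x2736C0, 4, "ldr"),
    (0x2736C4, 4, "udf"),
]
ground_truth = {0x2736B8, 0x2736BC, 0x2736C0}
print([hex(a) for a in find_fallthrough_gaps(listing, ground_truth)])
```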
Here is a list of all the other binaries where I have found inconsistent ground truth. Some of these might be due to the duplicate section problem or maybe there is something else going on.
GOOD 0x414138 fmov s0, wzr
BAD 0x41413c ret
GOOD 0x49ab00 fmov d0, xzr
BAD 0x49ab04 ret
GOOD 0x4f4344 fmov d0, xzr
BAD 0x4f4348 ret
GOOD 0x587960 fmov d0, xzr
BAD 0x587964 ret
cpu2006/clang_m32_Os/dealII_base.amd64-m32-ccr-Os
GOOD 0x81c1827 add byte ptr [ebx + 0x31042444], cl
BAD 0x81c182d leave
GOOD 0x81c182e mov dword ptr [eax], 0x8240954
libs/clang_m32_Of/libxml2.so
GOOD 0x59045 pop ecx
GOOD 0x59046 mov eax, 0xffffffff
GOOD 0x5904b add ecx, 0x116faf
GOOD 0x59051 mov edx, dword ptr [ecx + 0x9ac]
BAD 0x59057 cmp edx, 0xe
BAD 0x5905a jg 0x590a7
libs/gcc_m32_Os/stlOfXIA
GOOD 0xcf851 lea ecx, [ebx + ebx]
GOOD 0xcf854 shr eax, cl
BAD 0xcf856 and eax, 3
/libs/clang_m32_O1/libpcap.so.1.9.0
GOOD 0x60d5 pop eax
BAD 0x60d6 add eax, 0x33f1f
GOOD 0x60dc cmp ecx, 4
/libs/clang_m32_Of/libpcap.so.1.9.0
GOOD 0xf4b6 mov dword ptr [ecx + 0x30], edx
GOOD 0xf4b9 mov dword ptr [eax + 0x2cc], ecx
BAD 0xf4bf ret
cpu2006/clang_Of/dealII_base.amd64-m64-ccr-Of
In this case there is an address in the ground truth at 0x403be5, but if we look at the disassembly of the unstripped version, there is a symbol at 403be0 and the following assembly code:
0000000000403be0 <_GLOBAL__sub_I_sparse_matrix_ez.float.cc>:
403be0: 50 push %rax
403be1: bf 4b cd 78 00 mov $0x78cd4b,%edi
403be6: e8 95 f3 ff ff callq 402f80 <_ZNSt8ios_base4InitC1Ev@plt>
403beb: bf 90 2f 40 00 mov $0x402f90,%edi
I think it is very unlikely that 0x403be5 is a real instruction.
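A quick way to double-check this kind of claim is to compare the ground-truth address against the instruction boundaries in the listing. The sketch below assumes the objdump listing of the unstripped binary is correct and uses the start addresses and byte lengths shown above. containing_instruction is a hypothetical helper name.

```python
def containing_instruction(address, instructions):
    """Return the (start, length) pair whose byte range covers `address`,
    or None if no listed instruction covers it."""
    for start, length in instructions:
        if start <= address < start + length:
            return start, length
    return None


# (start, length in bytes) taken from the listing above.
listing = [
    (0x403BE0, 1),  # push %rax
    (0x403BE1, 5),  # mov $0x78cd4b,%edi
    (0x403BE6, 5),  # callq ...
    (0x403BEB, 5),  # mov $0x402f90,%edi
]

hit = containing_instruction(0x403BE5, listing)
if hit is not None and hit[0] != 0x403BE5:
    print(f"0x403be5 lies inside the instruction starting at {hit[0]:#x}")
```

Under that assumption, 0x403be5 is the last byte of the 5-byte mov at 0x403be1 rather than an instruction start.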
I've noticed a few places where the ground truth seems to be wrong. This is, I believe, a different case from https://github.com/junxzm1990/x86-sok/issues/28, where capstone was to blame. In the following cases, the extract_gt script reports errors.
intel_executables/cpu2006/clang_Os/dealII_base.amd64-m64-ccr-Os
There is a fragment of code that looks like this:
But 404043: call _ZN16ConstantFunctionILi3EEC1Edj is not part of the ground truth, even though 40403e belongs to it. The extract_gt script produces the following output:
What could cause this instruction to be missing?
milc_base.aarch64-ccr-O2
The 4168c0: ret instruction is missing from the ground truth (even though 4168bc: fmov d0,xzr is present). The extract_gt log:
Same, what could be going wrong here? One thing I notice is that the basic block seems to be duplicated in these two examples. E.g. in the latter, we have BBL#2621 and BBL#2622 with the same boundaries. Could that have something to do with it?
soplex_base.arm32-gcc81-mthumb_final-O2
Here we have the following snippet:
The udf instruction is missing from the ground truth, even though 18234: ldr r3, [r3] is present. This pattern is generated when there is a null pointer access (see e.g. https://embedded.fm/blog/2017/3/6/exceptional-code). The extract_gt log contains the following:
This pattern in particular happens in a lot of the thumb binaries and the udf instruction seems to be missing every time (in close to 200 binaries). I would guess this is a different issue than the previous two.