llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

Clang assembler has bugs in Intel syntax #62080

Open soomin-kim opened 1 year ago

soomin-kim commented 1 year ago

Hi, I'm Soomin Kim from KAIST SoftSec Lab.

We are reporting two x86-64 assembler bugs we found, both of which relate to Intel assembly syntax. We discovered them while manipulating the label names of toy assembly programs.


The first bug:

$ cat ./variant1.s
.intel_syntax noprefix
.text
or:
ret
call or
$ clang -masm=intel -o ./variant1.o -c ./variant1.s
./variant1.s:5:8: error: unknown token in expression
call or
       ^

Clang rejects this program because of the token or. Note that we generated this program from the assembly program below by changing only the label name:

$ cat ./normal1.s
.intel_syntax noprefix
.text
LABEL:
ret
call LABEL
$ clang -masm=intel -o ./normal1.o -c ./normal1.s

Unlike variant1.s, this program is accepted by Clang. However, I could not find any documentation on the Internet explaining why the name or matters. For example, the Wikipedia page on x86 assembly language (https://en.wikipedia.org/wiki/X86_assembly_language) lists several keywords but does not include or.

Surprisingly, or does not cause a problem in AT&T syntax, as the program below shows:

$ cat ./variant2.s
.text
or:
ret
call or
$ clang -masm=att -o ./variant2.o -c ./variant2.s

We believe this is a bug in Clang because (1) Clang accepts the equivalent program written in AT&T syntax, and (2) there is no reason to reject this one. Other uses of or (as an instruction mnemonic, for example) cannot appear as the operand of a call instruction, and the label or is clearly defined.


The second bug:

$ cat ./variant1.s
.intel_syntax noprefix

.data
rsp:
.long 1
.long 2
.long 3
.long 4

.text
lea rax, [rsp] // rsp here is intended to refer to a pointer in .data section
$ clang -masm=intel -o ./variant1.o -c ./variant1.s
$ objdump -d ./variant1.o
./variant1.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <.text>:
   0:   48 8d 04 24             lea    (%rsp),%rax

This bug is similar to the first one but differs in one respect: Clang accepts the program instead of rejecting it. The original assembly program makes the bug easier to understand.

$ cat ./normal.s
.intel_syntax noprefix

.data
LABEL:
.long 1
.long 2
.long 3
.long 4

.text
lea rax, [LABEL]
$ clang -masm=intel -o ./normal.o -c ./normal.s

The original program loads the address of LABEL into the register rax. However, once we rename the label to rsp, which is also a register name, the resulting program has different semantics: the binary code produced by Clang copies the value of the register rsp into rax.

The problem is that the operand is ambiguous between the label rsp and the register rsp, yet Clang silently picks one of them (here, the register) without any diagnostic, so the program behaves in an unintended way.

Likewise, this issue does not occur with AT&T syntax, as the code below shows:

$ cat ./variant2.s
.data
rsp:
.long 1
.long 2
.long 3
.long 4

.text
leaq (rsp), %rax
$ clang -masm=att -o ./variant2.o -c ./variant2.s
$ objdump -d ./variant2.o

./variant2.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <.text>:
   0:   48 8d 04 25 00 00 00 00         lea    0x0,%rax

The label rsp is successfully transformed into a relocation entry in the object file.
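
The difference between the two objdump listings comes down to the SIB byte: 24 selects rsp as the base register, while 25 (with mod = 00) means "no base register, 32-bit displacement", which is where the relocation for the label goes. The following is only a rough sketch to make that visible; describe_lea is a hypothetical helper that handles exactly these two byte patterns and is not a general x86 decoder.

```python
# Sketch: decode the ModRM/SIB bytes of the two `lea` encodings shown above.
# Assumption: input is REX.W + 8D (LEA) with ModRM.mod == 00 and ModRM.rm == 100
# (a SIB byte follows) and no index register. Anything else is rejected.

REGS64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def describe_lea(code: bytes) -> str:
    """Return an Intel-style description of a 64-bit LEA with a SIB byte."""
    if len(code) < 4 or code[0] != 0x48 or code[1] != 0x8D:
        raise ValueError("expected REX.W + LEA (48 8d) followed by ModRM and SIB")
    modrm, sib = code[2], code[3]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0 or rm != 4:
        raise ValueError("sketch only covers mod=00 with a SIB byte")
    index, base = (sib >> 3) & 7, sib & 7
    if index != 4:
        raise ValueError("sketch does not handle an index register")
    dest = REGS64[reg]
    if base == 5:
        # mod=00 and SIB.base=101: no base register, a disp32 follows.
        disp = int.from_bytes(code[4:8], "little", signed=True)
        return f"lea {dest}, [disp32={disp:#x}]"
    return f"lea {dest}, [{REGS64[base]}]"


if __name__ == "__main__":
    print(describe_lea(bytes.fromhex("488d0424")))          # Intel-syntax variant
    print(describe_lea(bytes.fromhex("488d042500000000")))  # AT&T-syntax variant
```

The first byte string is the output for the Intel-syntax program (register base), the second is the output for the AT&T-syntax program (absolute displacement, zero until the relocation is applied).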


We have shown two situations in which a label name confuses Clang. We find them interesting because it is hard to say definitively that Clang is wrong.

We think there are two possibilities: (1) Intel syntax forbids the use of an opcode or register name as a label, or (2) Clang simply mishandles the label.

In one sense, the problem is the ambiguity of Intel syntax itself, which stems from the absence of an official Intel assembly syntax manual. For decades, assemblers have been developed ad hoc without a common standard, so deciding which tokens to allow or reject, or which interpretation to choose, is difficult.

On the other hand, Clang needs to handle both cases, because they reduce its usability and correctness. A user who writes a function named or is rejected by Clang. A user who loads a data pointer named rsp gets a program that loads the stack pointer instead, which can differ from the user's intention.

We suggest that Clang accept the first case, and that it either reject the second case or at least emit a warning for it.
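
Until the assembler handles these cases, a pre-assembly check can at least flag the risky label names. Below is a minimal sketch of such a check, not anything Clang provides: check_labels is a hypothetical helper, the register list is deliberately incomplete, and the operator list is an assumption based on MASM-style named operators (and, or, xor, not, shl, shr, mod) that an Intel-syntax expression parser may treat as operators rather than symbols.

```python
import re

# Assumption: a small, incomplete sample of x86-64 register names.
REGISTER_NAMES = {"rax", "rbx", "rcx", "rdx", "rsi", "rdi", "rbp", "rsp", "rip"}

# Assumption: MASM-style named operators that may be parsed as operators
# (not symbols) inside Intel-syntax expressions.
OPERATOR_NAMES = {"and", "or", "xor", "not", "shl", "shr", "mod"}

# A label definition: optional indentation, an identifier, then a colon.
LABEL_RE = re.compile(r"^\s*([A-Za-z_.$][\w.$]*)\s*:")


def check_labels(source: str):
    """Return (line number, label, kind) for labels with risky names."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = LABEL_RE.match(line)
        if not match:
            continue
        label = match.group(1)
        lowered = label.lower()
        if lowered in REGISTER_NAMES:
            findings.append((lineno, label, "register"))
        elif lowered in OPERATOR_NAMES:
            findings.append((lineno, label, "operator"))
    return findings


if __name__ == "__main__":
    first_bug = ".intel_syntax noprefix\n.text\nor:\nret\ncall or\n"
    second_bug = ".intel_syntax noprefix\n\n.data\nrsp:\n.long 1\n\n.text\nlea rax, [rsp]\n"
    for lineno, label, kind in check_labels(first_bug) + check_labels(second_bug):
        print(f"line {lineno}: label '{label}' collides with a {kind} name")
```

A "register" finding corresponds to the second case (silent change of meaning) and an "operator" finding to the first case (spurious rejection), so a wrapper script could treat the former as an error and the latter as a note.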

llvmbot commented 1 year ago

@llvm/issue-subscribers-backend-x86

KanRobert commented 1 year ago

I plan to refactor the X86 assembly parser soon and will fix this along the way.