We are reporting two x86-64 assembler bugs, both of which relate to Intel assembly syntax. We discovered them while manipulating the label names of toy assembly programs.
The first bug:
$ cat ./variant1.s
.intel_syntax noprefix
.text
or:
ret
call or
$ clang -masm=intel -o ./variant1.o -c ./variant1.s
./variant1.s:5:8: error: unknown token in expression
call or
^
Clang rejects this program because of the token or. Note that this program was generated from an original assembly program by changing only the label name.
Unlike variant1.s, the original program compiles with Clang. However, it was hard to find any explanation on the Internet of why the name or matters. For example, the Wikipedia page on x86 assembly language (https://en.wikipedia.org/wiki/X86_assembly_language) lists several keywords but does not include or.
Surprisingly, or does not cause a problem in AT&T syntax, as the following program shows:
$ cat ./variant2.s
.text
or:
ret
call or
$ clang -masm=att -o ./variant2.o -c ./variant2.s
We think this is a bug in Clang because (1) Clang accepts the equivalent program written in AT&T syntax, and (2) there is no reason to reject this one. Other uses of or (as an instruction mnemonic, for example) cannot appear as the argument of a call instruction, and the label or is clearly defined.
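One possible explanation, offered here as an assumption rather than a confirmed diagnosis: MASM-style Intel syntax allows words such as or, and, and xor to act as operators inside operand expressions, so an Intel-syntax parser may read a bare or as an operator with missing operands. The Python sketch below is a toy model of that idea. The keyword set and the function classify_operand are hypothetical and are not taken from Clang's source.

```python
# Toy model only: this keyword set is an assumption, not Clang's actual table.
INTEL_OPERATOR_WORDS = {"and", "or", "xor", "not", "shl", "shr", "mod"}


def classify_operand(token, syntax):
    """Return how a hypothetical parser might interpret a bare operand token."""
    if syntax == "intel" and token.lower() in INTEL_OPERATOR_WORDS:
        # An operator with no operands cannot form a valid expression.
        return "error: unknown token in expression"
    # In this model, every other bare name is a symbol reference.
    return "symbol"


print(classify_operand("or", "intel"))
print(classify_operand("or", "att"))
print(classify_operand("foo", "intel"))
```

Under this model, only the Intel-syntax parse of call or fails, which matches the behavior observed in variant1.s and variant2.s.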
The second bug:
$ cat ./variant1.s
.intel_syntax noprefix
.data
rsp:
.long 1
.long 2
.long 3
.long 4
.text
lea rax, [rsp] // rsp here is intended to refer to a pointer in .data section
$ clang -masm=intel -o ./variant1.o -c ./variant1.s
$ objdump -d ./variant1.o
./variant1.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <.text>:
0: 48 8d 04 24 lea (%rsp),%rax
This bug is somewhat similar to the first, but differs in one respect: Clang accepts the program and changes its meaning. It is easiest to understand by comparison with the original assembly program, which uses the label LABEL in place of rsp.
The original program loads the address of LABEL into the register rax. However, after we rename the label to rsp, which is also a register name, the resulting program has different semantics: the binary code emitted by Clang copies the value of the register rsp into rax.
The problem here is that the name rsp is ambiguous between the label rsp and the register rsp, yet Clang silently picks one of them (the register, as the disassembly shows) without any diagnostic, so the program behaves in an unintended way.
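The ambiguity can be illustrated with a small Python sketch. It assumes a simplified resolver in which Intel syntax (with noprefix) checks register names before symbols, while AT&T syntax treats only %-prefixed names as registers. The function resolve_name and the abbreviated register list are hypothetical and do not reflect Clang's implementation.

```python
# Toy model only: a short register list, not the full x86-64 register file.
REGISTERS = {"rax", "rbx", "rcx", "rdx", "rsi", "rdi", "rbp", "rsp"}


def resolve_name(operand, syntax):
    """Classify an operand name as a register or a symbol under the given syntax."""
    if syntax == "att":
        # AT&T marks registers with '%', so a bare name is always a symbol.
        if operand.startswith("%") and operand[1:].lower() in REGISTERS:
            return ("register", operand[1:].lower())
        return ("symbol", operand)
    # Intel syntax with noprefix has no sigil, so a register name shadows a label.
    if operand.lower() in REGISTERS:
        return ("register", operand.lower())
    return ("symbol", operand)


print(resolve_name("rsp", "intel"))
print(resolve_name("rsp", "att"))
print(resolve_name("LABEL", "intel"))
```

In this model, a label named rsp is unreachable from Intel syntax, whereas AT&T syntax can always tell the label rsp apart from the register %rsp.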
As with the first bug, this issue does not occur with AT&T syntax: there, the label rsp is correctly turned into a relocation entry in the object file.
We have seen two different situations in which label names confuse Clang. We find them interesting because it is hard to say definitively that Clang is wrong.
We think there are two possibilities:
(1) Intel syntax rejects the use of an opcode name as a label, or
(2) Clang just mishandles the label.
In one sense, the problem is the ambiguity of Intel syntax itself, which stems from the absence of an official Intel assembly syntax manual. For decades, many assemblers have been developed ad hoc without any standard, so it is hard to decide which tokens to allow or reject, or which interpretation to choose.
On the other hand, Clang needs to handle both cases, because they reduce its usability and correctness. A user might want to write a function named or, only to have it rejected by Clang. A user might want to load a data pointer named rsp, but the resulting program loads the stack pointer instead, contrary to the user's intention.
We suggest that Clang should compile the first case, and should either reject the second case or at least emit a warning for it.
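As a rough illustration of the kind of warning suggested above, the following Python sketch scans assembly source for labels whose names collide with register names or operator-like keywords. It is a standalone lint under stated assumptions, not a patch for Clang: the name lists are deliberately incomplete, and find_risky_labels is a hypothetical helper.

```python
import re

# Assumed, incomplete name lists for illustration; a real check needs full tables.
REGISTER_NAMES = {"rax", "rbx", "rcx", "rdx", "rsi", "rdi", "rbp", "rsp"}
OPERATOR_WORDS = {"and", "or", "xor", "not", "shl", "shr", "mod"}

# A label is an identifier at the start of a line, followed by a colon.
LABEL_RE = re.compile(r"^\s*([A-Za-z_.$][\w.$]*)\s*:")


def find_risky_labels(source):
    """Return (line_number, label, reason) for labels that collide with special names."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = LABEL_RE.match(line)
        if not match:
            continue
        label = match.group(1)
        if label.lower() in REGISTER_NAMES:
            findings.append((lineno, label, "register name"))
        elif label.lower() in OPERATOR_WORDS:
            findings.append((lineno, label, "operator keyword"))
    return findings


example = """.intel_syntax noprefix
.data
rsp:
.long 1
.text
or:
ret
"""

for lineno, label, reason in find_risky_labels(example):
    print(f"line {lineno}: label '{label}' collides with a {reason}")
```

Run on the example, the sketch flags both the rsp label and the or label, the two cases described in this report.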
Hi, I'm Soomin Kim from KAIST SoftSec Lab.