If you examine the invalid memory address that the program is trying to read from (0x680a21303320658c), it looks like a fragment of the program's output stream:
$ echo '0x680a21303320658c' | tac -r --separator='..' | xxd -r -p | xxd
00000000: 8c65 2033 3021 0a68 .e 30!.h
The dumped execution trace shows that a3 was computed as a2 plus an offset:
S/349826 C/349826 I/349786 PC/0x00000000000182d8 (0x00179693) slli a3, a5, 1
RS0/a5 0x0000000f
RD/a3 0x0000001e
S/349827 C/349827 I/349787 PC/0x00000000000182dc (0x00d606b3) add a3, a2, a3
RS0/a2 0x680a21303320656e
RS1/a3 0x0000001e
RD/a3 0x680a21303320658c
S/349828 C/349828 I/349788 PC/0x00000000000182e0 (0x0006d583) lhu a1, 0(a3)
RS0/a3 0x680a21303320658c
ADDR 0x680a21303320658c
EXCEPTION 0x0000000000000001
EVEC 0x0000000000004418
ECAUSE 0x000000000000000a
EPC 0x00000000000182e0
SR 0x000000f9
And the value of a2 really is part of the output stream, (... li)ne 30!\nh(ello ...):
$ echo '680a21303320656e' | tac -r --separator='..' | xxd -r -p | xxd
00000000: 6e65 2033 3021 0a68 ne 30!.h
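The same check can be done without the tac/xxd pipeline. This is a minimal sketch in Python, assuming a little-endian RV64 target; the helper name as_memory_bytes is made up for illustration.

```python
def as_memory_bytes(value: int, width: int = 8) -> bytes:
    """Return the bytes a little-endian machine stores for a `width`-byte value."""
    return value.to_bytes(width, "little")


a2 = 0x680A21303320656E  # value loaded into a2 from the TLS block
a3 = 0x680A21303320658C  # faulting address, a2 + 0x1e

print(as_memory_bytes(a2))  # b'ne 30!\nh'
print(as_memory_bytes(a3))  # same text except for the first byte, which the add changed
```

Only the lowest byte differs between the two values, which is why the faulting address still reads as text from the second byte onward.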
The problem is caused by sys_brk(0) behaving differently on Linux and on the Proxy Kernel (PK). In detail:
%a3 is supposed to hold a pointer to tcache->counts[tc_idx], as the disassembly shows:
000000000001824c <_int_free>:
...
if (tcache != NULL && tc_idx < mp_.tcache_bins)
1828c: 02020793 add a5,tp,32 # 20 <thread_arena>
18290: 0087b603 ld a2,8(a5)
18294: 03913c23 sd s9,56(sp)
18298: 03b13423 sd s11,40(sp)
1829c: 00050c93 mv s9,a0
182a0: 40060063 beqz a2,186a0 <_int_free+0x454>
182a4: 000dc597 auipc a1,0xdc
182a8: d8458593 add a1,a1,-636 # f4028 <mp_>
182ac: 0685b503 ld a0,104(a1)
size_t tc_idx = csize2tidx (size);
182b0: fef48793 add a5,s1,-17
182b4: 0047d793 srl a5,a5,0x4
if (__glibc_unlikely (e->key == tcache_key))
182b8: 000e2d97 auipc s11,0xe2
182bc: 710d8d93 add s11,s11,1808 # fa9c8 <perturb_byte>
if (tcache != NULL && tc_idx < mp_.tcache_bins)
182c0: 02a7f463 bgeu a5,a0,182e8 <_int_free+0x9c>
if (__glibc_unlikely (e->key == tcache_key))
182c4: 008db883 ld a7,8(s11)
182c8: 01843303 ld t1,24(s0)
if (cnt >= mp_.tcache_count)
182cc: 0785b503 ld a0,120(a1)
tcache_entry *e = (tcache_entry *) chunk2mem (p);
182d0: 01040813 add a6,s0,16
if (__glibc_unlikely (e->key == tcache_key))
182d4: 73130863 beq t1,a7,18a04 <_int_free+0x7b8>
if (tcache->counts[tc_idx] < mp_.tcache_count)
182d8: 00179693 sll a3,a5,0x1
182dc: 00d606b3 add a3,a2,a3
182e0: 0006d583 lhu a1,0(a3)
182e4: 42a5e263 bltu a1,a0,18708 <_int_free+0x4bc>
...
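The address arithmetic in the last three instructions can be replayed by hand. The sketch below assumes what the disassembly suggests: counts sits at offset 0 of the structure tcache points to, and each counter is 2 bytes wide (index shifted left by 1, loaded with lhu). The function name is hypothetical.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def counts_slot_address(tcache_ptr: int, tc_idx: int) -> int:
    """Mirror `sll a3,a5,0x1` followed by `add a3,a2,a3` on a 64-bit machine."""
    return (tcache_ptr + (tc_idx << 1)) & MASK64


# Values taken from the trace: a5 (tc_idx) = 0xf, a2 (tcache) = clobbered pointer.
clobbered = counts_slot_address(0x680A21303320656E, 0xF)
print(hex(clobbered))  # 0x680a21303320658c, the address that faults

# With the pointer value seen earlier in the trace (0x000fb250) the load would be valid.
healthy = counts_slot_address(0x000FB250, 0xF)
print(hex(healthy))  # 0xfb26e
```

So the arithmetic itself is fine; the fault comes entirely from the base pointer in a2.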
In the source, tcache is a thread-local variable defined as static __thread tcache_perthread_struct *tcache = NULL;. The way its address is computed confirms that it is thread-local: the pointer is loaded from a location addressed relative to the %tp register (the thread pointer, which points to the TLS block):
S/349807 C/349807 I/349767 PC/0x000000000001828c (0x02020793) addi a5, tp, 32
RS0/tp 0x000fb9a0
RD/a5 0x000fb9c0
S/349808 C/349808 I/349768 PC/0x0000000000018290 (0x0087b603) ld a2, 8(a5)
RS0/a5 0x000fb9c0
RD/a2 0x680a21303320656e
ADDR 0x000fb9c8
...
S/349827 C/349827 I/349787 PC/0x00000000000182dc (0x00d606b3) add a3, a2, a3
RS0/a2 0x680a21303320656e
RS1/a3 0x0000001e
RD/a3 0x680a21303320658c
S/349828 C/349828 I/349788 PC/0x00000000000182e0 (0x0006d583) lhu a1, 0(a3)
RS0/a3 0x680a21303320658c
ADDR 0x680a21303320658c
EXCEPTION 0x0000000000000001
EVEC 0x0000000000004418
ECAUSE 0x000000000000000a
EPC 0x00000000000182e0
SR 0x000000f9
So the pointer value stored in the TLS block has definitely been clobbered. To find out what overwrote this region, the trace was searched in reverse order, starting from the offending instruction (lhu a1,0(a3) @ 0x182e0), for all memory accesses that reference the affected region, i.e., the range [0x000fb9c8, 0x000fb9d0). In the listing below, I/<seq> is the sequence number of an instruction in the dynamic instruction stream:
S/349808 C/349808 I/349768 PC/0x0000000000018290 (0x0087b603) ld a2, 8(a5)
RS0/a5 0x000fb9c0
RD/a2 0x680a21303320656e
ADDR 0x000fb9c8
S/349783 C/349783 I/349743 PC/0x000000000001b788 (0x00873683) ld a3, 8(a4)
RS0/a4 0x000fb9c0
RD/a3 0x680a21303320656e
ADDR 0x000fb9c8
S/300383 C/300383 I/300343 PC/0x000000000001d148 (0xfed70fa3) sb a3, -1(a4)
RS0/a4 0x000fb9d0
RS1/a3 0x00000068
ADDR 0x000fb9cf
S/300065 C/300065 I/300025 PC/0x000000000001d1a4 (0x00c68023) sb a2, 0(a3)
RS0/a3 0x000fb9ce
RS1/a2 0x0000000a
ADDR 0x000fb9ce
S/300060 C/300060 I/300020 PC/0x000000000001d1a4 (0x00c68023) sb a2, 0(a3)
RS0/a3 0x000fb9cd
RS1/a2 0x00000021
ADDR 0x000fb9cd
S/299946 C/299946 I/299906 PC/0x000000000001d1a4 (0x00c68023) sb a2, 0(a3)
RS0/a3 0x000fb9cc
RS1/a2 0x00000030
ADDR 0x000fb9cc
S/299941 C/299941 I/299901 PC/0x000000000001d1a4 (0x00c68023) sb a2, 0(a3)
RS0/a3 0x000fb9cb
RS1/a2 0x00000033
ADDR 0x000fb9cb
S/299730 C/299730 I/299690 PC/0x000000000001d1a4 (0x00c68023) sb a2, 0(a3)
RS0/a3 0x000fb9ca
RS1/a2 0x00000020
ADDR 0x000fb9ca
S/299725 C/299725 I/299685 PC/0x000000000001d1a4 (0x00c68023) sb a2, 0(a3)
RS0/a3 0x000fb9c9
RS1/a2 0x00000065
ADDR 0x000fb9c9
S/299720 C/299720 I/299680 PC/0x000000000001d1a4 (0x00c68023) sb a2, 0(a3)
RS0/a3 0x000fb9c8
RS1/a2 0x0000006e
ADDR 0x000fb9c8
S/273680 C/273680 I/273641 PC/0x0000000000019770 (0x0087b783) ld a5, 8(a5)
RS0/a5 0x000fb9c0
RD/a5 0x000fb250
ADDR 0x000fb9c8
S/273632 C/273632 I/273593 PC/0x000000000001b07c (0x00843783) ld a5, 8(s0)
RS0/s0 0x000fb9c0
RD/a5 0x000fb250
ADDR 0x000fb9c8
...
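A reverse search of this kind is easy to script. The sketch below assumes the trace layout shown in this report (a header line starting with S/ followed by field lines such as ADDR); the function names are made up, and it only compares the start address of each access, so a wide access that begins below the range but reaches into it would be missed.

```python
import re

# Header of one dynamic instruction, e.g.
# "S/349808 C/349808 I/349768 PC/0x0000000000018290 (0x0087b603) ld a2, 8(a5)"
HEADER = re.compile(
    r"^S/\d+ C/\d+ I/(\d+) PC/(0x[0-9a-fA-F]+) \(0x[0-9a-fA-F]+\) (.+)$"
)


def parse_records(lines):
    """Group trace lines into one dict per instruction, keeping its ADDR if any."""
    records = []
    for raw in lines:
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            records.append({
                "seq": int(match.group(1)),
                "pc": int(match.group(2), 16),
                "asm": match.group(3),
            })
        elif records and line.startswith("ADDR "):
            records[-1]["addr"] = int(line.split()[1], 16)
    return records


def accesses_in_range(records, lo, hi, start_seq):
    """Yield accesses whose address is in [lo, hi), newest first, from start_seq back."""
    for rec in sorted(records, key=lambda r: r["seq"], reverse=True):
        if rec["seq"] <= start_seq and "addr" in rec and lo <= rec["addr"] < hi:
            yield rec


SAMPLE = """\
S/300383 C/300383 I/300343 PC/0x000000000001d148 (0xfed70fa3) sb a3, -1(a4)
RS0/a4 0x000fb9d0
RS1/a3 0x00000068
ADDR 0x000fb9cf
S/349808 C/349808 I/349768 PC/0x0000000000018290 (0x0087b603) ld a2, 8(a5)
RS0/a5 0x000fb9c0
RD/a2 0x680a21303320656e
ADDR 0x000fb9c8
S/349826 C/349826 I/349786 PC/0x00000000000182d8 (0x00179693) slli a3, a5, 1
RS0/a5 0x0000000f
RD/a3 0x0000001e
""".splitlines()

hits = list(accesses_in_range(parse_records(SAMPLE), 0x000FB9C8, 0x000FB9D0,
                              start_seq=349788))
for rec in hits:
    print(f"I/{rec['seq']} PC/{rec['pc']:#x} ADDR {rec['addr']:#x} {rec['asm']}")
```

On the three sample records, the search reports the ld at I/349768 first and then the sb at I/300343, and skips the slli, which has no memory access.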
So up to and including I/273641 (PC 0x0000000000019770), the tcache pointer kept in the TLS block still looks valid. The culprits are the byte store instructions at PC 0x1d1a4 and 0x1d148. The reconstructed call graph shows that those instructions belong to the memcpy function called by fprintf, which main calls to print lines to the standard output.
The call graph and instruction trace reveal that the I/O output buffer (for stdout, used by fprintf) is [0x000fb710, 0x000fc710), which overlaps the region [0x000fb9c8, 0x000fb9d0) that holds the tcache pointer in the TLS block (related files: glibc/stdio-common/fprintf.c:32, glibc/stdio-common/vfprintf-internal.c:1522, glibc/stdio-common/printf_buffer_to_file.c:30, glibc/stdio-common/vfprintf-internal.c:1523). This is why the tcache pointer stored in the TLS block was replaced by the strings being printed to stdout.
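The overlap can be confirmed with a few lines of interval arithmetic. This is only a sketch; the addresses are the ones quoted in this report, with the tcache slot taken as the 8 bytes starting at 0x000fb9c8, and the helper name is hypothetical.

```python
def overlap(a, b):
    """Return the intersection of two half-open ranges [lo, hi), or None if disjoint."""
    lo = max(a[0], b[0])
    hi = min(a[1], b[1])
    return (lo, hi) if lo < hi else None


stdout_buffer = (0x000FB710, 0x000FC710)  # I/O buffer used by fprintf
tcache_slot = (0x000FB9C8, 0x000FB9D0)    # 8-byte tcache pointer in the TLS block

common = overlap(stdout_buffer, tcache_slot)
print([hex(x) for x in common])                    # ['0xfb9c8', '0xfb9d0']
print(hex(tcache_slot[0] - stdout_buffer[0]))      # 0x2b8
```

The whole pointer slot lies inside the buffer, 0x2b8 bytes from its start, so it is overwritten as soon as enough output has been buffered.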
To find out why the I/O output buffer overlaps the TLS region, we need to locate the code that allocates the I/O buffer and the TLS block. Searching for the first occurrence of 0x000fb710 (I/O buffer for stdout) and 0x000fb9a0 (the TLS block pointed to by register %tp) reveals the origin of those addresses:
The I/O buffer for stdout, used by fprintf, is allocated by a call to malloc.
The TLS block pointed to by %tp is allocated by a call to _dl_early_allocate.
Now, the real question is why the _dl_early_allocate and malloc calls hand out memory regions that overlap each other.
What happens here is that _dl_early_allocate is called during glibc start-up to allocate space for the TLS block. At that moment, __curbrk (a global variable glibc uses to track the current brk address of the main heap) has not been initialized yet. In this case _dl_early_allocate does not initialize __curbrk; it leaves that to the first later call to brk. It simply issues a raw sys_brk(0) system call to obtain the current brk address, adds the requested size to it, and makes a second sys_brk call to allocate the requested amount of space:
_dl_early_allocate (size_t size)
{
  ...
  if (__curbrk != NULL)
    ...
  else
    {
      /* If brk has not been invoked, there is no need to update
         __curbrk. The first call to brk will take care of that. */
      void *previous = __brk_call (0);
      result = __brk_call (previous + size);
      if (result == previous)
        result = NULL;
      else
        result = previous;
    }
  ...
  return result;
}
Later, when the glibc start-up routine begins to initialize malloc's internals, __sbrk is called for the first time and initializes __curbrk. It does so by calling the __brk function with argument 0. __brk is glibc's wrapper for the sys_brk system call; it updates the __curbrk variable with the return value of sys_brk. Here comes the problem:
On Linux, this works fine because sys_brk(0) does nothing and just returns the current brk address, since the requested new brk address, zero, is invalid.
On the Proxy Kernel, this is where the overlap comes from: PK's sys_brk(0) can change the brk address. It moves the brk to the minimum possible address (e.g., right above the .bss section) and returns the brk address after the change. As a result, every allocation made by _dl_early_allocate before __curbrk is initialized starts at the same address (right after .bss), and malloc's internal state is also set to start allocating memory right after the .bss section.
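The difference can be replayed with a toy model. The sketch below is a simplified model and not the kernels' actual code: the class names, the clamp-to-minimum rule for PK, and the two sizes are assumptions chosen to match the description above. Only the start address 0x000fb238 comes from the trace.

```python
class LinuxLikeKernel:
    """sys_brk ignores requests below the minimum and returns the current brk."""

    def __init__(self, brk_min):
        self.brk_min = brk_min
        self.cur = brk_min

    def sys_brk(self, addr):
        if addr >= self.brk_min:
            self.cur = addr
        return self.cur


class OldPkLikeKernel(LinuxLikeKernel):
    """sys_brk clamps the request to the minimum, so sys_brk(0) resets the brk."""

    def sys_brk(self, addr):
        self.cur = max(addr, self.brk_min)
        return self.cur


def startup(kernel, tls_size, heap_size):
    # _dl_early_allocate with __curbrk == NULL: two raw sys_brk calls.
    previous = kernel.sys_brk(0)
    kernel.sys_brk(previous + tls_size)
    tls = (previous, previous + tls_size)
    # First __sbrk: __brk(0) initializes __curbrk from the kernel's answer.
    curbrk = kernel.sys_brk(0)
    # malloc then grows the heap from __curbrk.
    kernel.sys_brk(curbrk + heap_size)
    heap = (curbrk, curbrk + heap_size)
    return tls, heap


BSS_END = 0x000FB238                     # from the trace
TLS_SIZE, HEAP_SIZE = 0x800, 0x2000      # illustrative sizes, not measured

linux_tls, linux_heap = startup(LinuxLikeKernel(BSS_END), TLS_SIZE, HEAP_SIZE)
pk_tls, pk_heap = startup(OldPkLikeKernel(BSS_END), TLS_SIZE, HEAP_SIZE)

print("linux:", [hex(x) for x in linux_tls], [hex(x) for x in linux_heap])
print("pk:   ", [hex(x) for x in pk_tls], [hex(x) for x in pk_heap])
```

In the Linux-like model the heap begins where the TLS block ends; in the PK-like model both begin at 0x000fb238, which is the overlap observed in the trace.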
Machine code for _dl_early_allocate:
void *
_dl_early_allocate (size_t size)
{
20e18: ff010113 add sp,sp,-16
20e1c: 00813023 sd s0,0(sp)
20e20: 00113423 sd ra,8(sp)
void *result;
if (__curbrk != NULL)
20e24: 000da797 auipc a5,0xda
20e28: c047b783 ld a5,-1020(a5) # faa28 <___brk_addr>
{
20e2c: 00050413 mv s0,a0
if (__curbrk != NULL)
20e30: 02078663 beqz a5,20e5c <_dl_early_allocate+0x44>
/* If the break has been initialized, brk must have run before,
so just call it once more. */
{
result = __sbrk (size);
20e34: a14fe0ef jal 1f048 <__sbrk>
if (result == (void *) -1)
20e38: fff00713 li a4,-1
result = __sbrk (size);
20e3c: 00050793 mv a5,a0
if (result == (void *) -1)
20e40: 02e50c63 beq a0,a4,20e78 <_dl_early_allocate+0x60>
}
/* If brk fails, fall back to mmap. This can happen due to
unfortunate ASLR layout decisions and kernel bugs, particularly
for static PIE. */
if (result == NULL)
20e44: 02078a63 beqz a5,20e78 <_dl_early_allocate+0x60>
else
result = (void *) ret;
}
return result;
}
20e48: 00813083 ld ra,8(sp)
20e4c: 00013403 ld s0,0(sp)
20e50: 00078513 mv a0,a5
20e54: 01010113 add sp,sp,16
20e58: 00008067 ret
Execution trace (the return value of _dl_early_allocate is 0x000fb238):
S/146786 C/146786 I/146779 PC/0x0000000000020e50 (0x00078513) mv a0, a5
RS0/a5 0x000fb238
RD/a0 0x000fb238
S/146787 C/146787 I/146780 PC/0x0000000000020e54 (0x01010113) addi sp, sp, 16
RS0/sp 0x0ff6dcc0
RD/sp 0x0ff6dcd0
S/146788 C/146788 I/146781 PC/0x0000000000020e58 (0x00008067) ret
RS0/ra 0x00010c1c
TAKEN_PC 0x00010c1c
The program headers of helloworld.rv64 are:
$ riscv64-unknown-linux-gnu-readelf -l helloworld.rv64
Elf file type is EXEC (Executable file)
Entry point 0x10564
There are 7 program headers, starting at offset 64
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
RISCV_ATTRIBUT 0x00000000000e3d35 0x0000000000000000 0x0000000000000000
0x0000000000000046 0x0000000000000000 R 0x1
LOAD 0x0000000000000000 0x0000000000010000 0x0000000000010000
0x000000000007e086 0x000000000007e086 R E 0x2000
LOAD 0x000000000007ebe0 0x0000000000090be0 0x0000000000090be0
0x0000000000065128 0x000000000006a658 RW 0x2000
NOTE 0x00000000000001c8 0x00000000000101c8 0x00000000000101c8
0x0000000000000020 0x0000000000000020 R 0x4
TLS 0x000000000007ebe0 0x0000000000090be0 0x0000000000090be0
0x0000000000000018 0x0000000000000058 R 0x8
GNU_STACK 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 RW 0x10
GNU_RELRO 0x000000000007ebe0 0x0000000000090be0 0x0000000000090be0
0x0000000000063420 0x0000000000063420 R 0x1
Section to Segment mapping:
Segment Sections...
00 .riscv.attributes
01 .note.ABI-tag .rela.dyn .text .rodata .eh_frame .gcc_except_table
02 .tdata .preinit_array .init_array .fini_array .data.rel.ro .data .got .sdata .bss
03 .note.ABI-tag
04 .tdata .tbss
05
06 .tdata .preinit_array .init_array .fini_array .data.rel.ro
That means the image layout is:
.text: 0x0000000000010000 - 0x000000000008e086
.data: 0x0000000000090be0 - 0x00000000000f5d08
.bss: 0x00000000000f5d08 - 0x00000000000fb238
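These boundaries follow directly from the two LOAD entries. The sketch below copies the numbers from the readelf output and, like the layout above, treats the file-backed part of the RW segment as .data and its zero-filled tail as .bss, which is a simplification.

```python
# (VirtAddr, FileSiz, MemSiz) of the two LOAD segments reported by readelf.
TEXT_SEGMENT = (0x10000, 0x7E086, 0x7E086)   # R E
DATA_SEGMENT = (0x90BE0, 0x65128, 0x6A658)   # RW

text_end = TEXT_SEGMENT[0] + TEXT_SEGMENT[2]
data_end = DATA_SEGMENT[0] + DATA_SEGMENT[1]  # end of file-backed data, start of the zero-filled tail
bss_end = DATA_SEGMENT[0] + DATA_SEGMENT[2]   # end of the whole RW segment

print(hex(text_end))  # 0x8e086
print(hex(data_end))  # 0xf5d08
print(hex(bss_end))   # 0xfb238
```

The end of the RW segment, 0x000fb238, is exactly the value _dl_early_allocate returned in the trace, so the TLS block was placed immediately after .bss.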
The previous toolchain port (GCC 9.2.0/glibc 2.29) did not crash this way because this version of the _dl_early_allocate function was only implemented in 2022. glibc 2.29, used in the previous port, initializes the TLS block with a direct call to __sbrk (which initializes __curbrk immediately).
One possible fix is to make the proxy kernel's sys_brk behave like the Linux version: at least when the new brk is 0x0, it should do nothing and only return the current brk. Patch 5a31996 implements this behavior and has been tested and found effective.
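Consider the sample program below:
Save it to helloword.c, and compile it with the command:
Then run it with spike built from riscv-isa-sim @ e69ca83 and pk built from riscv-pk @ c2af001. So far so good:
However, if you run spike with standard output redirection, the RV64 program will crash: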
build from riscv-isa-sim @ e69ca83 and pk build from riscv-pk @ c2af001. So far so good:However, if you run spike with standard output redirection, the RV64 program will crash: