RV64 program compiled by gcc-13.2.0 (with glibc2.38) crash if using standard output redirection

Consider the sample program below:

#include <stdio.h>

int main()
{
    for (int i = 0; i < 99; i++)
        fprintf(stdout, "hello world - line %d!\n", i);
    return 0;
}

Save it to helloword.c, and compile it with the command:

riscv64-unknown-linux-gnu-gcc -static -Wl,-Ttext-segment,0x10000 -g -O0 -o ./helloworld.rv64 helloworld.c

Then run it with spike build from riscv-isa-sim @ e69ca83 and pk build from riscv-pk @ c2af001. So far so good:

$ spike -m256 pk helloworld.rv64
Requesting target memory 0x10000000
******* Resetting core ********** 
****Initializing the processor system****
******* Resetting core ********** 
******* Resetting core ********** 
****Initialization complete****
hello world - line 0!
hello world - line 1!
hello world - line 2!
hello world - line 3!
hello world - line 4!
hello world - line 5!
hello world - line 6!
hello world - line 7!
hello world - line 8!
hello world - line 9!
hello world - line 10!
hello world - line 11!
hello world - line 12!
hello world - line 13!
hello world - line 14!
hello world - line 15!
hello world - line 16!
hello world - line 17!
hello world - line 18!
hello world - line 19!
hello world - line 20!
hello world - line 21!
hello world - line 22!
hello world - line 23!
hello world - line 24!
hello world - line 25!
hello world - line 26!
hello world - line 27!
hello world - line 28!
hello world - line 29!
hello world - line 30!
hello world - line 31!
hello world - line 32!
hello world - line 33!
hello world - line 34!
hello world - line 35!
hello world - line 36!
hello world - line 37!
hello world - line 38!
hello world - line 39!
hello world - line 40!
hello world - line 41!
hello world - line 42!
hello world - line 43!
hello world - line 44!
hello world - line 45!
hello world - line 46!
hello world - line 47!
hello world - line 48!
hello world - line 49!
hello world - line 50!
hello world - line 51!
hello world - line 52!
hello world - line 53!
hello world - line 54!
hello world - line 55!
hello world - line 56!
hello world - line 57!
hello world - line 58!
hello world - line 59!
hello world - line 60!
hello world - line 61!
hello world - line 62!
hello world - line 63!
hello world - line 64!
hello world - line 65!
hello world - line 66!
hello world - line 67!
hello world - line 68!
hello world - line 69!
hello world - line 70!
hello world - line 71!
hello world - line 72!
hello world - line 73!
hello world - line 74!
hello world - line 75!
hello world - line 76!
hello world - line 77!
hello world - line 78!
hello world - line 79!
hello world - line 80!
hello world - line 81!
hello world - line 82!
hello world - line 83!
hello world - line 84!
hello world - line 85!
hello world - line 86!
hello world - line 87!
hello world - line 88!
hello world - line 89!
hello world - line 90!
hello world - line 91!
hello world - line 92!
hello world - line 93!
hello world - line 94!
hello world - line 95!
hello world - line 96!
hello world - line 97!
hello world - line 98!
******* Resetting core **********

However, if you run spike with standard output redirection, the RV64 program will crash:

$ spike -m256 pk helloworld.rv64 | cat
Requesting target memory 0x10000000
******* Resetting core ********** 
****Initializing the processor system****
******* Resetting core ********** 
******* Resetting core ********** 
****Initialization complete****
z  0000000000000000 ra 000000000001b7d4 sp 000000000ff6da80 gp 00000000000f6500
tp 00000000000fb9a0 t0 0000000000000001 t1 0000000200000001 t2 0000000000000000
s0 00000000000fb5f0 s1 0000000000000110 a0 0000000000000007 a1 00000000000f4028
a2 680a21303320656e a3 680a21303320658c a4 0000000000000111 a5 000000000000000f
a6 00000000000fb600 a7 7f2443cebe3590c0 s2 000000006c6c6568 s3 0000000000000004
s4 0000000000000001 s5 0000000000000001 s6 0000000000000000 s7 0000000000000002
s8 0000000000000000 s9 00000000000f40c0 sA 00000000000f65b0 sB 00000000000fa9c8
t3 0000000000000000 t4 0000000000000000 t5 000000000008e07e t6 0000000000010000
******* Resetting core ********** 
sr 00000000000000f9 pc 00000000000182ec va 680a21303320658c insn       ffffffff
User load segfault @ 0x680a21303320658c

$ spike -m256 pk helloworld.rv64 > stdout.txt                    
Requesting target memory 0x10000000
******* Resetting core ********** 
****Initializing the processor system****
******* Resetting core ********** 
******* Resetting core ********** 
****Initialization complete****
******* Resetting core ********** 

$ cat stdout.txt           
z  0000000000000000 ra 000000000001b7d4 sp 000000000ff6da80 gp 00000000000f6500
tp 00000000000fb9a0 t0 0000000000000001 t1 0000000200000001 t2 0000000000000000
s0 00000000000fb5f0 s1 0000000000000110 a0 0000000000000007 a1 00000000000f4028
a2 680a21303320656e a3 680a21303320658c a4 0000000000000111 a5 000000000000000f
a6 00000000000fb600 a7 7f2443cebe3590c0 s2 000000006c6c6568 s3 0000000000000004
s4 0000000000000001 s5 0000000000000001 s6 0000000000000000 s7 0000000000000002
s8 0000000000000000 s9 00000000000f40c0 sA 00000000000f65b0 sB 00000000000fa9c8
t3 0000000000000000 t4 0000000000000000 t5 000000000008e07e t6 0000000000010000
sr 00000000000000f9 pc 00000000000182ec va 680a21303320658c insn       ffffffff
User load segfault @ 0x680a21303320658c

TL;DR.

The problem is caused by the different behavior of sys_brk(0) in Linux and Proxy Kernel:

Analysis

%a3 is supposed to hold a pointer to tcache->counts[tc_idx], which can be observed from disassembly:

000000000001824c <_int_free>:
...
    if (tcache != NULL && tc_idx < mp_.tcache_bins)
   1828c:   02020793            add a5,tp,32 # 20 <thread_arena>
   18290:   0087b603            ld  a2,8(a5)
   18294:   03913c23            sd  s9,56(sp)
   18298:   03b13423            sd  s11,40(sp)
   1829c:   00050c93            mv  s9,a0
   182a0:   40060063            beqz    a2,186a0 <_int_free+0x454>
   182a4:   000dc597            auipc   a1,0xdc
   182a8:   d8458593            add a1,a1,-636 # f4028 <mp_>
   182ac:   0685b503            ld  a0,104(a1)
    size_t tc_idx = csize2tidx (size);
   182b0:   fef48793            add a5,s1,-17
   182b4:   0047d793            srl a5,a5,0x4
    if (__glibc_unlikely (e->key == tcache_key))
   182b8:   000e2d97            auipc   s11,0xe2
   182bc:   710d8d93            add s11,s11,1808 # fa9c8 <perturb_byte>
    if (tcache != NULL && tc_idx < mp_.tcache_bins)
   182c0:   02a7f463            bgeu    a5,a0,182e8 <_int_free+0x9c>
    if (__glibc_unlikely (e->key == tcache_key))
   182c4:   008db883            ld  a7,8(s11)
   182c8:   01843303            ld  t1,24(s0)
        if (cnt >= mp_.tcache_count)
   182cc:   0785b503            ld  a0,120(a1)
    tcache_entry *e = (tcache_entry *) chunk2mem (p);
   182d0:   01040813            add a6,s0,16
    if (__glibc_unlikely (e->key == tcache_key))
   182d4:   73130863            beq t1,a7,18a04 <_int_free+0x7b8>
    if (tcache->counts[tc_idx] < mp_.tcache_count)
   182d8:   00179693            sll a3,a5,0x1
   182dc:   00d606b3            add a3,a2,a3
   182e0:   0006d583            lhu a1,0(a3)
   182e4:   42a5e263            bltu    a1,a0,18708 <_int_free+0x4bc>
...

In the source, tcache is a thread-local variable defined as:static __thread tcache_perthread_struct *tcache = NULL;. The computation chain of its address also indicates it is a thread-local thing: it is loaded from a location pointed by the %tp register (thread pointer pointing to the TLS block):

S/349807 C/349807 I/349767 PC/0x000000000001828c (0x02020793) addi    a5, tp, 32
    RS0/tp  0x000fb9a0
    RD/a5   0x000fb9c0

S/349808 C/349808 I/349768 PC/0x0000000000018290 (0x0087b603) ld      a2, 8(a5)
    RS0/a5  0x000fb9c0
    RD/a2   0x680a21303320656e
    ADDR    0x000fb9c8

...

S/349827 C/349827 I/349787 PC/0x00000000000182dc (0x00d606b3) add     a3, a2, a3
    RS0/a2  0x680a21303320656e
    RS1/a3  0x0000001e
    RD/a3   0x680a21303320658c

S/349828 C/349828 I/349788 PC/0x00000000000182e0 (0x0006d583) lhu     a1, 0(a3)
    RS0/a3  0x680a21303320658c
    ADDR    0x680a21303320658c
    EXCEPTION   0x0000000000000001
    EVEC    0x0000000000004418
    ECAUSE  0x000000000000000a
    EPC 0x00000000000182e0
    SR  0x000000f9

So, definitely, the pointer value stored in the TLS block has been clobbered. To find who messed up this region, the trace was searched in the reserve order starting from the offending instruction (lhu a1,0(a3) @ 0x182e0) for all the memory accesses referencing the impacted region, i.e., the range [0x000fb9c8, 0x000fb9dd0). See below, I/<seq> is the sequence number of an instruction in the dynamic instruction stream:

S/349808 C/349808 I/349768 PC/0x0000000000018290 (0x0087b603) ld      a2, 8(a5)
    RS0/a5  0x000fb9c0
    RD/a2   0x680a21303320656e
    ADDR    0x000fb9c8
S/349783 C/349783 I/349743 PC/0x000000000001b788 (0x00873683) ld      a3, 8(a4)
    RS0/a4  0x000fb9c0
    RD/a3   0x680a21303320656e
    ADDR    0x000fb9c8
S/300383 C/300383 I/300343 PC/0x000000000001d148 (0xfed70fa3) sb      a3, -1(a4)
    RS0/a4  0x000fb9d0
    RS1/a3  0x00000068
    ADDR    0x000fb9cf
S/300065 C/300065 I/300025 PC/0x000000000001d1a4 (0x00c68023) sb      a2, 0(a3)
    RS0/a3  0x000fb9ce
    RS1/a2  0x0000000a
    ADDR    0x000fb9ce
S/300060 C/300060 I/300020 PC/0x000000000001d1a4 (0x00c68023) sb      a2, 0(a3)
    RS0/a3  0x000fb9cd
    RS1/a2  0x00000021
    ADDR    0x000fb9cd
S/299946 C/299946 I/299906 PC/0x000000000001d1a4 (0x00c68023) sb      a2, 0(a3)
    RS0/a3  0x000fb9cc
    RS1/a2  0x00000030
    ADDR    0x000fb9cc
S/299941 C/299941 I/299901 PC/0x000000000001d1a4 (0x00c68023) sb      a2, 0(a3)
    RS0/a3  0x000fb9cb
    RS1/a2  0x00000033
    ADDR    0x000fb9cb
S/299730 C/299730 I/299690 PC/0x000000000001d1a4 (0x00c68023) sb      a2, 0(a3)
    RS0/a3  0x000fb9ca
    RS1/a2  0x00000020
    ADDR    0x000fb9ca
S/299725 C/299725 I/299685 PC/0x000000000001d1a4 (0x00c68023) sb      a2, 0(a3)
    RS0/a3  0x000fb9c9
    RS1/a2  0x00000065
    ADDR    0x000fb9c9
S/299720 C/299720 I/299680 PC/0x000000000001d1a4 (0x00c68023) sb      a2, 0(a3)
    RS0/a3  0x000fb9c8
    RS1/a2  0x0000006e
    ADDR    0x000fb9c8
S/273680 C/273680 I/273641 PC/0x0000000000019770 (0x0087b783) ld      a5, 8(a5)
    RS0/a5  0x000fb9c0
    RD/a5   0x000fb250
    ADDR    0x000fb9c8
S/273632 C/273632 I/273593 PC/0x000000000001b07c (0x00843783) ld      a5, 8(s0)
    RS0/s0  0x000fb9c0
    RD/a5   0x000fb250
    ADDR    0x000fb9c8
...

So before I/273641 PC/0x0000000000019770 the tcache pointer kept in the TLS block still looks valid. The culprit is a bunch of byte store instructions with PC 0x1d1a4 and 0x1d148. The reconstructed call graph shows those instructions belong to the memcpy function called by fprintf, which was called by main to print lines to the standard output:

The call graph and instruction trace reveals the IO output buffer (for stdout and used by the fprintf) is [0x000fb710, 0x000fc710), which overlaps the region [0x000fb9c8, 0x000fb9dd0) for keeping the tcache pointer in the TLS block (related files:glibc/stdio-common/fprintf.c:32, glibc/stdio-common/vfprintf-internal.c:1522, glibc/stdio-common/printf_buffer_to_file.c:30, glibc/stdio-common/vfprintf-internal.c:1523). And this is why the tcache pointer stored in the TLB block was replaced by the strings to be printed to the stdout.

To dig out why the IO output buffer overlaps the TLS region, we need to locate the codes allocating the IO buffer and TLS block. Searching the first occurrence of 0x000fb710 (IO buffer for stdout) and 0x000fb9a0 (the TLS block pointed by register %tp) reveals the origin of those addresses:

The output buffer of stdout and used by fprintf is allocated by a call to malloc:

The TLS block hold in %tp is allocated by a call to _dl_early_allocate:

Now, the real question is why _dl_early_allocate and malloc call will hand out memory regions that overlap with each other.

What happened here is _dl_early_allocate was called during the glibc start-up phase to allocate space for the TLS block. At this moment, __curbrk (a global variable used by glibc to track the current brk address of the main heap) has not been initialized yet. _dl_early_allocate in this case will not initialize __curbrk and will let the first call to brk in the future take care of the current brk variable initialization. It will just issue raw sys_brk(0) system call first to obtain the current brk address, then add it with the requested memory size, and finally make another sys_brk call to allocate the requested amount of space:

_dl_early_allocate (size_t size)
{
...
  if (__curbrk != NULL)
    ...
  else
    {
      /* If brk has not been invoked, there is no need to update
         __curbrk.  The first call to brk will take care of that.  */
      void *previous = __brk_call (0);
      result = __brk_call (previous + size);
      if (result == previous)
        result = NULL;
      else
        result = previous;
    }
...
  return result;
}

Later, when the glibc start-up routine begins to initialize the internal of malloc, __sbrk will be called the first time and initialize __curbrk.

This is done by by calling __brk function with argument 0. __brk is a glibc wrapper function for sys_brk system call. It updates __curbrk variable with the return value of sys_brk. Here comes the problem:

On Linux, this works fine because sys_brk(0) will do nothing and just return the address of the current brk because the new brk address, zero, is invalid.

On Proxy Kernel, this is why the overlapping happened: because PK's sys_brk(0) can change the brk address: it moves the brk address to the minimal possible address (e.g., right above the .bss section) and return the brk address whatever after the change. So, all the allocations made by _dl_early_allocate before initializing __curbrk have the same address (starting right after .bss), and the internal of malloc is also set to start allocating memory right after .bss section.

Machine code for _dl_early_allocate:

void *
_dl_early_allocate (size_t size)
{
 20e18:       ff010113                add     sp,sp,-16
 20e1c:       00813023                sd      s0,0(sp)
 20e20:       00113423                sd      ra,8(sp)
void *result;

if (__curbrk != NULL)
 20e24:       000da797                auipc   a5,0xda
 20e28:       c047b783                ld      a5,-1020(a5) # faa28 <___brk_addr>
{
 20e2c:       00050413                mv      s0,a0
if (__curbrk != NULL)
 20e30:       02078663                beqz    a5,20e5c <_dl_early_allocate+0x44>
  /* If the break has been initialized, brk must have run before,
     so just call it once more.  */
  {
    result = __sbrk (size);
 20e34:       a14fe0ef                jal     1f048 <__sbrk>
    if (result == (void *) -1)
 20e38:       fff00713                li      a4,-1
    result = __sbrk (size);
 20e3c:       00050793                mv      a5,a0
    if (result == (void *) -1)
 20e40:       02e50c63                beq     a0,a4,20e78 <_dl_early_allocate+0x60>
  }

/* If brk fails, fall back to mmap.  This can happen due to
   unfortunate ASLR layout decisions and kernel bugs, particularly
   for static PIE.  */
if (result == NULL)
 20e44:       02078a63                beqz    a5,20e78 <_dl_early_allocate+0x60>
    else
      result = (void *) ret;
  }

return result;
}
 20e48:       00813083                ld      ra,8(sp)
 20e4c:       00013403                ld      s0,0(sp)
 20e50:       00078513                mv      a0,a5
 20e54:       01010113                add     sp,sp,16
 20e58:       00008067                ret

Execution trace (the return value of _dl_early_allocate is 0x000fb238):

S/146786 C/146786 I/146779 PC/0x0000000000020e50 (0x00078513) mv      a0, a5
RS0/a5  0x000fb238
RD/a0   0x000fb238

S/146787 C/146787 I/146780 PC/0x0000000000020e54 (0x01010113) addi    sp, sp, 16
RS0/sp  0x0ff6dcc0
RD/sp   0x0ff6dcd0

S/146788 C/146788 I/146781 PC/0x0000000000020e58 (0x00008067) ret
RS0/ra  0x00010c1c
TAKEN_PC    0x00010c1c

The program header of helloworld.rv64 is:

$ riscv64-unknown-linux-gnu-readelf -l helloworld.rv64

Elf file type is EXEC (Executable file)
Entry point 0x10564
There are 7 program headers, starting at offset 64

Program Headers:
Type           Offset             VirtAddr           PhysAddr
               FileSiz            MemSiz              Flags  Align
RISCV_ATTRIBUT 0x00000000000e3d35 0x0000000000000000 0x0000000000000000
               0x0000000000000046 0x0000000000000000  R      0x1
LOAD           0x0000000000000000 0x0000000000010000 0x0000000000010000
               0x000000000007e086 0x000000000007e086  R E    0x2000
LOAD           0x000000000007ebe0 0x0000000000090be0 0x0000000000090be0
               0x0000000000065128 0x000000000006a658  RW     0x2000
NOTE           0x00000000000001c8 0x00000000000101c8 0x00000000000101c8
               0x0000000000000020 0x0000000000000020  R      0x4
TLS            0x000000000007ebe0 0x0000000000090be0 0x0000000000090be0
               0x0000000000000018 0x0000000000000058  R      0x8
GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
               0x0000000000000000 0x0000000000000000  RW     0x10
GNU_RELRO      0x000000000007ebe0 0x0000000000090be0 0x0000000000090be0
               0x0000000000063420 0x0000000000063420  R      0x1

Section to Segment mapping:
Segment Sections...
 00     .riscv.attributes 
 01     .note.ABI-tag .rela.dyn .text .rodata .eh_frame .gcc_except_table 
 02     .tdata .preinit_array .init_array .fini_array .data.rel.ro .data .got .sdata .bss 
 03     .note.ABI-tag 
 04     .tdata .tbss 
 05     
 06     .tdata .preinit_array .init_array .fini_array .data.rel.ro

That means the image layout is:

.text:  0x0000000000010000 - 0x000000000008e086
.data:  0x0000000000090be0 - 0x00000000000f5d08
.bss:   0x00000000000f5d08 - 0x00000000000fb238

The previous toolchain port (GCC 9.2.0/glibc 2.29) did not experience a similar crash because this version of _dl_early_allocate function was implemented in 2022, and glibc 2.29 in the previous port will just initialize the TLS block with a direct call to __sbrk (which initializes __curbrk immediately).

Fix

One possible fix is making the proxy kernel's sys_brk behave like the Linux version, at least when the new brk is 0x0 it should do nothing and only return the current brk. Patch 5a31996 implemented this behavior and has been tested effective.

s117 / anycore-riscv

RV64 program compiled by gcc-13.2.0 (with glibc2.38) crash if using standard output redirection #21

TL;DR.

Analysis

Fix