Closed ldalessa closed 3 months ago
I have reduced the test case by making data
as small as necessary, and eliminated the loop:
[[gnu::noipa]] void foo() {
volatile char data[81];
data[2] = '\0';
}
int main( int argc, char** argv ) {
foo();
return argv[0][0];
}
The assignment to data[2]
apparently corrupts argv[0]
, making the return argv[0][0]
dereference a nullptr
.
I looked at the assembly code, and it looks mostly correct, although it only allocates 80
bytes for data
instead of 81
, although that shouldn't affect the data[2]
assignment, and might be making sure sp
is aligned to 16
bytes:
00000000000101b6 <foo()>:
101b6: 715d add sp,sp,-80
101b8: 00010123 sb zero,2(sp)
101bc: 6161 add sp,sp,80
101be: 8082 ret
0000000000010106 <main>:
10106: 1141 add sp,sp,-16
10108: e022 sd s0,0(sp)
1010a: e406 sd ra,8(sp)
1010c: 842e mv s0,a1
1010e: 0a8000ef jal 101b6 <foo()>
10112: 601c ld a5,0(s0)
10114: 60a2 ld ra,8(sp)
10116: 6402 ld s0,0(sp)
10118: 0007c503 lbu a0,0(a5)
1011c: 0141 add sp,sp,16
1011e: 8082 ret
When I change the declaration to data[82]
, the compiler completely changes the stack allocation of data
to the next 16
byte stack allocation, and makes the assignment of data[2]
look strange:
[[gnu::noipa]] void foo() {
volatile char data[82];
data[2] = '\0';
}
int main( int argc, char** argv ) {
foo();
return argv[0][0];
}
00000000000101b6 <foo()>:
101b6: 711d add sp,sp,-96
101b8: 00010523 sb zero,10(sp)
101bc: 6125 add sp,sp,96
101be: 8082 ret
0000000000010106 <main>:
10106: 1141 add sp,sp,-16
10108: e022 sd s0,0(sp)
1010a: e406 sd ra,8(sp)
1010c: 842e mv s0,a1
1010e: 0a8000ef jal 101b6 <foo()>
10112: 601c ld a5,0(s0)
10114: 60a2 ld ra,8(sp)
10116: 6402 ld s0,0(sp)
10118: 0007c503 lbu a0,0(a5)
1011c: 0141 add sp,sp,16
1011e: 8082 ret
I will try to find the physical address of argv[0]
in the simulator, and set up watchpoints to catch its corruption.
It looks like an off-by-1 allocation bug, since char data[81]
is allocated only 80
bytes, while char data[82]
jumps to 96
bytes. But in our test case, we don't use data[80]
or data[81]
, so the compiler is free to generate code which produces the same results (the "as if" rule).
It looks like an off-by-1 allocation bug, since
char data[81]
is allocated only80
bytes, whilechar data[82]
jumps to96
bytes. But in our test case, we don't usedata[80]
ordata[81]
, so the compiler is free to generate code which produces the same results (the "as if" rule).
This does seem weird. I don't know why gcc rounds to 80 instead of 16, like clang does. My original code seems to allocate the 256 fine, and the crosstool-ng
musl
toolchain I'm using doesn't produce the off-by-one for 81
. I can't reproduce that at godbolt either (clang is smart enough to know it doesn't need the full 96 bytes).
Either way, that sb
shouldn't corrupt anything right, as it's well within the stack allocation (even if the stack allocation has been compacted because we're not accessing parts of it)?
Either way, that
sb
shouldn't corrupt anything right, as it's well within the stack allocation (even if the stack allocation has been compacted because we're not accessing parts of it)?
Yes. It is corrupting argv[0]
in main()
. I'm going through Rev trying to trace how and why it occurs.
Please verify this has been fixed (see #288 ) and close it.
Looks like it's fixed my issues. Thank you.
I'm still having some problems with stack corruption. I've tried to reduce the test case and come up with the following.
#include <cstdio>
int main(int, char** argv)
{
char a[512]{};
int n = snprintf(a, sizeof(a), "Processing records from %s\n", argv[1]);
if (argv[1] == nullptr) {
asm volatile( ".dword 0x00000000" ::: "memory" );
}
return 0;
}
Compiled this time as:
$ riscv64-unknown-linux-musl-gcc -march=rv64imafdc -o test test.cpp
Oddly, if I do not store the unused n
on the stack, there is no issue, if I leave the n
there I fail the assert.
This one uses snprintf
from musl
, but I do not believe that this is causing the issue directly. If you can't reproduce this let me know and I can try and install a different toolchain.
$ riscv64-unknown-linux-musl-gcc --version
riscv64-unknown-linux-musl-gcc (crosstool-NG 1.26.0.85_06fad54) 14.1.0
Copyright (C) 2024 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Wait wait. I wasn't using -static
. I still am having problems but this test case doesn't exhibit it with -static
. I'll close for now and reopen once I reduce something.
Ugh, no... sorry for the churn. -static
does not resolve it, the test is still failing.
There might be two distinct issues at work. My reduced case and another unrelated argv
corruption bug were fixed by correcting the allocation of argv
. That wasn't caused by corruption, but by the misallocation of argv
to use the same space as other variables.
This may be running into an unrelated bug which might be corruption.
Okay. It wasn’t manifesting when I was using a static char buffer and it was manifesting with the nulled argv[] pointer so I assumed it was related. I guess if those are allocated near the stack then any memory corruption could look like this.
Does it still fail after https://github.com/tactcomplabs/rev/pull/290 has been merged into devel
?
Does it still fail after #290 has been merged into
devel
?
Hi Lee. This failure is still present. That #290 is referencing #289 and seems to be targeting it.
Does it still fail after #290 has been merged into
devel
?Hi Lee. This failure is still present. That #290 is referencing #289 and seems to be targeting it.
Yes, but it fixed a lot of corruption bugs.
I have reduced it some more, switching to C (smaller executable):
#include <stdio.h>
int main(int argc, char** argv)
{
char a[365];
int n = snprintf(a, sizeof(a), "");
if (!argv[1])
asm(".word 0");
}
The 365
size of a
is the size which triggers it. 364
does not. a
does not need to be initialized. The snprintf
is apparently necessary, but it can be an empty print. I tried other functions like puts
, atoi
and they did not trigger it. The variable n
is necessary.
(I am providing a single argument with --args="[1]"
.)
The stack pointer was not always aligned on 16 bytes. Try it now.
The stack pointer was not always aligned on 16 bytes. Try it now.
This seems to have resolved the test you posted, but if I bump that buffer size back up to 512
(using your C code example) it's still corrupting argv
and failing the assert.
I'm using musl
, I'm not sure if that's making any difference.
Increasing it 16
bytes to 381
makes it fail. I am stepping through the simulator, trying to figure out where corruption is occurring.
The corruption occurs:
104ac: f482 sd zero,104(sp)
when sp + 104 == 0x3ffffb33
, and &argv[1] == 0x3ffffb38
.
It is in _svfprintf_r
, which snprintf
calls.
sp
has the address 0x3ffffacb
, which is not aligned. When main()
is entered, sp == 0x3ffffffb
, which is not aligned.
It should be fixed now.
Works both for my reduced test case and RL code now. Thanks Lee.
Describe the bug
When I allocate and initialize a small
char
buffer on the stack it seems to be corrupting memory, in particular theargv
state. My understanding is that_STACK_SIZE_
is1MB
so I don't think that's the problem.To Reproduce
SST stuff
REV stuff (configured with
-DCMAKE_CXX_COMPILER=mpic++ -DCMAKE_BUILD_TYPE=Release
).Test program.
volatile
to prevent the loop from being removed,noipa
to prevent the call tofoo
from being removed, I split it into a functionfoo
to make it easier for me to look at the asm, but that doesn't impact the bug as long as the loop isn't removed by the dead-code passes.C++ compiler stuff, aggressive flags to minimize produced asm (flags don't matter).
Output asm
I'm using the
rev-model-options-config.py
provided inrev/test
.Additional context
I have tried rev's Debug, RelWithDebInfo, Release, and default builds and this doesn't matter. I have tried building Rev with just
g++
instead of through thempic++
wrapper and that doesn't matter. The size of the buffer does matter, if I drop down into the 190s it won't fail, depending on the build that I'm using, but 256 bytes consistently failed.