Closed datacompboy closed 1 month ago
Yes, golang part is this one: https://github.com/golang/go/blob/27093581b2828a2752a6d2711def09517eb2513b/src/runtime/signal_unix.go#L938
func dieFromSignal(sig uint32) {
unblocksig(sig)
raise(sig)
...
where unblocksig is sigprocmask(_SIG_UNBLOCK, &set, nil) and raise(sig) is tgkill(getpid(), gettid(), sig)
I think some of the confusion re the strace log may be because we recently switched from logging the syscalls before exection to logging after (since this lets us observe and log what was "returned" via output params). It looks like the logging happens after we do signal handling (including invoking handlers which themselves may make syscalls). We might be able to rearrange it so that the logging happens after syscall return but before signal handlers are invoked, but it might be tricky.
Here's where panic happens in the source (shim/src/signals.rs
)
if sigaltstack_orig_emu
.flags_retain()
.contains(SigAltStackFlags::SS_ONSTACK)
{
// The specified stack is already in use.
//
// Documentation is unclear what should happen, but switching to
// the already-in-use stack would almost certainly go badly.
panic!("Alternate stack already in use.")
}
Basically, if our bookkeeping is correct here, it looks like the managed code is in one signal handler configured to run with an alternate stack, and while that's still running gets and handles another signal configured to run on the same stack. The thinking for panicking here is that this would likely indicate a bug in shadow's bookkeeping (or maybe the managed program), but in this case it looks like the intent might be for the stack to never unwind, so it might actually be safe.
I would try changing the panic!
to a warn!
. i.e. give us something in the logs that things might be about to go badly, but let it keep running and see what happens.
Okay, I've constructed minimal repro without need for anything but golang example & restarter and clean latest origin/main (daa5b470969226322a50e110f3f02b6e9e7d1517).
To install "restarter":
$ git clone https://bitbucket.org/sivann/restarter.git
$ cd restarter
$ make
$ cp restarter ~/.local/bin/restarter
Golang I have right now:
$ go version
go version go1.19.6 linux/amd64
Shadow repro patch:
diff --git a/src/test/golang/CMakeLists.txt b/src/test/golang/CMakeLists.txt
index 8065f91a1..8b233bc6c 100644
--- a/src/test/golang/CMakeLists.txt
+++ b/src/test/golang/CMakeLists.txt
@@ -15,7 +15,7 @@ macro(add_golang_test_exe)
build
# Ensure LD_PRELOAD will work. Without this golang programs may or may
# not be linked dynamically, depending on their dependencies.
- -linkshared
+ #-linkshared
-o ${CMAKE_CURRENT_BINARY_DIR}/${AGTE_BASENAME}
${CMAKE_CURRENT_SOURCE_DIR}/${AGTE_BASENAME}.go
)
diff --git a/src/test/golang/simple_http.yaml b/src/test/golang/simple_http.yaml
index f989b5591..419ff6187 100644
--- a/src/test/golang/simple_http.yaml
+++ b/src/test/golang/simple_http.yaml
@@ -9,8 +9,13 @@ hosts:
server:
network_node_id: 0
processes:
- - path: ./test_simple_http
- start_time: 3s
+ - path: restarter
+ args:
+ - "-t"
+ - "3"
+ - "-c"
+ - "../../../test_simple_http"
+ start_time: 1s
expected_final_state: running
client1: &host
Now you can build & trigger panic:
$ ./setup build --test --extra && ./setup test --extra 'simple_http'; cat build/src/test/golang/simple_http-shadow.data/hosts/server/*.shimlog
...
The following tests FAILED:
70 - simple_http-shadow (Timeout)
...
2024-08-19 18:31:15,830 [INFO] ctest returned 8
thread '<unnamed>' panicked at lib/shim/src/signals.rs:195:17:
Alternate stack already in use.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fatal runtime error: failed to initiate panic, error 5
I think some of the confusion re the strace log may be because we recently switched from logging the syscalls before exection to logging after (since this lets us observe and log what was "returned" via output params).
Oh, thanks. That explains a lot! (I've tried to repro this with minimal raw C code, and ended up in various examples which doesn't produce panic -- but clearly miss a bunch of syscalls from the interrupt).
With panic! => warn! change:
00:00:00.034182638 [00:00:04.000000000] [shd-shim] [WARN] [lib/shim/src/signals.rs:195] [shadow_shim::signals] Alternate stack already in use.
and server correctly restarted and continue to work after the restart as expected.
Not sure is this universal, or just doesn't matter in this case (double stack during SIGTERM processing)
Great, thanks for the update! I'll plan to put together a PR with this change, and hopefully a regression test. It seems like we might be able to repro with something like a kill -s SIGTERM 1000
rather than bringing in restarter
.
You right! Just "kill -s SIGTERM 1000" indeed triggers the same behaviour:
$ cat build/src/test/golang/simple_http-shadow.data/hosts/server/*shimlog
00:00:00.038816200 [00:00:05.000000000] [shd-shim] [WARN] [lib/shim/src/signals.rs:195] [shadow_shim::signals] Alternate stack already in use.
Test:
hosts:
server:
network_node_id: 0
processes:
- path: ./test_simple_http
start_time: 1s
expected_final_state: running
- path: kill
args: "-s SIGTERM 1000"
start_time: 5s
Somehow I thought the problem is not just on golang side.
Describe the issue I am trying to get large system under the shadow to shake the tree and see where fruits fall. I've wrapped the parts of system under https://bitbucket.org/sivann/restarter.git to get things auto-restarted and some also auto-killed. That working well, until recently I've got to the condition, when shadow stuck.
Digging into stuck state shown shadow waiting for zombie, and the zombie has the "Alternate stack already in use." panic in shimlog; the syscall: tgkill(self, self, 15).
To Reproduce I can't reliable reproduce it under reduced setup (yet). :(
Operating System (please complete the following information):
Additional context
The panic itself:
Last syscalls (1000 -- main restarter process, 1001 -- restarer's child spawner & monitor, 1002 -- target process):