Open jinengandhi-intel opened 1 year ago
I have been able to reproduce this issue. The issue seems to be only related to ioctl(..., FIOASYNC, ...). This is the first ioctl on one of the file descriptors of the pipe. The code following the ioctl in the test code (crash program) can be deactivated.
My feeling is it has something to do with the shutdown-sequencing of processes versus ipc worker because the following delay resolve the issue:
static int wait_for_loop(int pid)
{
if (pid < 0)
exit(1);
int status = 0;
while (waitpid(-1, &status, __WALL) != pid) {
}
sleep(1); // <---------------- delay added
return WEXITSTATUS(status);
}
I instrumented the code a bit:
diff --git a/libos/src/sys/libos_ioctl.c b/libos/src/sys/libos_ioctl.c
index 89d5424da..ef9e260c0 100644
--- a/libos/src/sys/libos_ioctl.c
+++ b/libos/src/sys/libos_ioctl.c
@@ -20,6 +20,7 @@ static void signal_io(IDTYPE caller, void* arg) {
__UNUSED(caller);
__UNUSED(arg);
/* TODO: fill these values e.g. by getting the handle in arg; this is completely unusable now */
+ log_debug(">>>>>>>>> SENDING SIGIO <<<<<<<<<<<<<<<");
siginfo_t info = {
.si_signo = SIGIO,
.si_code = SI_SIGIO,
diff --git a/pal/src/pal_object.c b/pal/src/pal_object.c
index 4728c1762..c4c9dd7f8 100644
--- a/pal/src/pal_object.c
+++ b/pal/src/pal_object.c
@@ -29,6 +29,8 @@ void PalObjectDestroy(PAL_HANDLE handle) {
int PalStreamsWaitEvents(size_t count, PAL_HANDLE* handle_array, pal_wait_flags_t* events,
pal_wait_flags_t* ret_events, uint64_t* timeout_us) {
for (size_t i = 0; i < count; i++) {
+ log_debug(">>>>>>>>>> handle_array[%ld]: %p", i, handle_array[i]);
+ log_debug(">>>>>>>>>> type: %d [0x%x] (< %d)", handle_array[i]->hdr.type, handle_array[i]->hdr.type, PAL_HANDLE_TYPE_BOUND);
assert(!handle_array[i] || handle_array[i]->hdr.type < PAL_HANDLE_TYPE_BOUND);
}
This leads to the following failure output (the type is always -1557600 on my machine):
pal_object.c:32:PalStreamsWaitEvents) debug: >>>>>>>>>> handle_array[1]: 0x7fffffe83630
(pal_object.c:33:PalStreamsWaitEvents) debug: >>>>>>>>>> type: 1 [0x1] (< 12)
(pal_object.c:33:PalStreamsWaitEvents) debug: >>>>>>>>>> type: 2 [0x2] (< 12)
(pal_object.c:32:PalStreamsWaitEvents) debug: >>>>>>>>>> handle_array[2]: 0x7fffffe83a30
(pal_object.c:33:PalStreamsWaitEvents) debug: >>>>>>>>>> type: 3 [0x3] (< 12)
(pal_object.c:32:PalStreamsWaitEvents) debug: >>>>>>>>>> handle_array[0]: 0x7fffffe83530
(libos_ipc.c:222:ipc_send_message_to_conn) [P2:T2:crash] debug: Sending ipc message to 1
(pal_object.c:33:PalStreamsWaitEvents) debug: >>>>>>>>>> type: 3 [0x3] (< 12)
(pal_object.c:32:PalStreamsWaitEvents) debug: >>>>>>>>>> handle_array[1]: 0x7fffffe83d30
(pal_object.c:33:PalStreamsWaitEvents) debug: >>>>>>>>>> type: -1557600 [0xffe83ba0] (< 12)
(pal_object.c:34:PalStreamsWaitEvents) assert failed ../pal/src/pal_object.c:34 !handle_array[i] || handle_array[i]->hdr.type < PAL_HANDLE_TYPE_BOUND
My next guess would be that the array being walked in pal/src/pal_object.c by the libos_async_worker doesn't have as many elements anymore and/or is being modified concurrently by something else.
Further investigation seems to show that the pipe's PAL_HANDLE has been free()'d in pipe_destroy() and that's why the handle_array[1] above is corrupt. This seems to happen after the ipc handler has visited the async_list and called PalStreamsWaitEvents() with the free()'d PAL_HANDLE still in the list.
The callpath libos_exit::thread_exit -> detach_all_fds -> detach_fd -> put_handle -> PalObjectDestroy() is the one responsible for the deletion of the PAL_HANDLE.
The solution may be to implement maybe_uninstall_async_event() that 1) checks whether async_worker_state != WORKER_ALIVE and in this case just returns 2) otherwise consults libos_async::async_list for whether the PAL_HANDLE is on that list and if this is the case then adds a DELETE event for the async_worker so it runs PalObjectDestroy() (via put_handle()). 3) we (may) have to ensure that the async_worker processes this event even if the next thing is does is take the exit
This seems to happen after the ipc handler has visited the async_list and called PalStreamsWaitEvents() with the free()'d PAL_HANDLE still in the list.
@stefanberger I don't understand what you're saying here. The IPC helper thread and the Async helper thread are two different threads doing two different things. The IPC helper thread never touches async_list
.
If you can explain with the links to specific code lines, this will be better. Currently I'm lost in your explanation.
The issue seems to be only related to ioctl(..., FIOASYNC, ...).
I looked at FIOASYNC implementation and the Async helper thread implementation:
object
field in this file)So the object
field -- which points to the PAL handle (of e.g. the pipe) -- is used here in the Async thread:
So IIUC, that's the situation with the failure:
ioctl(..., FIOASYNC, ...)
on a pipeasync_list
and ultimately to the PalStreamsWaitEvents()
list of wait-on-pal-handles.TLDR: The problem is that at steps 1-2, we didn't increment the ref count of the pipe object. In particular, this line is bad: https://github.com/gramineproject/gramine/blob/1f72aaf8148a7712c5c8bad1da9839e2bcc5c633/libos/src/sys/libos_ioctl.c#L106-L108
But that's not all. Just adding proper refcounting to this case may also be wrong. It looks like our FIOASYNC
is just totally bogus, we don't even set up the fd
parameter of the SIGIO
signal.
FIOASYNC
logic.Alternative solution for now -- to prevent these "assert failed" cases -- is to remove this logic completely from Gramine. It's clearly unused, and it only introduces bugs into our code base.
I looked more into it, and it's used by Nginx. Nginx spawns child processes and uses FIOASYNC
generated SIGIO signals as a means of communication between the master process and the worker child processes.
Without FIOASYNC implemented in Gramine, that's what I get in Nginx error log:
2024/01/09 08:37:33 [alert] 1#0: ioctl(FIOASYNC) failed while spawning "worker process" (22: Invalid argument)
2024/01/09 08:37:33 [alert] 1#0: ioctl(FIOASYNC) failed while spawning "worker process" (22: Invalid argument)
2024/01/09 08:37:33 [alert] 1#0: ioctl(FIOASYNC) failed while spawning "worker process" (22: Invalid argument)
2024/01/09 08:37:33 [alert] 1#0: ioctl(FIOASYNC) failed while spawning "worker process" (22: Invalid argument)
Description of the problem
Syzkaller has been modified in order to run with Gramine (https://github.com/JaewonHur/syzkaller). We were able to run it with Gramine v1.5 for Gramine-Direct and below crash is reported by the tool.
(pal_object.c:50:PalStreamsWaitEvents) assert failed ../pal/src/pal_object.c:50 !handle_array[i] || handle_array[i]->hdr.type < PAL_HANDLE_TYPE_BOUND
Steps to reproduce
This crash was reported 10 times by Syzkaller fuzz tool but we are not able to reproduce the crashes manually. Filing this bug to see if from the logs we can deduce if this is a real bug or not. Note: the first instance of the issue was found only after the fuzzer had already run for more than 30hours.
Attaching a zip of the crash logs from a few test programs along with their C source files, specifically look at
crash-ccc71c0aa0201433.log
and any other crash file. pal_handle_assert.zipExpected results
No response
Actual results
... (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- return from futex(...) = 0x0 (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- clock_gettime(1, 0x7fe6bfcbe530) = 0x0 (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- write(2, 0x7fe6bfcbe3b0, 0x32) ... (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- return from write(...) = 0x32 (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- write(2, 0x7fe6bfcbe3b0, 0x3) ... (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- return from write(...) = 0x3 (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- write(2, 0x7fe6bfcbe3b0, 0x2) ... (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- return from write(...) = 0x2 (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- write(2, 0x7fe6bfcbe3b0, 0x6) ... (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- return from write(...) = 0x6 (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- write(2, 0x7fe6bfcbe3b0, 0x2) ... (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- return from write(...) = 0x2 (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- write(2, 0x7fe6bfcbe3b0, 0xa) ... (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- return from write(...) = 0xa (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- write(2, 0x7fe6bfcbe3b0, 0x2) ... (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- return from write(...) = 0x2 (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- ioctl(6, OP 0x89f0, 0x20001840) ... (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- return from ioctl(...) = -25 (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- clock_gettime(1, 0x7fe6bfcbe530) = 0x0 (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- write(2, 0x7fe6bfcbe3b0, 0x44) ... (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- return from write(...) = 0x44 (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- write(2, 0x7fe6bfcbe3b0, 0x9) ... (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- return from write(...) = 0x9 (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- write(2, 0x7fe6bfcbe3b0, 0x1) ... (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- return from write(...) = 0x1 (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- futex(0x7fe6bf14bf4c, FUTEX_PRIVATE|FUTEX_WAKE, 1000000, 0x1, 0, 0) ... (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- return from futex(...) = 0x0 (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- write(248, 0x7fe6bfcbe550, 0x2c) ... (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- return from write(...) = 0x2c (libos_exit.c:212:libos_syscall_exit_group) [P2:T2:syz-executor] debug: ---- exit_group (returning 1) (libos_ipc.c:222:ipc_send_message_to_conn) [P2:T2:syz-executor] debug: Sending ipc message to 1 (libos_ipc_worker.c:193:receive_ipc_messages) [P1:libos] debug: IPC worker: received IPC message from 2: code=17 size=21 seq=2 (libos_fs_lock.c:785:file_lock_clear_pid) [P1:libos] debug: clearing file (POSIX) locks for pid 2 (libos_ipc.c:222:ipc_send_message_to_conn) [P1:libos] debug: Sending ipc message to 2 (libos_ipc_worker.c:193:receive_ipc_messages) [P2:libos] debug: IPC worker: received IPC message from 1: code=0 size=21 seq=2 (libos_ipc.c:341:ipc_response_callback) [P2:libos] debug: Got an IPC response from 1, seq: 2 (libos_ipc.c:257:wait_for_response) [P2:T2:syz-executor] debug: Waiting for a response to 2 (libos_ipc.c:264:wait_for_response) [P2:T2:syz-executor] debug: Waiting finished: Success (PAL_ERROR_SUCCESS) (libos_async.c:316:libos_async_worker) [P2:libos] debug: Async IO event triggered at 1693021561458619 (pal_object.c:50:PalStreamsWaitEvents) assert failed ../pal/src/pal_object.c:50 !handle_array[i] || handle_array[i]->hdr.type < PAL_HANDLE_TYPE_BOUND (libos_ipc_child.c:21:ipc_child_disconnect_callback) [P1:libos] error: Child process (vmid: 0x2) got disconnected
Gramine commit hash
0b552696d4f402c291c8c2cce525e7117d06d397