gramineproject / gramine

A library OS for Linux multi-process applications, with Intel SGX support
GNU Lesser General Public License v3.0
584 stars 193 forks source link

Gramine-Direct Fuzzing: assert failed ../pal/src/pal_object.c:50 !handle_array[i] || handle_array[i]->hdr.type < PAL_HANDLE_TYPE_BOUND #1525

Open jinengandhi-intel opened 1 year ago

jinengandhi-intel commented 1 year ago

Description of the problem

Syzkaller has been modified in order to run with Gramine (https://github.com/JaewonHur/syzkaller). We were able to run it with Gramine v1.5 for Gramine-Direct and below crash is reported by the tool.

(pal_object.c:50:PalStreamsWaitEvents) assert failed ../pal/src/pal_object.c:50 !handle_array[i] || handle_array[i]->hdr.type < PAL_HANDLE_TYPE_BOUND

Steps to reproduce

This crash was reported 10 times by Syzkaller fuzz tool but we are not able to reproduce the crashes manually. Filing this bug to see if from the logs we can deduce if this is a real bug or not. Note: the first instance of the issue was found only after the fuzzer had already run for more than 30hours.

Attaching a zip of the crash logs from a few test programs along with their C source files, specifically look at crash-ccc71c0aa0201433.log and any other crash file. pal_handle_assert.zip

Expected results

No response

Actual results

... (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- return from futex(...) = 0x0 (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- clock_gettime(1, 0x7fe6bfcbe530) = 0x0 (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- write(2, 0x7fe6bfcbe3b0, 0x32) ... (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- return from write(...) = 0x32 (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- write(2, 0x7fe6bfcbe3b0, 0x3) ... (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- return from write(...) = 0x3 (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- write(2, 0x7fe6bfcbe3b0, 0x2) ... (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- return from write(...) = 0x2 (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- write(2, 0x7fe6bfcbe3b0, 0x6) ... (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- return from write(...) = 0x6 (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- write(2, 0x7fe6bfcbe3b0, 0x2) ... (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- return from write(...) = 0x2 (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- write(2, 0x7fe6bfcbe3b0, 0xa) ... (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- return from write(...) = 0xa (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- write(2, 0x7fe6bfcbe3b0, 0x2) ... (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- return from write(...) = 0x2 (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- ioctl(6, OP 0x89f0, 0x20001840) ... (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- return from ioctl(...) = -25 (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- clock_gettime(1, 0x7fe6bfcbe530) = 0x0 (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- write(2, 0x7fe6bfcbe3b0, 0x44) ... (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- return from write(...) = 0x44 (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- write(2, 0x7fe6bfcbe3b0, 0x9) ... (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- return from write(...) = 0x9 (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- write(2, 0x7fe6bfcbe3b0, 0x1) ... (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- return from write(...) = 0x1 (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- futex(0x7fe6bf14bf4c, FUTEX_PRIVATE|FUTEX_WAKE, 1000000, 0x1, 0, 0) ... (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- return from futex(...) = 0x0 (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- write(248, 0x7fe6bfcbe550, 0x2c) ... (libos_parser.c:1628:buf_write_all) [P2:T2:syz-executor] trace: ---- return from write(...) = 0x2c (libos_exit.c:212:libos_syscall_exit_group) [P2:T2:syz-executor] debug: ---- exit_group (returning 1) (libos_ipc.c:222:ipc_send_message_to_conn) [P2:T2:syz-executor] debug: Sending ipc message to 1 (libos_ipc_worker.c:193:receive_ipc_messages) [P1:libos] debug: IPC worker: received IPC message from 2: code=17 size=21 seq=2 (libos_fs_lock.c:785:file_lock_clear_pid) [P1:libos] debug: clearing file (POSIX) locks for pid 2 (libos_ipc.c:222:ipc_send_message_to_conn) [P1:libos] debug: Sending ipc message to 2 (libos_ipc_worker.c:193:receive_ipc_messages) [P2:libos] debug: IPC worker: received IPC message from 1: code=0 size=21 seq=2 (libos_ipc.c:341:ipc_response_callback) [P2:libos] debug: Got an IPC response from 1, seq: 2 (libos_ipc.c:257:wait_for_response) [P2:T2:syz-executor] debug: Waiting for a response to 2 (libos_ipc.c:264:wait_for_response) [P2:T2:syz-executor] debug: Waiting finished: Success (PAL_ERROR_SUCCESS) (libos_async.c:316:libos_async_worker) [P2:libos] debug: Async IO event triggered at 1693021561458619 (pal_object.c:50:PalStreamsWaitEvents) assert failed ../pal/src/pal_object.c:50 !handle_array[i] || handle_array[i]->hdr.type < PAL_HANDLE_TYPE_BOUND (libos_ipc_child.c:21:ipc_child_disconnect_callback) [P1:libos] error: Child process (vmid: 0x2) got disconnected

Gramine commit hash

0b552696d4f402c291c8c2cce525e7117d06d397

stefanberger commented 11 months ago

I have been able to reproduce this issue. The issue seems to be only related to ioctl(..., FIOASYNC, ...). This is the first ioctl on one of the file descriptors of the pipe. The code following the ioctl in the test code (crash program) can be deactivated.

My feeling is it has something to do with the shutdown-sequencing of processes versus ipc worker because the following delay resolve the issue:

static int wait_for_loop(int pid)
{
  if (pid < 0)
    exit(1);
  int status = 0;
  while (waitpid(-1, &status, __WALL) != pid) {
  }
  sleep(1);  // <---------------- delay added
  return WEXITSTATUS(status);
}

I instrumented the code a bit:

diff --git a/libos/src/sys/libos_ioctl.c b/libos/src/sys/libos_ioctl.c
index 89d5424da..ef9e260c0 100644
--- a/libos/src/sys/libos_ioctl.c
+++ b/libos/src/sys/libos_ioctl.c
@@ -20,6 +20,7 @@ static void signal_io(IDTYPE caller, void* arg) {
     __UNUSED(caller);
     __UNUSED(arg);
     /* TODO: fill these values e.g. by getting the handle in arg; this is completely unusable now */
+    log_debug(">>>>>>>>> SENDING SIGIO <<<<<<<<<<<<<<<");
     siginfo_t info = {
         .si_signo = SIGIO,
         .si_code = SI_SIGIO,
diff --git a/pal/src/pal_object.c b/pal/src/pal_object.c
index 4728c1762..c4c9dd7f8 100644
--- a/pal/src/pal_object.c
+++ b/pal/src/pal_object.c
@@ -29,6 +29,8 @@ void PalObjectDestroy(PAL_HANDLE handle) {
 int PalStreamsWaitEvents(size_t count, PAL_HANDLE* handle_array, pal_wait_flags_t* events,
                          pal_wait_flags_t* ret_events, uint64_t* timeout_us) {
     for (size_t i = 0; i < count; i++) {
+        log_debug(">>>>>>>>>> handle_array[%ld]: %p", i, handle_array[i]);
+        log_debug(">>>>>>>>>> type: %d [0x%x] (< %d)", handle_array[i]->hdr.type, handle_array[i]->hdr.type, PAL_HANDLE_TYPE_BOUND);
         assert(!handle_array[i] || handle_array[i]->hdr.type < PAL_HANDLE_TYPE_BOUND);
     }

This leads to the following failure output (the type is always -1557600 on my machine):

pal_object.c:32:PalStreamsWaitEvents) debug: >>>>>>>>>> handle_array[1]: 0x7fffffe83630
(pal_object.c:33:PalStreamsWaitEvents) debug: >>>>>>>>>> type: 1 [0x1] (< 12)
(pal_object.c:33:PalStreamsWaitEvents) debug: >>>>>>>>>> type: 2 [0x2] (< 12)
(pal_object.c:32:PalStreamsWaitEvents) debug: >>>>>>>>>> handle_array[2]: 0x7fffffe83a30
(pal_object.c:33:PalStreamsWaitEvents) debug: >>>>>>>>>> type: 3 [0x3] (< 12)
(pal_object.c:32:PalStreamsWaitEvents) debug: >>>>>>>>>> handle_array[0]: 0x7fffffe83530
(libos_ipc.c:222:ipc_send_message_to_conn) [P2:T2:crash] debug: Sending ipc message to 1
(pal_object.c:33:PalStreamsWaitEvents) debug: >>>>>>>>>> type: 3 [0x3] (< 12)
(pal_object.c:32:PalStreamsWaitEvents) debug: >>>>>>>>>> handle_array[1]: 0x7fffffe83d30
(pal_object.c:33:PalStreamsWaitEvents) debug: >>>>>>>>>> type: -1557600 [0xffe83ba0] (< 12)
(pal_object.c:34:PalStreamsWaitEvents) assert failed ../pal/src/pal_object.c:34 !handle_array[i] || handle_array[i]->hdr.type < PAL_HANDLE_TYPE_BOUND

My next guess would be that the array being walked in pal/src/pal_object.c by the libos_async_worker doesn't have as many elements anymore and/or is being modified concurrently by something else.

stefanberger commented 11 months ago

Further investigation seems to show that the pipe's PAL_HANDLE has been free()'d in pipe_destroy() and that's why the handle_array[1] above is corrupt. This seems to happen after the ipc handler has visited the async_list and called PalStreamsWaitEvents() with the free()'d PAL_HANDLE still in the list.

stefanberger commented 11 months ago

The callpath libos_exit::thread_exit -> detach_all_fds -> detach_fd -> put_handle -> PalObjectDestroy() is the one responsible for the deletion of the PAL_HANDLE.

The solution may be to implement maybe_uninstall_async_event() that 1) checks whether async_worker_state != WORKER_ALIVE and in this case just returns 2) otherwise consults libos_async::async_list for whether the PAL_HANDLE is on that list and if this is the case then adds a DELETE event for the async_worker so it runs PalObjectDestroy() (via put_handle()). 3) we (may) have to ensure that the async_worker processes this event even if the next thing is does is take the exit

dimakuv commented 8 months ago

This seems to happen after the ipc handler has visited the async_list and called PalStreamsWaitEvents() with the free()'d PAL_HANDLE still in the list.

@stefanberger I don't understand what you're saying here. The IPC helper thread and the Async helper thread are two different threads doing two different things. The IPC helper thread never touches async_list.

If you can explain with the links to specific code lines, this will be better. Currently I'm lost in your explanation.

The issue seems to be only related to ioctl(..., FIOASYNC, ...).

I looked at FIOASYNC implementation and the Async helper thread implementation:

So the object field -- which points to the PAL handle (of e.g. the pipe) -- is used here in the Async thread:

So IIUC, that's the situation with the failure:

  1. The app called ioctl(..., FIOASYNC, ...) on a pipe
  2. This adds the pipe's PAL handle to async_list and ultimately to the PalStreamsWaitEvents() list of wait-on-pal-handles.
  3. The Async helper thread is doing its own stuff with waiting for events
  4. The app is terminated (or the pipe is closed, doesn't matter)
  5. The app thread destroys the pipe object
  6. The Async helper thread now has a dangling pipe's PAL handle in its lists.
  7. The Async helper thread at some point continues execution and blows up on the assert.

TLDR: The problem is that at steps 1-2, we didn't increment the ref count of the pipe object. In particular, this line is bad: https://github.com/gramineproject/gramine/blob/1f72aaf8148a7712c5c8bad1da9839e2bcc5c633/libos/src/sys/libos_ioctl.c#L106-L108

But that's not all. Just adding proper refcounting to this case may also be wrong. It looks like our FIOASYNC is just totally bogus, we don't even set up the fd parameter of the SIGIO signal.

dimakuv commented 8 months ago

Alternative solution for now -- to prevent these "assert failed" cases -- is to remove this logic completely from Gramine. It's clearly unused, and it only introduces bugs into our code base.

I looked more into it, and it's used by Nginx. Nginx spawns child processes and uses FIOASYNC generated SIGIO signals as a means of communication between the master process and the worker child processes.

Without FIOASYNC implemented in Gramine, that's what I get in Nginx error log:

2024/01/09 08:37:33 [alert] 1#0: ioctl(FIOASYNC) failed while spawning "worker process" (22: Invalid argument)
2024/01/09 08:37:33 [alert] 1#0: ioctl(FIOASYNC) failed while spawning "worker process" (22: Invalid argument)
2024/01/09 08:37:33 [alert] 1#0: ioctl(FIOASYNC) failed while spawning "worker process" (22: Invalid argument)
2024/01/09 08:37:33 [alert] 1#0: ioctl(FIOASYNC) failed while spawning "worker process" (22: Invalid argument)