google / syzkaller

syzkaller is an unsupervised coverage-guided kernel fuzzer
Apache License 2.0

ipc/ipc.go: sometimes a weird deadlock for every proc #4928

Closed xrivendell7 closed 4 months ago

xrivendell7 commented 4 months ago

Hi. Sorry for reporting an old and weird problem. I'm on an older syzkaller version, https://github.com/google/syzkaller/commit/cb976f63e0177b96eb9ce1c631cc5e2c4b4b0759, which I developed on top of, adding some new functionality for research. After fuzzing for a while I hit a deadlock: the VM/fuzzer stops executing progs entirely, like this:

2024/06/25 23:19:44 VMs 3, executed 28262, cover 22609, signal 23098/60097, crashes 3, repro 0, triageQLen 1716
2024/06/25 23:19:54 VMs 3, executed 28262, cover 22609, signal 23098/60097, crashes 3, repro 0, triageQLen 1716
2024/06/25 23:20:04 VMs 3, executed 28262, cover 22609, signal 23098/60097, crashes 3, repro 0, triageQLen 1716
2024/06/25 23:20:14 VMs 3, executed 28262, cover 22609, signal 23098/60097, crashes 3, repro 0, triageQLen 1716
[... the same stats line repeats every 10 seconds ...]
2024/06/25 23:26:34 VMs 3, executed 28262, cover 22609, signal 23098/60097, crashes 3, repro 0, triageQLen 1716

I debugged this for days. I run 4 procs per VM; below is a debugging session attached to the syz-fuzzer in vm-0.

All the goroutines for syz-fuzzer:

(dlv) goroutines
  Goroutine 1 - User: /home/spark/reply/syzreplaying/syz-fuzzer/fuzzer.go:433 main.(*Fuzzer).pollLoop (0xd73433) [select]
  Goroutine 2 - User: /usr/local/go/src/runtime/proc.go:399 runtime.gopark (0x445a3c) [force gc (idle) 190757904102]
  Goroutine 3 - User: /usr/local/go/src/runtime/proc.go:399 runtime.gopark (0x445a3c) [GC sweep wait]
  Goroutine 4 - User: /usr/local/go/src/runtime/proc.go:399 runtime.gopark (0x445a3c) [GC scavenge wait]
  Goroutine 5 - User: /usr/local/go/src/runtime/proc.go:399 runtime.gopark (0x445a3c) [finalizer wait]
  Goroutine 6 - User: /home/spark/reply/syzreplaying/vendor/go.opencensus.io/stats/view/worker.go:292 go.opencensus.io/stats/view.(*worker).start (0xd0ecc5) [select]
  Goroutine 7 - User: /usr/local/go/src/runtime/proc.go:399 runtime.gopark (0x445a3c) [debug call]
  Goroutine 8 - User: /usr/local/go/src/runtime/proc.go:399 runtime.gopark (0x445a3c) [debug call]
  Goroutine 9 - User: /usr/local/go/src/runtime/proc.go:399 runtime.gopark (0x445a3c) [debug call]
  Goroutine 14 - User: /usr/local/go/src/os/file_posix.go:29 os.(*File).read (0x545158) [IO wait 193820129774]
  Goroutine 15 - User: /usr/local/go/src/runtime/internal/syscall/syscall_linux.go:38 syscall.RawSyscall6 (0x4089ad) (thread 6522)
  Goroutine 17 - User: /usr/local/go/src/runtime/proc.go:399 runtime.gopark (0x445a3c) [debug call 414532511891]
  Goroutine 20 - User: /usr/local/go/src/runtime/proc.go:399 runtime.gopark (0x445a3c) [select 190811604376]
  Goroutine 21 - User: /usr/local/go/src/os/file_posix.go:29 os.(*File).read (0x545158) [IO wait 193820129774]
  Goroutine 22 - User: /usr/local/go/src/runtime/internal/syscall/syscall_linux.go:38 syscall.RawSyscall6 (0x4089ad) (thread 6530)
  Goroutine 43 - User: /usr/local/go/src/runtime/sigqueue.go:152 os/signal.signal_recv (0x4766c9) (thread 6521)
  Goroutine 44 - User: /usr/local/go/src/net/fd_posix.go:55 net.(*netFD).Read (0x6f9f13) [IO wait]
  Goroutine 45 - User: /usr/local/go/src/runtime/sema.go:527 sync.runtime_notifyListWait (0x4762b2) [sync.Cond.Wait 414532511891]
  Goroutine 46 - User: /usr/local/go/src/runtime/sema.go:527 sync.runtime_notifyListWait (0x4762b2) [sync.Cond.Wait]
  Goroutine 47 - User: /usr/local/go/src/runtime/sema.go:527 sync.runtime_notifyListWait (0x4762b2) [sync.Cond.Wait]
  Goroutine 48 - User: /usr/local/go/src/runtime/sema.go:527 sync.runtime_notifyListWait (0x4762b2) [sync.Cond.Wait]
  Goroutine 53 - User: /home/spark/reply/syzreplaying/pkg/osutil/osutil_unix.go:78 github.com/google/syzkaller/pkg/osutil.HandleInterrupts.func1 (0x97bfc6) [chan receive 190811604376]
  Goroutine 54 - User: /home/spark/reply/syzreplaying/syz-fuzzer/fuzzer.go:216 main.main.func1 (0xd722fe) [chan receive 190811604376]
  Goroutine 55 - User: /usr/local/go/src/net/fd_unix.go:172 net.(*netFD).accept (0x6fd106) [IO wait 190811604376]
  Goroutine 5203 - User: /usr/local/go/src/os/file_posix.go:29 os.(*File).read (0x545158) [IO wait]
  Goroutine 5204 - User: /usr/local/go/src/runtime/internal/syscall/syscall_linux.go:38 syscall.RawSyscall6 (0x4089ad) (thread 6535)
  Goroutine 5206 - User: /usr/local/go/src/os/file_posix.go:29 os.(*File).read (0x545158) [IO wait]
  Goroutine 5207 - User: /usr/local/go/src/runtime/internal/syscall/syscall_linux.go:38 syscall.RawSyscall6 (0x4089ad) (thread 6532)
[28 goroutines]

I found that all 4 procs are waiting for fuzzer gate tickets, which is what freezes the syz-fuzzer.

(dlv) p proc.fuzzer.gate
("*github.com/google/syzkaller/pkg/ipc.Gate")(0xc000b94180)
*github.com/google/syzkaller/pkg/ipc.Gate {
        cv: *sync.Cond {
                noCopy: sync.noCopy {},
                L: sync.Locker(*sync.Mutex) ...,
                notify: (*sync.notifyList)(0xc000b94090),
                checker: 824645861552,},
        busy: []bool len: 32, cap: 32, [false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,true,false,false,false,false,false,true,false,false,false,false,false,false,false],
        pos: 24,
        running: 2,
        stop: false,
        f: nil,}
(dlv) p proc.fuzzer.gate.cv
(*sync.Cond)(0xc000b94080)
*sync.Cond {
        noCopy: sync.noCopy {},
        L: sync.Locker(*sync.Mutex) *{state: 0, sema: 0},
        notify: sync.notifyList {wait: 473, notify: 469, lock: 0, head: unsafe.Pointer(0xc005aaa4e0), tail: unsafe.Pointer(0xc006fa7f20)},
        checker: 824645861552,}
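
For context on what those tickets are: in this vintage of pkg/ipc, Gate is a sliding-window concurrency limiter. Below is my condensed paraphrase of its Enter/Leave logic (the stop flag and per-window callback f are omitted; see pkg/ipc/ipc.go in the actual tree for the exact code). The key property is that Enter blocks while the slot the ring has wrapped around to is still held. In the dump above, pos is 24 and busy[24] is true, and the condition variable shows wait: 473 vs notify: 469, i.e. exactly my 4 procs parked with no one left to wake them.

    import "sync"

    // Condensed paraphrase of the old pkg/ipc Gate, not verbatim source.
    type Gate struct {
        cv      *sync.Cond
        busy    []bool // ring of slots, one per in-flight execution "ticket"
        pos     int    // next slot to hand out
        running int    // number of slots currently held
    }

    // Enter takes a ticket, blocking while the slot at the current ring
    // position is still held. If a holder never calls Leave, the ring
    // eventually wraps around to its slot and every later Enter waits forever.
    func (g *Gate) Enter() int {
        g.cv.L.Lock()
        defer g.cv.L.Unlock()
        for g.busy[g.pos] {
            g.cv.Wait()
        }
        idx := g.pos
        g.pos = (g.pos + 1) % len(g.busy)
        g.busy[idx] = true
        g.running++
        return idx
    }

    // Leave returns a ticket and wakes all waiters.
    func (g *Gate) Leave(idx int) {
        g.cv.L.Lock()
        defer g.cv.L.Unlock()
        g.busy[idx] = false
        g.running--
        g.cv.Broadcast()
    }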

However, I don't know why those two slots are never released. I also found the following four pairs of goroutines, one per executor.

syz-executor.0

Goroutine 14 - User: /usr/local/go/src/os/file_posix.go:29 os.(*File).read (0x545158) [IO wait 193820129774]
Goroutine 15 - User: /usr/local/go/src/runtime/internal/syscall/syscall_linux.go:38 syscall.RawSyscall6 (0x4089ad) (thread 6522)

syz-executor.1

Goroutine 21 - User: /usr/local/go/src/os/file_posix.go:29 os.(*File).read (0x545158) [IO wait 193820129774]
Goroutine 22 - User: /usr/local/go/src/runtime/internal/syscall/syscall_linux.go:38 syscall.RawSyscall6 (0x4089ad) (thread 6530)

syz-executor.2

Goroutine 5203 - User: /usr/local/go/src/os/file_posix.go:29 os.(*File).read (0x545158) [IO wait]
Goroutine 5204 - User: /usr/local/go/src/runtime/internal/syscall/syscall_linux.go:38 syscall.RawSyscall6 (0x4089ad) (thread 6535)

syz-executor.3

Goroutine 5206 - User: /usr/local/go/src/os/file_posix.go:29 os.(*File).read (0x545158) [IO wait]
Goroutine 5207 - User: /usr/local/go/src/runtime/internal/syscall/syscall_linux.go:38 syscall.RawSyscall6 (0x4089ad) (thread 6532)

The executing environment (process tree) looks like this:

├─syz-fuzzer,6497 -executor=/syz-executor -name=vm-0 -arch=amd64 -manager=localhost:32893 -sandbox=none -procs=4 -cover=true -debug=false ...
  │   │       │   ├─syz-executor.0,6589 exec
  │   │       │   │   └─syz-executor.0,6593 exec
  │   │       │   ├─syz-executor.1,6592 exec
  │   │       │   │   └─syz-executor.1,6600 exec
  │   │       │   ├─syz-executor.2,24132 exec
  │   │       │   │   └─syz-executor.2,24159 exec
  │   │       │   ├─syz-executor.3,24131 exec
  │   │       │   │   └─syz-executor.3,24133 exec

For example, syz-executor.1's parent process, pid 6592:

#0  0x000056488075e88c in read ()
#1  0x0000564880715497 in receive_execute () at executor/executor.cc:664
#2  0x0000564880717b74 in loop () at executor/common.h:662
#3  0x0000564880717d14 in do_sandbox_none () at executor/common_linux.h:4090
#4  0x0000564880718800 in main (argc=<optimized out>, argv=<optimized out>) at executor/executor.cc:531
#5  0x000056488071a154 in __libc_start_call_main ()
#6  0x000056488071b850 in __libc_start_main_impl ()
#7  0x0000564880705711 in _start ()

and its child, pid 6600:

#0  0x000056488075e88c in read ()
#1  0x0000564880715497 in receive_execute () at executor/executor.cc:664
#2  0x0000564880717b74 in loop () at executor/common.h:662
#3  0x0000564880717d14 in do_sandbox_none () at executor/common_linux.h:4090
#4  0x0000564880718800 in main (argc=<optimized out>, argv=<optimized out>) at executor/executor.cc:531
#5  0x000056488071a154 in __libc_start_call_main ()
#6  0x000056488071b850 in __libc_start_main_impl ()
#7  0x0000564880705711 in _start ()

That is, syz-executor.1's parent 6592 is waiting for its child 6600, which is blocked in receive_execute. Meanwhile, goroutines 14/15 are waiting at c.cmd.Wait() and rp.Read in makeCommand.

I'm curious why the Go side never gets past the handshake so that all the procs can run correctly. The relevant tail of makeCommand:

    wp.Close()
    // Note: we explicitly close inwp before calling handshake even though we defer it above.
    // If we don't do it and executor exits before writing handshake reply,
    // reading from inrp will hang since we hold another end of the pipe open.
    inwp.Close()

    if c.config.UseForkServer {
        if err := c.handshake(); err != nil {
            return nil, err
        }
    }
    tmp := c
    c = nil // disable defer above
    return tmp, nil
}
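
As I understand it, the comment in that snippet is about the classic pipe EOF pitfall. This standalone program (my illustration, not syzkaller code) demonstrates the property the comment relies on:

    package main

    import (
        "fmt"
        "os"
        "os/exec"
    )

    func main() {
        r, w, err := os.Pipe()
        if err != nil {
            panic(err)
        }
        cmd := exec.Command("true") // child exits immediately, writes nothing
        cmd.Stdout = w
        if err := cmd.Start(); err != nil {
            panic(err)
        }
        // Close our copy of the write end. Without this, the Read below
        // blocks forever even after the child exits, because we ourselves
        // still hold a writer on the pipe and the kernel never signals EOF.
        w.Close()
        cmd.Wait()
        buf := make([]byte, 1)
        _, err = r.Read(buf)
        fmt.Println("read returned:", err) // io.EOF once all writers are closed
    }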

I only added another queue to store interesting seeds (progs) and execute them preferentially. I don't know why this would cause such a weird situation. What could be the cause of this problem, and why does everything stop as shown above? I can't figure out the root cause myself; I'll provide more details for debugging if needed. Thanks!
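
Edit: one pattern I'm now auditing my queue code for, as a pure hypothesis (assuming the Enter/Leave API sketched above): any early return between Enter and Leave consumes a slot forever, and two such leaks would match the running: 2 in the dump.

    // Hypothetical leak pattern, not my actual code.
    func executeFromQueue(gate *Gate, run func() error) error {
        ticket := gate.Enter()
        if err := run(); err != nil {
            return err // BUG: gate.Leave(ticket) never runs; slot leaked forever
        }
        gate.Leave(ticket)
        return nil
    }

    // Safe variant: pair every Enter with a deferred Leave.
    func executeFromQueueSafe(gate *Gate, run func() error) error {
        ticket := gate.Enter()
        defer gate.Leave(ticket)
        return run()
    }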

dvyukov commented 4 months ago

Hi @xrivendell7,

That commit is 700 commits behind HEAD, and pkg/ipc and the gate logic no longer exist at all. I would suggest updating to HEAD.

xrivendell7 commented 4 months ago

Hi @dvyukov, I know this happened a long time ago and syz-manager/syz-fuzzer have since been completely refactored. However, it's hard to port my older code to HEAD now. What I mean is: has a similar problem been seen before, or is there a better way to troubleshoot it? This problem has bothered me for a long time, and it happens so often that I'm quite frustrated. Thanks!

dvyukov commented 4 months ago

I don't remember any similar symptoms. Maybe something was fixed at some point; you may check the commits to pkg/ipc after your base commit.
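
For example, something like this lists the candidate commits:

    git log --oneline cb976f63e0177b96eb9ce1c631cc5e2c4b4b0759..HEAD -- pkg/ipc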