golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.94k stars 17.66k forks source link

runtime: go runs into a deadlock where a panicing application never acquires the debuglock due to non-atomic unlock #54786

Open a-milkyway opened 2 years ago

a-milkyway commented 2 years ago

What version of Go are you using (go version)?

$ go version
1.17.6

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
Linux 4.14.35-2047.514.5.el7uek.x86_64 #2 SMP Mon Jun 13 21:57:28 PDT 2022 x86_64 x86_64 x86_64 GNU/Linux

What did you do?

An application (gobgpd) encountered a panic and in the process of dumping stack traces go runtime tried to acquire a printLock which never became available preventing the application from a restart and causing application redundancy to fail.

What did you expect to see?

A panic should've completed in about a minute or so and application should've restarted.

What did you see instead?

The panic took six hours.

The runtime printlock function has an additional mp.locks increment/decrement operation to make the operation atomic but no such provision exists during printunlock. If the scheduler pre-empts the printunlock after decrementing printlock but before actually unlocking, this could cause this issue.

Here is the snippet:

func printlock() {
    mp := getg().m
    mp.locks++ // do not reschedule between printlock++ and lock(&debuglock).
    mp.printlock++
    if mp.printlock == 1 {
        lock(&debuglock)
    }
    mp.locks-- // now we know debuglock is held and holding up mp.locks for us.
}

func printunlock() {
    mp := getg().m
    mp.printlock--
    if mp.printlock == 0 {
        unlock(&debuglock)
    }
}
seankhliao commented 2 years ago

how do you know it was the panic that took 6 hours?

a-milkyway commented 2 years ago

I did a pstack on the gobgpd from time to time during those six hours where no application processing was observed but go runtime was in garbage collection, here is what i saw. (i can probably copy other threads data as well, but for conciseness, providing this particular thread). This thread never made any progress before signal 11.

Thread 26 (Thread 0x7f3ea1ffb700 (LWP 57758)):

0 runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:520

1 0x0000000000432f16 in runtime.futexsleep (addr=0xfffffffffffffe00, val=2, ns=4636867) at /usr/local/go/src/runtime/os_linux.go:44

2 0x000000000040cf99 in runtime.lock2 (l=0xfffffffffffffe00) at /usr/local/go/src/runtime/lock_futex.go:107

---Type to continue, or q to quit---

3 0x0000000000438805 in runtime.lockWithRank (l=, rank=) at /usr/local/go/src/runtime/lockrank_off.go:23

4 runtime.lock (l=) at /usr/local/go/src/runtime/lock_futex.go:48

5 runtime.printlock () at /usr/local/go/src/runtime/print.go:71

6 0x0000000000437808 in runtime.dopanic_m (gp=0xc0011e71e0, pc=4419537, sp=139906982587536) at /usr/local/go/src/runtime/panic.go:1389

7 0x0000000000437288 in runtime.fatalthrow.func1 () at /usr/local/go/src/runtime/panic.go:1253

8 0x0000000000437210 in runtime.fatalthrow () at /usr/local/go/src/runtime/panic.go:1250

9 0x0000000000436fd1 in runtime.throw (s=...) at /usr/local/go/src/runtime/panic.go:1198

10 0x00000000004157f0 in runtime.badPointer (s=0x7f3ee83fb978, p=824701829088, refBase=824640469408, refOff=40) at /usr/local/go/src/runtime/mbitmap.go:367

11 0x0000000000415986 in runtime.findObject (p=, refBase=, refOff=) at /usr/local/go/src/runtime/mbitmap.go:409

12 0x0000000000420310 in runtime.scanobject (b=, gcw=0xc000076e98) at /usr/local/go/src/runtime/mgcmark.go:1280

13 0x000000000041fb9a in runtime.gcDrain (gcw=0xc000076e98, flags=7) at /usr/local/go/src/runtime/mgcmark.go:1047

14 0x000000000041cd6e in runtime.gcBgMarkWorker.func2 () at /usr/local/go/src/runtime/mgc.go:1288

15 0x00000000004680e9 in runtime.systemstack () at /usr/local/go/src/runtime/asm_amd64.s:383

16 0x0000000000000000 in ?? ()

Eventually it crashed with a signal 11, but this above thread was still in the same state never acquiring debuglock.

hopehook commented 2 years ago

@a-milkyway Thanks for the report, can you provide the complete stack information. Also, did you use cgo? Github can fold content, don't worry about too much content. Thanks again.

gopherbot commented 2 years ago

Change https://go.dev/cl/427414 mentions this issue: runtime: fix printlock/printunlock deadlock

a-milkyway commented 2 years ago

@hopehook apologize for the delayed response. we use cgo. i lost the complete stack trace from my local machine, will try to see if it is stored somewhere and will provide it as soon as i get it.

hopehook commented 2 years ago

@a-milkyway Welcome back and your local complete stack trace can be very important. As you said the program has been suspended for so long, the situation should be more complicated.

I suspect cgo is the most insecure factor, it may not be handled well.

a-milkyway commented 1 year ago

Hi @hopehook I have a complete crashdump from a recent incident. I am attaching it here. Thank you so much! gobgpd.gz coredump.gz core_backtrace.gz