Open a-milkyway opened 2 years ago
How do you know it was the panic that took 6 hours?
I ran pstack on gobgpd from time to time during those six hours. No application processing was observed, but the Go runtime was in garbage collection. Here is what I saw (I can probably copy the other threads' data as well, but for conciseness I am providing only this thread). This thread never made any progress before the signal 11.
Thread 26 (Thread 0x7f3ea1ffb700 (LWP 57758)):
Eventually it crashed with a signal 11, but the thread above was still in the same state, never acquiring debuglock.
@a-milkyway Thanks for the report. Can you provide the complete stack information? Also, did you use cgo? GitHub can fold content, so don't worry about posting too much. Thanks again.
Change https://go.dev/cl/427414 mentions this issue: runtime: fix printlock/printunlock deadlock
@hopehook Apologies for the delayed response. We use cgo. I lost the complete stack trace from my local machine; I will try to see if it is stored somewhere and will provide it as soon as I get it.
@a-milkyway Welcome back. Your complete local stack trace could be very important. Since you said the program was suspended for so long, the situation is probably more complicated.
I suspect cgo is the riskiest factor; it may not be handled well.
Hi @hopehook I have a complete crashdump from a recent incident. I am attaching it here. Thank you so much! gobgpd.gz coredump.gz core_backtrace.gz
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
go env Output
What did you do?
An application (gobgpd) encountered a panic. While dumping stack traces, the Go runtime tried to acquire the print lock (printlock), which never became available. This prevented the application from restarting and caused application redundancy to fail.
What did you expect to see?
The panic should have completed in about a minute and the application should have restarted.
What did you see instead?
The panic took six hours.
The runtime printlock function increments mp.locks before taking the lock and decrements it afterward, so that the operation cannot be interrupted by a reschedule, but printunlock has no such guard. If the scheduler preempts printunlock after it decrements mp.printlock but before it actually unlocks, this could cause the issue.
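To make the suspected window easier to reason about, here is a minimal user-space model of the hypothesis. It is not the runtime source: the type m, its fields, the sync.Mutex standing in for debuglock, and the guarded flag are simplified stand-ins chosen for illustration. In this model "locks > 0" means "must not be preempted", and printunlock reports whether the thread would be preemptible at the point between the counter decrement and the mutex release.

```go
package main

import (
	"fmt"
	"sync"
)

// m is a hypothetical stand-in for the runtime's per-thread structure.
type m struct {
	locks     int // in this model, >0 means "do not preempt"
	printlock int // nesting depth of printlock calls on this m
}

// debuglock stands in for the runtime's print mutex.
var debuglock sync.Mutex

// printlock takes the mutex on the outermost call only, holding
// mp.locks up so the counter update and the Lock happen together.
func printlock(mp *m) {
	mp.locks++
	mp.printlock++
	if mp.printlock == 1 {
		debuglock.Lock()
	}
	mp.locks--
}

// printunlock releases the mutex on the outermost call. It returns
// true if, in this model, the thread was preemptible in the window
// between the decrement and the Unlock. With guarded set, mp.locks
// is held up across the whole operation, mirroring printlock.
func printunlock(mp *m, guarded bool) (preemptibleInWindow bool) {
	if guarded {
		mp.locks++
	}
	mp.printlock--
	if mp.printlock == 0 {
		// Window: counter says "not holding", mutex is still held.
		preemptibleInWindow = mp.locks == 0
		debuglock.Unlock()
	}
	if guarded {
		mp.locks--
	}
	return preemptibleInWindow
}

func main() {
	mp := &m{}

	printlock(mp)
	fmt.Println("unguarded, preemptible in window:", printunlock(mp, false))

	printlock(mp)
	fmt.Println("guarded, preemptible in window:", printunlock(mp, true))
}
```

The model deliberately leaves out everything else the real runtime does around its locks, so it only illustrates the shape of the suspected race and of a symmetric guard; it does not show that the real runtime has this window.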
Here is the snippet: