Open dunglas opened 1 year ago
CC @ianlancetaylor.
Currently the Go runtime assumes the C stack for each thread is contiguous and never change. It's stack bounds is recorded in the g0
of the thread. When the C (or PHP) code switches stacks, its SP will not be within the boundary of the original stack, therefore causing morestack on g0
error. This is not surprising.
I don't know how to make it work in the current scheme. Maybe we could have a function that the user code can call to tell the runtime that the C code is going to switch stacks therefore disable stack bounds checks on g0 (e.g. by setting its bounds to [0, ^uintptr(0)])? That would require a new API, though.
@cherrymui how does that explain it mostly crashing in virtual/namespaced contexts (vms and docker) but not natively? Does Go do something differently then?
I don't know why the difference. But whether it crashes is going to depend on exactly where memory is allocated for fiber stacks. Go doesn't do anything different in virtual contexts, but it is possible that in those contexts memory is allocated at different addresses.
AWS for Fluent Bit uses Cgo to compile go code as .so that can be loaded with dlopen by Fluent Bit which is in C: https://github.com/aws/aws-for-fluent-bit
Example go plugin: https://github.com/aws/amazon-cloudwatch-logs-for-fluent-bit
Fluent Bit source: https://github.com/fluent/fluent-bit
fluent-bit_1 | fatal: morestack on g0
Previous Go stable version 1.20.7 did not have this issue.
@ me and I can provide anymore details if useful.
@PettitWesley that looks like a different issue. Does the C code do its own stack management? Does the C code calls into Go, then calls back to C? Thanks.
@cherrymui The C code does its own stack management and has a smaller than normal stack size IIRC, I can find the details if needed.
Does the C code calls into Go
Yea, the C code invokes a function exposed by the Go code. And that's it, there's no call from Go into C AFAIK.
In triage, we think that there isn't a reasonable way we're going to make this work. If C code (or non-Go) is switching the stack, it's going to be a lot of work to make that work with the Go runtime reliably, and we don't plan to work on that any time soon. Apologies.
When we receive a signal, we check for the current stack pointer, and adjust the gsignal stack bounds to match (adjustSignalStack in runtime/signal_unix.go). We could do the same for calls from C to Go: a quick check for whether the stack pointer is within expected bounds, and if not a slower process to fetch the current stack bounds and use them.
Is there a reason why for us the exact opposite happens? https://github.com/php/php-src/issues/12021 I cannot reproduce this on Linux with x86 only in Apple Sillicon...
I expect there is a reason but I doubt it's very interesting. This problem will happen in some cases if a C thread changes its stack location, but whether a problem occurs depends on precisely how the stack location is changed. How the stack location is changed may depend on how the program is built and how the kernel behaves.
Change https://go.dev/cl/525455 mentions this issue: runtime: allow update of system stack bounds
@dunglas could you verify that your application works properly with https://go.dev/cl/525455?
(If you aren't familiar with building the toolchain from a CL, you can use gotip download 525455
: https://pkg.go.dev/golang.org/dl/gotip)
@prattmic Unfortunately it still doesn't work, but the error message changed:
=== RUN TestFiberBasic_module
logger.go:130: 2023-09-08T20:00:51.252Z DEBUG FrankenPHP started
runtime: newstack at runtime.printlock+0x55 sp=0xc0002dbb08 stack=[0xc0002da000, 0xc0002dc000]
morebuf={pc:0x408f85 sp:0xc0002dbb10 lr:0x0}
sched={pc:0x43f175 sp:0xc0002dbb08 lr:0x0 ctxt:0x0}
runtime.callbackUpdateSystemStack(0x408e5b?, 0x7f7bc4dff998, 0x0)
/root/sdk/gotip/src/runtime/cgocall.go:237 +0x85 fp=0xc0002dbb70 sp=0xc0002dbb10 pc=0x408f85
runtime.cgocallbackg(0x447c1c?, 0xc0001a76c0?, 0x300000002?)
/root/sdk/gotip/src/runtime/cgocall.go:287 +0x7a fp=0xc0002dbc00 sp=0xc0002dbb70 pc=0x40919a
runtime.cgocallbackg(0x8a5020, 0x7f7bc4dffa10, 0x0)
:1 +0x29 fp=0xc0002dbc28 sp=0xc0002dbc00 pc=0x4772a9
runtime.cgocallback(0xc0002dbc88, 0x408e85, 0x92fb60)
/root/sdk/gotip/src/runtime/asm_amd64.s:1035 +0xcc fp=0xc0002dbc50 sp=0xc0002dbc28 pc=0x47446c
runtime.systemstack_switch()
/root/sdk/gotip/src/runtime/asm_amd64.s:474 +0x8 fp=0xc0002dbc60 sp=0xc0002dbc50 pc=0x4726e8
runtime.cgocall(0x92fb60, 0xc0002dbcc0)
/root/sdk/gotip/src/runtime/cgocall.go:175 +0x85 fp=0xc0002dbc98 sp=0xc0002dbc60 pc=0x408e85
github.com/***/frankenphp._Cfunc_frankenphp_execute_script(0x7f7bc80256b0)
_cgo_gotypes.go:802 +0x7b fp=0xc0002dbcc0 sp=0xc0002dbc98 pc=0x89c51b
github.com/***/frankenphp.go_execute_script(0x1ca9)
/go/src/app/frankenphp.go:471 +0x2ce fp=0xc0002dbe18 sp=0xc0002dbcc0 pc=0x89f5ce
_cgoexp_1cabc619c1cb_go_execute_script(0x7f7bc7ffea40)
_cgo_gotypes.go:957 +0x36 fp=0xc0002dbe30 sp=0xc0002dbe18 pc=0x8a4e76
runtime.cgocallbackg1(0x8a4e40, 0xc0002dbfe0?, 0x0)
/root/sdk/gotip/src/runtime/cgocall.go:403 +0x2ce fp=0xc0002dbf00 sp=0xc0002dbe30 pc=0x4095ce
runtime.cgocallbackg(0x0?, 0x0?, 0x0?)
/root/sdk/gotip/src/runtime/cgocall.go:319 +0x12d fp=0xc0002dbf90 sp=0xc0002dbf00 pc=0x40924d
runtime.cgocallbackg(0x8a4e40, 0x7f7bc7ffea40, 0x0)
:1 +0x29 fp=0xc0002dbfb8 sp=0xc0002dbf90 pc=0x4772a9
runtime.cgocallback(0x0, 0x0, 0x0)
/root/sdk/gotip/src/runtime/asm_amd64.s:1035 +0xcc fp=0xc0002dbfe0 sp=0xc0002dbfb8 pc=0x47446c
runtime.goexit({})
/root/sdk/gotip/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0002dbfe8 sp=0xc0002dbfe0 pc=0x4746c1
fatal error: runtime: stack split at bad time
runtime stack:
runtime.throw({0xa21a9f?, 0x0?})
/root/sdk/gotip/src/runtime/panic.go:1018 +0x5c fp=0x7f7bc4dff7e8 sp=0x7f7bc4dff7b8 pc=0x43d45c
runtime.newstack()
/root/sdk/gotip/src/runtime/stack.go:995 +0xa36 fp=0x7f7bc4dff998 sp=0x7f7bc4dff7e8 pc=0x459576
traceback: unexpected SPWRITE function runtime.morestack
runtime.morestack()
/root/sdk/gotip/s
snip
https://github.com/dunglas/frankenphp/actions/runs/6125832955/job/16628619273?pr=171
Here is a reproducer allowing to trigger the error in Docker: https://github.com/dunglas/frankenphp/pull/171
(Just take a look at the php-8-2-bookworm, linux/amd64
line of the matrix, other failures are due to the inability to use gotip
when cross-compiling).
Ah, I missed that the initial comment was reporting an issue in 1.20, not 1.21. My CL only fixes the regression in 1.21 (#62440).
To be clear there are two classes here:
This worked in 1.20, though we never really intended to support it. It broke in 1.21, and https://go.dev/cl/525455 fixes it.
I believe this is the case that PHP Fibers are doing. This was not supported on 1.20 and continues to be unsupported even after my CL.
The error you hit is my CL explicitly checking for this case and bailing out.
A follow-up may be able to support this case by allowing stack bounds to change on nested call as long as it is careful to restore the previous stack bounds on return (either before cgocallback returns to C, or after cgocall gets back into Go).
Off the top of my head I can't think of other things that would break from using different system stacks on different Go frames on the same thread, but it is a rather precarious situation, and one I'm still not convinced we should support.
I believe this is the case that PHP Fibers are doing.
This is indeed the case.
one I'm still not convinced we should support
This would help us a lot to have support for this: it's the last blocker to be able to embed PHP in Go web servers with support for 100% of the native features of the language. Maybe could we make this feature (temporarily) opt-in or mark it as unsafe in some way?
Reopening for now, since we have more information, but we still need to make a decision on this.
I don't see any major difficulty if Go calls C, C changes stack, C calls Go, C changes stack back, C returns to Go. In general we don't care where the C stack is as long as it's there. We have to get the bookeeping right but it seems doable.
That said I wouldn't be surprised if we will be expected to support Go calls C, C saves stack somewhere, a different thread picks up the saved C stack, that different thread returns to Go. I don't think that will work because we will have lost the g pointer. Is that a pattern that PHP Fibers expects to work?
As far as I understand, no: Fibers may call Go code (basically to send data to the client), but on the same C thread.
Thanks. That's not quite what I'm asking. What I'm asking is whether it is possible for Go to call into the fiber library on one thread, for that thread to save the current stack, and for another thread to pick up that stack and start running it. In other words, I'm asking whether Fibers are like goroutines, in that they can move between threads. Or do all fibers always run on the thread on which they started?
Fibers always run on the same thread. PHP (as well as libraries used for fibers) doesn't start nor manage threads itself, it lets the web server (or other kind of SAPIs) do it. Every PHP requests (including Fibers they start) are "locked" on the same thread.
OK, thanks, then at least for Fibers my concern is unfounded.
Change https://go.dev/cl/527715 mentions this issue: runtime: allow update of system stack bounds on callback from C thread
Change https://go.dev/cl/530480 mentions this issue: [release-branch.go1.21] runtime: allow update of system stack bounds on callback from C thread
Hi,
if we propose a patch to solve this problem (at least, in the case of PHP Fibers), would you accept it?
I thought this has been fixed by the CLs above? Could you try using Go 1.22rc1 or Go tip? Thanks.
cc @prattmic
@cherrymui the issue persists with gotip: https://github.com/dunglas/frankenphp/actions/runs/7519941503/job/20469065999?pr=171
Now that Go has been fixed, can the Dockerfile, etc.. update to a supported version of Go?
@max-allan-cgr AFAIK this issue hasn't been fixed yet.
Is this still not working? Does anybody understand what the current problem is?
Possibly a dupe of https://github.com/golang/go/issues/62440? Which was recently fixed in all the latest go releases.
This is still not working even with the unmerged CL that fixes https://github.com/golang/go/issues/68285.
Here is a reproducer: https://github.com/dunglas/frankenphp/pull/171
@cherrymui and @withinboredom have described the problem more in depth in https://github.com/golang/go/issues/68285#issuecomment-2229419157
For the record, we tested with the (somewhat) related patch recently merged, and this specific issue is still here.
What version of Go are you using (
go version
)?Does this issue reproduce with the latest release?
Yes.
What operating system and processor architecture are you using (
go env
)?go env
OutputWhat did you do?
FrankenPHP is a Go module that embeds the PHP executor in Go programs such as the Caddy web server, and handles HTTP requests received by "net/http" with PHP scripts executed directly in the Go process, similarly to Apache mod_php.
To do so PHP is compiled as a C library and executed using cgo.
Under certain conditions, if the script uses PHP fibers, the program crashes with the message "fatal: morestack on g0" (see backtraces and reproducer below).
PHP Fibers do low-level stack manipulation and are likely badly interacting Go and cgo own stack management. Under the hood, PHP Fibers can use the boost.context library (by default) or
ucontext.h
(if PHP is compiled with--disable-fiber-asm
). The problem occurs with both implementations. Fibers code is in https://github.com/php/php-src/blob/master/Zend/zend_fibers.c.The problem seems to only randomly occur and only on Linux with x86 processors (we're not able to reproduce this on Mac or on ARM).
The problem seems to occur when the C code executed in a PHP Fiber (with its own stack) calls a Go function using cgo.
We currently have no idea if it's a problem in Go, in PHP, or "just" a bad interaction of their respective stack management systems.
What did you expect to see?
No error.
What did you see instead?
fatal: morestack on g0
Here are some stack traces collected by @slim with GDB:
The issue also always occurs when executing a Fiber in a Docker container running on GitHub Actions, (but oddly, not when running directly in a GHA runner).
A reproducer as well as some extra investigation are available in this PR: https://github.com/dunglas/frankenphp/pull/171
The problem can be reproduced locally by compiling PHP with the appropriate flags needed by FrankenPHP, checking out the
fibers
branch of FrankenPHP and runninggo test -run Fiber
.