golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
124.09k stars 17.68k forks source link

cmd/cgo: "fatal: morestack on g0" when interacting with PHP Fibers #62130

Open dunglas opened 1 year ago

dunglas commented 1 year ago

What version of Go are you using (go version)?

$ go version
go version go1.20.5 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/slim/.cache/go-build"
GOENV="/home/slim/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/slim/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/slim/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org/,direct"
GOROOT="/usr/local/go"
GOSUMDB="[sum.golang.org](http://sum.golang.org/)"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.20.5"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/slim/pub/src/frankenphp/go.mod"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build3097512463=/tmp/go-build -gno-record-gcc-switches"

What did you do?

FrankenPHP is a Go module that embeds the PHP executor in Go programs such as the Caddy web server, and handles HTTP requests received by "net/http" with PHP scripts executed directly in the Go process, similarly to Apache mod_php.

To do so PHP is compiled as a C library and executed using cgo.

Under certain conditions, if the script uses PHP fibers, the program crashes with the message "fatal: morestack on g0" (see backtraces and reproducer below).

PHP Fibers do low-level stack manipulation and are likely badly interacting Go and cgo own stack management. Under the hood, PHP Fibers can use the boost.context library (by default) or ucontext.h (if PHP is compiled with --disable-fiber-asm). The problem occurs with both implementations. Fibers code is in https://github.com/php/php-src/blob/master/Zend/zend_fibers.c.

The problem seems to only randomly occur and only on Linux with x86 processors (we're not able to reproduce this on Mac or on ARM).

The problem seems to occur when the C code executed in a PHP Fiber (with its own stack) calls a Go function using cgo.

We currently have no idea if it's a problem in Go, in PHP, or "just" a bad interaction of their respective stack management systems.

What did you expect to see?

No error.

What did you see instead?

fatal: morestack on g0

Here are some stack traces collected by @slim with GDB:

(gdb) backtrace
#0  0x0000000000470106 in ?? ()
#1  0x000000000046e165 in runtime.morestack () at /usr/local/go/src/runtime/asm_amd64.s:542
#2  0x000000000042b8dc in runtime.(*mheap).alloc.func1 () at /usr/local/go/src/runtime/mheap.go:955
#3  0x000000000046e0e9 in runtime.systemstack () at /usr/local/go/src/runtime/asm_amd64.s:496
#4  0x00007fffa7ffec38 in ?? ()
#5  0x00007fffa4bff770 in ?? ()
#6  0x000000c000480000 in ?? ()
#7  0x00007fffa4bff770 in ?? ()
#8  0x00000000005c6bbd in crosscall2 () at /usr/local/go/src/runtime/cgo/asm_amd64.s:30
#9  0x0000000000758a0e in go_add_header (rh=23, cString=0x0, length=12175376) at _cgo_export.c:135
#10 0x00000000007596ff in frankenphp_send_headers (sapi_headers=0x7fff3801d800) at frankenphp.c:461
#11 frankenphp_send_headers (sapi_headers=0x7fff3801d800) at frankenphp.c:446
#12 0x00007ffff73fc665 in sapi_send_headers () at /home/slim/pub/src/php-franken/main/SAPI.c:883
#13 0x00007ffff734367b in php_header () at /home/slim/pub/src/php-franken/ext/standard/head.c:72
#14 0x00007ffff74086b2 in php_output_header () at /home/slim/pub/src/php-franken/main/output.c:122
#15 0x00007ffff740b208 in php_output_op (op=0, str=0x7fffc6202068 "Fiber 54", len=8) at /home/slim/pub/src/php-franken/main/output.c:1077
#16 0x00007ffff7408ee2 in php_output_write (str=0x7fffc6202068 "Fiber 54", len=8) at /home/slim/pub/src/php-franken/main/output.c:261
#17 0x00007ffff750a407 in ZEND_ECHO_SPEC_TMPVAR_HANDLER () at /home/slim/pub/src/php-franken/Zend/zend_vm_execute.h:14221
#18 0x00007ffff7566b69 in execute_ex (ex=0x7fffc6269070) at /home/slim/pub/src/php-franken/Zend/zend_vm_execute.h:57509
#19 0x00007ffff748c221 in zend_call_function (fci=0x7fffc625bcb8, fci_cache=0x7fffc625bcf8) at /home/slim/pub/src/php-franken/Zend/zend_execute_API.c:949
#20 0x00007ffff75b4fc3 in zend_fiber_execute (transfer=0x7fffa4bfffc0) at /home/slim/pub/src/php-franken/Zend/zend_fibers.c:504
#21 0x00007ffff75b46b2 in zend_fiber_trampoline (data=...) at /home/slim/pub/src/php-franken/Zend/zend_fibers.c:299
#22 0x00007ffff73e772f in make_fcontext () at /home/slim/pub/src/php-franken/Zend/asm/make_x86_64_sysv_elf_gas.S:135
#23 0x00007fffa4c00040 in ?? ()
#24 0x00007fffa4c00000 in ?? ()
#25 0x00007fffa4c00000 in ?? ()
#26 0x0000009200000172 in ?? ()
#27 0x0000000000000000 in ?? ()
[New Thread 0x7fffa07f8700 (LWP 28725)]
fatal: morestack on g0

Thread 11 "thpool-0" received signal SIGTRAP, Trace/breakpoint trap.
[Switching to Thread 0x7fffa3fff700 (LWP 28718)]
0x0000000000471366 in ?? ()
(gdb) backtrace
#0  0x0000000000471366 in ?? ()
#1  0x000000000046f3c5 in runtime.morestack.abi0 ()
#2  0x000000000042c9fc in runtime.(*mheap).alloc.func1 ()
#3  0x000000000046f349 in runtime.systemstack.abi0 ()
#4  0x00007fffa3ffec38 in ?? ()
#5  0x00007fff9f9ff770 in ?? ()
#6  0x000000c000400400 in ?? ()
#7  0x00007fff9f9ff770 in ?? ()
#8  0x00000000005c7e1d in crosscall2 ()
#9  0x000000000070b780 in ?? ()
#10 0x00007fff9f9ff780 in ?? ()
#11 0x0000000000000000 in ?? ()

fatal: morestack on g0

Thread 10 "thpool-0" received signal SIGTRAP, Trace/breakpoint trap.
[Switching to Thread 0x7fffc4918700 (LWP 28744)]
0x0000000000471366 in ?? ()
(gdb) backtrace
#0  0x0000000000471366 in ?? ()
#1  0x000000000046f3c5 in runtime.morestack.abi0 ()
#2  0x000000000042c9fc in runtime.(*mheap).alloc.func1 ()
#3  0x000000000046f349 in runtime.systemstack.abi0 ()
#4  0x00007fffc4917c38 in ?? ()
#5  0x00007fff83fff770 in ?? ()
#6  0x000000c000380400 in ?? ()
#7  0x00007fff83fff770 in ?? ()
#8  0x00000000005c7e1d in crosscall2 ()
#9  0x000000000070b780 in ?? ()
#10 0x00007fff83fff780 in ?? ()
#11 0x0000000000000000 in ?? ()

fatal: morestack on g0

Thread 11 "thpool-1" received signal SIGTRAP, Trace/breakpoint trap.
[Switching to Thread 0x7fffabfff700 (LWP 28775)]
0x0000000000471366 in ?? ()
(gdb) backtrace
#0  0x0000000000471366 in ?? ()
#1  0x000000000046f3c5 in runtime.morestack.abi0 ()
#2  0x000000000042c9fc in runtime.(*mheap).alloc.func1 ()
#3  0x000000000046f349 in runtime.systemstack.abi0 ()
#4  0x00007fffabffec38 in ?? ()
#5  0x00007fff93fff770 in ?? ()
#6  0x000000c000300400 in ?? ()
#7  0x00007fff93fff770 in ?? ()
#8  0x00000000005c7e1d in crosscall2 ()
#9  0x000000000070b780 in ?? ()
#10 0x00007fff93fff780 in ?? ()
#11 0x0000000000000000 in ?? ()

The issue also always occurs when executing a Fiber in a Docker container running on GitHub Actions, (but oddly, not when running directly in a GHA runner).

A reproducer as well as some extra investigation are available in this PR: https://github.com/dunglas/frankenphp/pull/171

The problem can be reproduced locally by compiling PHP with the appropriate flags needed by FrankenPHP, checking out the fibers branch of FrankenPHP and running go test -run Fiber.

dmitshur commented 1 year ago

CC @ianlancetaylor.

cherrymui commented 1 year ago

Currently the Go runtime assumes the C stack for each thread is contiguous and never change. It's stack bounds is recorded in the g0 of the thread. When the C (or PHP) code switches stacks, its SP will not be within the boundary of the original stack, therefore causing morestack on g0 error. This is not surprising.

I don't know how to make it work in the current scheme. Maybe we could have a function that the user code can call to tell the runtime that the C code is going to switch stacks therefore disable stack bounds checks on g0 (e.g. by setting its bounds to [0, ^uintptr(0)])? That would require a new API, though.

withinboredom commented 1 year ago

@cherrymui how does that explain it mostly crashing in virtual/namespaced contexts (vms and docker) but not natively? Does Go do something differently then?

ianlancetaylor commented 1 year ago

I don't know why the difference. But whether it crashes is going to depend on exactly where memory is allocated for fiber stacks. Go doesn't do anything different in virtual contexts, but it is possible that in those contexts memory is allocated at different addresses.

PettitWesley commented 1 year ago

AWS for Fluent Bit uses Cgo to compile go code as .so that can be loaded with dlopen by Fluent Bit which is in C: https://github.com/aws/aws-for-fluent-bit

Example go plugin: https://github.com/aws/amazon-cloudwatch-logs-for-fluent-bit

Fluent Bit source: https://github.com/fluent/fluent-bit

fluent-bit_1         | fatal: morestack on g0

Previous Go stable version 1.20.7 did not have this issue.

@ me and I can provide anymore details if useful.

cherrymui commented 1 year ago

@PettitWesley that looks like a different issue. Does the C code do its own stack management? Does the C code calls into Go, then calls back to C? Thanks.

PettitWesley commented 1 year ago

@cherrymui The C code does its own stack management and has a smaller than normal stack size IIRC, I can find the details if needed.

Does the C code calls into Go

Yea, the C code invokes a function exposed by the Go code. And that's it, there's no call from Go into C AFAIK.

mknyszek commented 1 year ago

In triage, we think that there isn't a reasonable way we're going to make this work. If C code (or non-Go) is switching the stack, it's going to be a lot of work to make that work with the Go runtime reliably, and we don't plan to work on that any time soon. Apologies.

ianlancetaylor commented 1 year ago

When we receive a signal, we check for the current stack pointer, and adjust the gsignal stack bounds to match (adjustSignalStack in runtime/signal_unix.go). We could do the same for calls from C to Go: a quick check for whether the stack pointer is within expected bounds, and if not a slower process to fetch the current stack bounds and use them.

leonardocustodio commented 1 year ago

Is there a reason why for us the exact opposite happens? https://github.com/php/php-src/issues/12021 I cannot reproduce this on Linux with x86 only in Apple Sillicon...

ianlancetaylor commented 1 year ago

I expect there is a reason but I doubt it's very interesting. This problem will happen in some cases if a C thread changes its stack location, but whether a problem occurs depends on precisely how the stack location is changed. How the stack location is changed may depend on how the program is built and how the kernel behaves.

gopherbot commented 1 year ago

Change https://go.dev/cl/525455 mentions this issue: runtime: allow update of system stack bounds

prattmic commented 1 year ago

@dunglas could you verify that your application works properly with https://go.dev/cl/525455?

(If you aren't familiar with building the toolchain from a CL, you can use gotip download 525455: https://pkg.go.dev/golang.org/dl/gotip)

dunglas commented 1 year ago

@prattmic Unfortunately it still doesn't work, but the error message changed:

Error logs === RUN TestFiberBasic_module logger.go:130: 2023-09-08T20:00:51.252Z DEBUG FrankenPHP started runtime: newstack at runtime.printlock+0x55 sp=0xc0002dbb08 stack=[0xc0002da000, 0xc0002dc000] morebuf={pc:0x408f85 sp:0xc0002dbb10 lr:0x0} sched={pc:0x43f175 sp:0xc0002dbb08 lr:0x0 ctxt:0x0} runtime.callbackUpdateSystemStack(0x408e5b?, 0x7f7bc4dff998, 0x0) /root/sdk/gotip/src/runtime/cgocall.go:237 +0x85 fp=0xc0002dbb70 sp=0xc0002dbb10 pc=0x408f85 runtime.cgocallbackg(0x447c1c?, 0xc0001a76c0?, 0x300000002?) /root/sdk/gotip/src/runtime/cgocall.go:287 +0x7a fp=0xc0002dbc00 sp=0xc0002dbb70 pc=0x40919a runtime.cgocallbackg(0x8a5020, 0x7f7bc4dffa10, 0x0) :1 +0x29 fp=0xc0002dbc28 sp=0xc0002dbc00 pc=0x4772a9 runtime.cgocallback(0xc0002dbc88, 0x408e85, 0x92fb60) /root/sdk/gotip/src/runtime/asm_amd64.s:1035 +0xcc fp=0xc0002dbc50 sp=0xc0002dbc28 pc=0x47446c runtime.systemstack_switch() /root/sdk/gotip/src/runtime/asm_amd64.s:474 +0x8 fp=0xc0002dbc60 sp=0xc0002dbc50 pc=0x4726e8 runtime.cgocall(0x92fb60, 0xc0002dbcc0) /root/sdk/gotip/src/runtime/cgocall.go:175 +0x85 fp=0xc0002dbc98 sp=0xc0002dbc60 pc=0x408e85 github.com/***/frankenphp._Cfunc_frankenphp_execute_script(0x7f7bc80256b0) _cgo_gotypes.go:802 +0x7b fp=0xc0002dbcc0 sp=0xc0002dbc98 pc=0x89c51b github.com/***/frankenphp.go_execute_script(0x1ca9) /go/src/app/frankenphp.go:471 +0x2ce fp=0xc0002dbe18 sp=0xc0002dbcc0 pc=0x89f5ce _cgoexp_1cabc619c1cb_go_execute_script(0x7f7bc7ffea40) _cgo_gotypes.go:957 +0x36 fp=0xc0002dbe30 sp=0xc0002dbe18 pc=0x8a4e76 runtime.cgocallbackg1(0x8a4e40, 0xc0002dbfe0?, 0x0) /root/sdk/gotip/src/runtime/cgocall.go:403 +0x2ce fp=0xc0002dbf00 sp=0xc0002dbe30 pc=0x4095ce runtime.cgocallbackg(0x0?, 0x0?, 0x0?) /root/sdk/gotip/src/runtime/cgocall.go:319 +0x12d fp=0xc0002dbf90 sp=0xc0002dbf00 pc=0x40924d runtime.cgocallbackg(0x8a4e40, 0x7f7bc7ffea40, 0x0) :1 +0x29 fp=0xc0002dbfb8 sp=0xc0002dbf90 pc=0x4772a9 runtime.cgocallback(0x0, 0x0, 0x0) /root/sdk/gotip/src/runtime/asm_amd64.s:1035 +0xcc fp=0xc0002dbfe0 sp=0xc0002dbfb8 pc=0x47446c runtime.goexit({}) /root/sdk/gotip/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0002dbfe8 sp=0xc0002dbfe0 pc=0x4746c1 fatal error: runtime: stack split at bad time runtime stack: runtime.throw({0xa21a9f?, 0x0?}) /root/sdk/gotip/src/runtime/panic.go:1018 +0x5c fp=0x7f7bc4dff7e8 sp=0x7f7bc4dff7b8 pc=0x43d45c runtime.newstack() /root/sdk/gotip/src/runtime/stack.go:995 +0xa36 fp=0x7f7bc4dff998 sp=0x7f7bc4dff7e8 pc=0x459576 traceback: unexpected SPWRITE function runtime.morestack runtime.morestack() /root/sdk/gotip/s snip

https://github.com/dunglas/frankenphp/actions/runs/6125832955/job/16628619273?pr=171

Here is a reproducer allowing to trigger the error in Docker: https://github.com/dunglas/frankenphp/pull/171 (Just take a look at the php-8-2-bookworm, linux/amd64 line of the matrix, other failures are due to the inability to use gotip when cross-compiling).

prattmic commented 1 year ago

Ah, I missed that the initial comment was reporting an issue in 1.20, not 1.21. My CL only fixes the regression in 1.21 (#62440).

To be clear there are two classes here:

  1. Threads created in C that change stacks when there are no Go frames on the stack. e.g.,

This worked in 1.20, though we never really intended to support it. It broke in 1.21, and https://go.dev/cl/525455 fixes it.

  1. Threads changing stacks when there are Go frames on the stack. This could occur on a C or Go thread:

I believe this is the case that PHP Fibers are doing. This was not supported on 1.20 and continues to be unsupported even after my CL.

The error you hit is my CL explicitly checking for this case and bailing out.

A follow-up may be able to support this case by allowing stack bounds to change on nested call as long as it is careful to restore the previous stack bounds on return (either before cgocallback returns to C, or after cgocall gets back into Go).

Off the top of my head I can't think of other things that would break from using different system stacks on different Go frames on the same thread, but it is a rather precarious situation, and one I'm still not convinced we should support.

dunglas commented 1 year ago

I believe this is the case that PHP Fibers are doing.

This is indeed the case.

one I'm still not convinced we should support

This would help us a lot to have support for this: it's the last blocker to be able to embed PHP in Go web servers with support for 100% of the native features of the language. Maybe could we make this feature (temporarily) opt-in or mark it as unsafe in some way?

prattmic commented 1 year ago

Reopening for now, since we have more information, but we still need to make a decision on this.

ianlancetaylor commented 1 year ago

I don't see any major difficulty if Go calls C, C changes stack, C calls Go, C changes stack back, C returns to Go. In general we don't care where the C stack is as long as it's there. We have to get the bookeeping right but it seems doable.

That said I wouldn't be surprised if we will be expected to support Go calls C, C saves stack somewhere, a different thread picks up the saved C stack, that different thread returns to Go. I don't think that will work because we will have lost the g pointer. Is that a pattern that PHP Fibers expects to work?

dunglas commented 1 year ago

As far as I understand, no: Fibers may call Go code (basically to send data to the client), but on the same C thread.

ianlancetaylor commented 1 year ago

Thanks. That's not quite what I'm asking. What I'm asking is whether it is possible for Go to call into the fiber library on one thread, for that thread to save the current stack, and for another thread to pick up that stack and start running it. In other words, I'm asking whether Fibers are like goroutines, in that they can move between threads. Or do all fibers always run on the thread on which they started?

dunglas commented 1 year ago

Fibers always run on the same thread. PHP (as well as libraries used for fibers) doesn't start nor manage threads itself, it lets the web server (or other kind of SAPIs) do it. Every PHP requests (including Fibers they start) are "locked" on the same thread.

ianlancetaylor commented 1 year ago

OK, thanks, then at least for Fibers my concern is unfounded.

gopherbot commented 1 year ago

Change https://go.dev/cl/527715 mentions this issue: runtime: allow update of system stack bounds on callback from C thread

gopherbot commented 1 year ago

Change https://go.dev/cl/530480 mentions this issue: [release-branch.go1.21] runtime: allow update of system stack bounds on callback from C thread

dunglas commented 10 months ago

Hi,

if we propose a patch to solve this problem (at least, in the case of PHP Fibers), would you accept it?

cherrymui commented 10 months ago

I thought this has been fixed by the CLs above? Could you try using Go 1.22rc1 or Go tip? Thanks.

cc @prattmic

dunglas commented 10 months ago

@cherrymui the issue persists with gotip: https://github.com/dunglas/frankenphp/actions/runs/7519941503/job/20469065999?pr=171

max-allan-cgr commented 5 months ago

Now that Go has been fixed, can the Dockerfile, etc.. update to a supported version of Go?

https://github.com/golang/go/issues/62440

dunglas commented 5 months ago

@max-allan-cgr AFAIK this issue hasn't been fixed yet.

ianlancetaylor commented 3 months ago

Is this still not working? Does anybody understand what the current problem is?

aaronbee commented 3 months ago

Possibly a dupe of https://github.com/golang/go/issues/62440? Which was recently fixed in all the latest go releases.

dunglas commented 3 months ago

This is still not working even with the unmerged CL that fixes https://github.com/golang/go/issues/68285.

Here is a reproducer: https://github.com/dunglas/frankenphp/pull/171

@cherrymui and @withinboredom have described the problem more in depth in https://github.com/golang/go/issues/68285#issuecomment-2229419157

dunglas commented 2 days ago

For the record, we tested with the (somewhat) related patch recently merged, and this specific issue is still here.