golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

runtime: build fails when run via QEMU for linux/amd64 running on linux/arm64 #69255

Open myitcv opened 2 months ago

myitcv commented 2 months ago

Go version

go version go1.23.0 linux/arm64

Output of go env in your module/workspace:

$ go env
GO111MODULE=''
GOARCH='arm64'
GOBIN=''
GOCACHE='/home/myitcv/.cache/go-build'
GOENV='/home/myitcv/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/myitcv/gostuff/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/myitcv/gostuff'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/home/myitcv/gos'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='local'
GOTOOLDIR='/home/myitcv/gos/pkg/tool/linux_arm64'
GOVCS=''
GOVERSION='go1.23.0'
GODEBUG=''
GOTELEMETRY='on'
GOTELEMETRYDIR='/home/myitcv/.config/go/telemetry'
GCCGO='gccgo'
GOARM64='v8.0'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/home/myitcv/tmp/dockertests/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build810191502=/tmp/go-build -gno-record-gcc-switches'

What did you do?

Given:

-- Dockerfile --
FROM golang:1.23.0

WORKDIR /app
COPY . ./

RUN go build -o asdf ./blah

-- blah/main.go --
package main

func main() {

}
-- go.mod --
module mod.example

go 1.23.0

Running:

docker buildx build --platform linux/amd64 .

What did you see happen?

[+] Building 0.8s (8/8) FINISHED                                                                                                                                 docker-container:container-builder
 => [internal] load build definition from Dockerfile                                                                                                                                           0.0s
 => => transferring dockerfile: 110B                                                                                                                                                           0.0s
 => [internal] load metadata for docker.io/library/golang:1.23.0                                                                                                                               0.4s
 => [internal] load .dockerignore                                                                                                                                                              0.0s
 => => transferring context: 2B                                                                                                                                                                0.0s
 => [internal] load build context                                                                                                                                                              0.0s
 => => transferring context: 271B                                                                                                                                                              0.0s
 => CACHED [1/4] FROM docker.io/library/golang:1.23.0@sha256:613a108a4a4b1dfb6923305db791a19d088f77632317cfc3446825c54fb862cd                                                                  0.0s
 => => resolve docker.io/library/golang:1.23.0@sha256:613a108a4a4b1dfb6923305db791a19d088f77632317cfc3446825c54fb862cd                                                                         0.0s
 => [2/4] WORKDIR /app                                                                                                                                                                         0.0s
 => [3/4] COPY . ./                                                                                                                                                                            0.0s
 => ERROR [4/4] RUN go build -o asdf ./blah                                                                                                                                                    0.3s
------
 > [4/4] RUN go build -o asdf ./blah:
0.268 runtime: lfstack.push invalid packing: node=0xffffa45142c0 cnt=0x1 packed=0xffffa45142c00001 -> node=0xffffffffa45142c0
0.268 fatal error: lfstack.push
0.270
0.270 runtime stack:
0.270 runtime.throw({0xaf644d?, 0x0?})
0.271   runtime/panic.go:1067 +0x48 fp=0xc000231f08 sp=0xc000231ed8 pc=0x471228
0.271 runtime.(*lfstack).push(0xffffa45040b8?, 0xc0005841c0?)
0.271   runtime/lfstack.go:29 +0x125 fp=0xc000231f48 sp=0xc000231f08 pc=0x40ef65
0.271 runtime.(*spanSetBlockAlloc).free(...)
0.271   runtime/mspanset.go:322
0.271 runtime.(*spanSet).reset(0xfe7680)
0.271   runtime/mspanset.go:264 +0x79 fp=0xc000231f78 sp=0xc000231f48 pc=0x433559
0.271 runtime.finishsweep_m()
0.272   runtime/mgcsweep.go:257 +0x8d fp=0xc000231fb8 sp=0xc000231f78 pc=0x4263ad
0.272 runtime.gcStart.func2()
0.272   runtime/mgc.go:702 +0xf fp=0xc000231fc8 sp=0xc000231fb8 pc=0x46996f
0.272 runtime.systemstack(0x0)
0.272   runtime/asm_amd64.s:514 +0x4a fp=0xc000231fd8 sp=0xc000231fc8 pc=0x4773ca
...

My setup: the host machine is linux/arm64 with QEMU installed, and I am following the approach described at https://docs.docker.com/build/building/multi-platform/#qemu to build for linux/amd64.

This has definitely worked in the past, which suggests that something other than Go has changed or broken here. However, I note the virtually identical call stack reported in https://github.com/golang/go/issues/54104, hence raising it here in the first instance.

What did you expect to see?

Successful run of docker build.

gabyhelp commented 2 months ago

Related Issues and Documentation


dmitshur commented 2 months ago

Do you think this is similar or related to issue #68976? (It wasn't listed in the comment above, but it feels similar from a quick initial look.)

CC @prattmic, @matloob.

myitcv commented 2 months ago

Do you think this is similar or related to issue #68976?

I don't know, I'm afraid. That said, the stack trace and symptoms seem quite different. I will, however, defer to @prattmic.

prattmic commented 2 months ago

I agree, it looks quite different. #68976 is very specific to pidfd use in os/syscall. This looks like some form of corruption.

Do you know if this build is running a full Linux kernel in a VM, or using QEMU user mode Linux emulation?

prattmic commented 2 months ago

0.268 runtime: lfstack.push invalid packing: node=0xffffa45142c0 cnt=0x1 packed=0xffffa45142c00001 -> node=0xffffffffa45142c0

Notice

node=0xffffa45142c0       # before
node=0xffffffffa45142c0   # after

This seems like a sign extension issue when right shifting the packed value (See https://cs.opensource.google/go/go/+/master:src/runtime/lfstack.go;l=26-30, specifically lfstackUnpack).

I could imagine this being a code generation issue, or an issue in QEMU instruction emulation.

cc @golang/compiler
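
To make the before/after values concrete, here is a minimal sketch of a pack/unpack pair that reproduces the numbers in the log line. The constants and the helper names `pack` and `unpack` are assumptions modeled on the 64-bit layout in runtime/lfstack.go (48 address bits, counter in the low bits, arithmetic shift on unpack); it is not the runtime's actual code.

```go
package main

import "fmt"

// Assumed layout, modeled on the runtime's 64-bit lfstack packing:
// the address sits in the high bits and a counter in the low bits.
const (
	addrBits = 48
	cntBits  = 64 - addrBits + 3 // nodes are 8-byte aligned, so 3 low address bits are reused
)

// pack (hypothetical helper) shifts the address up and ORs in the counter.
func pack(node, cnt uint64) uint64 {
	return node<<(64-addrBits) | cnt&(1<<cntBits-1)
}

// unpack (hypothetical helper) uses a signed right shift, so the top bit
// of the packed value is copied into the upper bits of the result.
func unpack(val uint64) uint64 {
	return uint64(int64(val) >> cntBits << 3)
}

func main() {
	// Address from the crash log: bit 47 is set.
	node := uint64(0xffffa45142c0)
	packed := pack(node, 1)
	fmt.Printf("node=%#x packed=%#x -> node=%#x\n", node, packed, unpack(packed))

	// A typical-looking Go heap address from the same stack trace.
	heap := uint64(0xc0005841c0)
	fmt.Printf("node=%#x round-trips: %t\n", heap, unpack(pack(heap, 1)) == heap)
}
```

In this sketch the mismatch comes entirely from the signed shift in `unpack`: any input address with bit 47 set comes back with its upper 16 bits filled with ones, while addresses below that bit round-trip unchanged.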

prattmic commented 2 months ago

Does the same issue occur on Go 1.22?

myitcv commented 2 months ago

Does the same issue occur on Go 1.22?

Yes. Indeed similar looking stacks for 1.21.13, 1.22.6, 1.23.0. Confirmed via:

cat <<'EOD' > template.txtar
-- Dockerfile --
FROM golang:$GOVERSION

WORKDIR /app
COPY . ./

RUN go build -o asdf ./blah

-- blah/main.go --
package main

func main() {

}
-- go.mod --
module mod.example

go $GOVERSION
EOD
for i in 1.23.0 1.22.6 1.21.13
do
        mkdir $i
        pushd $i > /dev/null
        cat ../template.txtar | GOVERSION=$i envsubst | txtar-x
        docker buildx build --platform linux/amd64 . > output 2>&1
        popd > /dev/null
done
cat */output

myitcv commented 2 months ago

I'm miles out of my depth here, but in case this is useful:

$ qemu-amd64-static --version
qemu-x86_64 version 9.0.2 (Debian 1:9.0.2+ds-2+b1)
Copyright (c) 2003-2024 Fabrice Bellard and the QEMU Project developers

myitcv commented 2 months ago

... but just to be super clear, I'm doing this via Docker:

https://docs.docker.com/build/building/multi-platform/#qemu

(so I'm actually unsure whether the host system qemu is used or not)

prattmic commented 2 months ago

I will see if I can reproduce when I get a chance.

As a workaround, do you actually need to do linux-amd64 builds via QEMU emulation? Go can cross-compile well on its own, though perhaps you have cgo dependencies that make it difficult?

mvdan commented 2 months ago

We did end up with a two-stage Dockerfile where the builder is on the host platform, cross-compiles to the target platform without cgo, and then the second stage builds an image for the target platform. So while we are not blocked by this bug as there's a workaround, it's probably worth keeping it open for a fix.

stsquad commented 2 months ago

We did some investigation for https://gitlab.com/qemu-project/qemu/-/issues/2560, and we suspect the fault comes down to aarch64 only having 47 or 39 bits of address space while the x86_64 GC assumes 48 bits. Under linux-user emulation we are limited by the host address space. However, I do note 48 was chosen for all arches, so I wonder how this works on native aarch64 builds of Go?

prattmic commented 2 months ago

Thanks for taking a look!

cc @mknyszek who can speak more definitively about the address space layout, but I don't think a smaller address space should be a problem. Go is pretty lenient about what it gets from mmap. I don't think we ever demand to be able to get a mapping with the 47th bit set.

If you haven't already seen it, take a look at https://github.com/golang/go/issues/69255#issuecomment-2329736628. My suspicion is that this is some sort of sign-extension bug given the only difference between the expected and actual output is the value of the upper bits.

prattmic commented 2 months ago

That said, on further thought, the input address 0xffffa45142c0 does look pretty weird. That isn't a typical heap address (the other addresses in the stack trace, e.g., sp=0xc000231ed8 do look like typical Go heap addresses), so I wonder how we got this one?

cherrymui commented 2 months ago

The comment at https://cs.opensource.google/go/go/+/master:src/runtime/malloc.go;l=149-210 describes the heap address layout. We do use smaller address spaces on a few platforms, e.g. ios/arm64 is 40-bit, but the bits are set as constants, so they would probably apply equally to a native build and to QEMU. (We could consider a qemu build tag?)

prattmic commented 2 months ago

Yes, we configure a larger heap address layout, but will anything break if the OS simply never returns addresses in the upper range? There isn't a case I can think of, provided our biggest mappings fit in the restricted address space. (Notice that amd64 configures a 48-bit address space, even though Linux will only return addresses in the lower 47 bits.)

In gVisor, we would restrict the Go runtime to a 39-bit region of address space without problem or modification to the Go runtime.

cherrymui commented 2 months ago

I think nothing would break if the OS never returns high addresses. The heapAddrBits is an upper limit, I think.

stsquad commented 2 months ago

Are there any runes for running the Go test cases? (Nothing jumped out at me.) If we can trigger the failure with a direct test case rather than deep in a Docker image, we can take a look at verifying the instruction behaviour.

prattmic commented 2 months ago

I have not personally reproduced this, but in https://github.com/golang/go/issues/69255#issuecomment-2329869813 it is the compiler itself crashing, so theoretically it should reproduce by:

  1. Download a copy of Go and extract somewhere (which I'll call $EXTRACT_DIR): https://go.dev/dl/
  2. Create a folder containing go.mod and main.go:

go.mod:

module example.com/app

go 1.23.1

main.go:

package main
func main() {}

  3. In the directory with go.mod/main.go, run $EXTRACT_DIR/bin/go build.

This will hopefully crash somewhere in the toolchain/compiler.

That said, go build does invoke multiple subprocesses, which I imagine could make debugging annoying. If you want literally just a single binary, you could try building a single test binary:

From outside QEMU (on any type of host), run GOOS=linux GOARCH=amd64 go test -c sort. This will build a sort.test linux-amd64 binary that contains the unit tests for the sort standard library package. I selected that package mostly arbitrarily: it is fairly complex so I hope it will trigger the bug and it has no dependency on external testdata files.

sort.test is a standalone, statically-linked binary, so you can copy it wherever and just run it. I do recommend running it as ./sort.test -test.count=10 so that it runs long enough to trigger the GC.
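
If an even smaller single binary is wanted, a program along the following lines might serve. It is only a sketch: it forces several GC cycles, on the assumption that the sweep-finishing path in the stack trace (runtime.finishsweep_m, reached from runtime.gcStart) runs at the start of each cycle. Whether it actually triggers the crash under QEMU has not been confirmed here, and `churn` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"runtime"
)

// churn (hypothetical helper) allocates some garbage and forces a GC
// cycle on every round, then returns the number of completed cycles.
func churn(rounds int) uint32 {
	var sink [][]byte
	for i := 0; i < rounds; i++ {
		sink = sink[:0]
		for j := 0; j < 1024; j++ {
			sink = append(sink, make([]byte, 4096))
		}
		runtime.GC() // blocks until the collection has finished
	}
	_ = sink

	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	return ms.NumGC
}

func main() {
	fmt.Println("completed GC cycles:", churn(10))
}
```

Built with GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build, this should yield a statically-linked binary that can be copied into the emulated environment in the same way as sort.test.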