Open myitcv opened 2 months ago
Related Issues and Documentation
(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)
Do you think this is this similar or related to issue #68976? (It wasn't listed in the comment above, but it feels similar from a quick initial look.)
CC @prattmic, @matloob.
Do you think this is this similar or related to issue #68976?
I don't know I'm afraid. That said the stack trace and symptoms seem quite different. I will however defer to @prattmic
I agree, it looks quite different. #68976 is very specific to pidfd use in os/syscall. This looks like some form of corruption.
Do you know if this build is running a full Linux kernel in a VM, or using QEMU user mode Linux emulation?
0.268 runtime: lfstack.push invalid packing: node=0xffffa45142c0 cnt=0x1 packed=0xffffa45142c00001 -> node=0xffffffffa45142c0
Notice
node=0xffffa45142c0 # before
node=0xffffffffa45142c0 # after
This seems like a sign extension issue when right shifting the packed value (See https://cs.opensource.google/go/go/+/master:src/runtime/lfstack.go;l=26-30, specifically lfstackUnpack
).
I could imagine this being a code generation issue, or an issue in QEMU instruction emulation.
cc @golang/compiler
Does the same issue occur on Go 1.22?
Does the same issue occur on Go 1.22?
Yes. Indeed similar looking stacks for 1.21.13, 1.22.6, 1.23.0. Confirmed via:
cat <<EOD > template.txtar
-- Dockerfile --
FROM golang:$GOVERSION
WORKDIR /app
COPY . ./
RUN go build -o asdf ./blah
-- blah/main.go --
package main
func main() {
}
-- go.mod --
module mod.example
go $GOVERSION
EOD
for i in 1.23.0 1.22.6 1.21.13
do
mkdir $i
pushd $i > /dev/null
cat ../template.txtar | GOVERSION=$i envsubst | txtar-x
docker buildx build --platform linux/amd64 . > output 2>&1
popd > /dev/null
done
cat */output
I'm miles out of my depth here, but in case this is useful:
$ qemu-amd64-static --version
qemu-x86_64 version 9.0.2 (Debian 1:9.0.2+ds-2+b1)
Copyright (c) 2003-2024 Fabrice Bellard and the QEMU Project developers
... but just to be super clear, I'm doing this via Docker:
https://docs.docker.com/build/building/multi-platform/#qemu
(so I'm actually unsure whether the host system qemu
is used or not)
I will see if I can reproduce when I get a chance.
As a workaround, do you actually need to do linux-amd64 builds via QEMU emulation? Go can cross-compile on its own well, though perhaps you have cgo dependencies that make it difficult?
We did end up with a two-stage Dockerfile where the builder is on the host platform, cross-compiles to the target platform without cgo, and then the second stage builds an image for the target platform. So while we are not blocked by this bug as there's a workaround, it's probably worth keeping it open for a fix.
We did some investigation for: https://gitlab.com/qemu-project/qemu/-/issues/2560 and we suspect the fault comes down to aarch64 only having 47 or 39 bits of address space while the x86_64 GC assume 48 bits. Under linux-user emulation we are limited by the host address space. However I do note 48 was chosen for all arches so I wonder how this works on native aarch64 builds of go?
Thanks for taking a look!
cc @mknyszek who can speak more definitively about the address space layout, but I don't a smaller address space should be a problem. Go is pretty lenient about what it gets from mmap. I don't think we ever demand to be able to get a mapping with the 47th bit set.
If you haven't already seen it, take a look at https://github.com/golang/go/issues/69255#issuecomment-2329736628. My suspicion is that this is some sort of sign-extension bug given the only difference between the expected and actual output is the value of the upper bits.
That said, on further thought, the input address 0xffffa45142c0
does look pretty weird. That isn't a typical heap address (the other addresses in the stack trace, e.g., sp=0xc000231ed8
do look like typical Go heap addresses), so I wonder how we got this one?
https://cs.opensource.google/go/go/+/master:src/runtime/malloc.go;l=149-210 this comment is about the heap address layout. We do use smaller address spaces on a few platforms, e.g. ios/arm64 is 40-bit, but the bits are set as constants so it would probably equally apply to native build and QEMU. (We could consider a qemu build tag?)
Yes, we configure a larger heap address layout, but will anything break if the OS simply never returns addresses in the upper range? There isn't a case I can think of, provided our biggest mappings fit in the restricted address space. (Notice that amd64 configures 48-bit address space, even though Linux will only return addresses in the lower 47 bits)
In gVisor, we would restrict the Go runtime to a 39-bit region of address space without problem or modification to the Go runtime.
I think nothing would break if the OS never returns high addresses. The heapAddrBits is an upper limit, I think.
Are there any runes for running the Go test cases (nothing jumped out at me). If we can trigger the failure with a direct testcase rather than deep in a docker image we can take a look at verifying the instruction behaviour.
I have not personally reproduced, but in https://github.com/golang/go/issues/69255#issuecomment-2329869813 it is the compiler itself crashing, so theoretically it should reproduce by:
$EXTRACT_DIR
): https://go.dev/dl/go.mod
and main.go
:go.mod
:
module example.com/app
go 1.23.1
main.go
:
package main
func main() {}
go.mod
/main.go
, run $EXTRACT_DIR/bin/go build
.This will hopefully crash somewhere in the toolchain/compiler.
That said, go build
does invoke multiple subprocesses, which I imagine could make debugging annoying. If you want literally just a single binary, you could try building a single test binary:
From outside QEMU (on any type of host), run GOOS=linux GOARCH=amd64 go test -c sort
. This will build a sort.test
linux-amd64 binary that contains the unit tests for the sort standard library package. I selected that package mostly arbitrarily: it is fairly complex so I hope it will trigger the bug and it has no dependency on external testdata files.
sort.test
is a standalone, statically-linked binary, so you can copy it wherever and just run it. I do recommend passing ./sort.test -test.count=10
just to make it run long enough to run the GC.
Go version
go version go1.23.0 linux/arm64
Output of
go env
in your module/workspace:What did you do?
Given:
Running:
What did you see happen?
My setup here is my host machine is
linux/arm64
, Qemu installed, following the approach described at https://docs.docker.com/build/building/multi-platform/#qemu, to build forlinux/amd64
.This has definitely worked in the past which leads me to suggest that something other than Go has changed/been broken here. However I note the virtually identical call stack reported in https://github.com/golang/go/issues/54104 hence raising here in the first instance.
What did you expect to see?
Successful run of
docker build
.