Open mattrobenolt opened 2 years ago
cc @ianlancetaylor @neild
We reproduced this with Go 1.18.4 and Go 1.16.5 on M1 and Intel machines running OS X 12.3-12.5 with two different Go applications. All affected users had the fe80::* IPv6 DNS address in their nameservers. The behaviour was always what the OP described as the extreme circumstances: a hung process that ignores signals, eventual network disconnection, and then a frozen machine needing a forced reboot.
Changing DNS servers on the network connection to drop the fe80::* address resolved it. As a mitigation we also started to build OS X binaries on OS X machines with Cgo enabled to force the use of the Cgo resolver. We lost several days debugging this ourselves and when I came to report this I found this issue already. Is there any update or ETA on when investigation might take place? This seems fairly critical.
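To check whether a machine is in the affected configuration, something like the following might help. It is only a sketch: it assumes the nameservers of interest are listed in /etc/resolv.conf, and the function name linkLocalNameservers is made up for illustration. It relies on netip.ParseAddr and Addr.IsLinkLocalUnicast from the standard library, so it needs Go 1.18 or later.

```go
package main

import (
	"bufio"
	"fmt"
	"net/netip"
	"os"
	"strings"
)

// linkLocalNameservers scans resolv.conf-style text and returns every
// "nameserver" entry whose address is an IPv6 link-local unicast address
// (fe80::/10), with or without a zone such as %en0.
// The name is hypothetical; it is not part of any Go API.
func linkLocalNameservers(conf string) []string {
	var found []string
	sc := bufio.NewScanner(strings.NewReader(conf))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 2 || fields[0] != "nameserver" {
			continue // comments, search/options lines, blank lines
		}
		addr, err := netip.ParseAddr(fields[1])
		if err != nil {
			continue // not a parseable IP address
		}
		if addr.Is6() && addr.IsLinkLocalUnicast() {
			found = append(found, fields[1])
		}
	}
	return found
}

func main() {
	data, err := os.ReadFile("/etc/resolv.conf")
	if err != nil {
		fmt.Println("could not read /etc/resolv.conf:", err)
		return
	}
	for _, ns := range linkLocalNameservers(string(data)) {
		fmt.Println("link-local IPv6 nameserver:", ns)
	}
}
```

If this prints anything, the machine has the kind of nameserver entry that the affected users in this thread had.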
One possible way to confound a reproduction on darwin_amd64 is if a program is being built for both darwin_amd64 and darwin_arm64 on a build worker whose native toolchain is the darwin_amd64 one. In that situation the default for CGO_ENABLED is different between the two targets: darwin_arm64 is considered to be a cross-compile and therefore CGO_ENABLED defaults to off.
I wrote some more about this in a comment on the general issue about the pure Go resolver on macOS. The short version is that if you want to reproduce it reliably it's important to explicitly set CGO_ENABLED=0 or CGO_ENABLED=1 consistently across builds for both architectures, and not rely on the default. That will then ensure that both of your builds are using the same resolver implementation.
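One way to confirm which setting a given binary was actually built with, rather than trusting the build script, is to have the program report it. A minimal sketch, assuming Go 1.18 or later, where debug.ReadBuildInfo records the CGO_ENABLED build setting; describeCgo is a hypothetical helper name:

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// describeCgo turns the recorded CGO_ENABLED build setting into a short
// human-readable label. Hypothetical helper, for illustration only.
func describeCgo(value string) string {
	switch value {
	case "1":
		return "built with cgo enabled"
	case "0":
		return "built with cgo disabled"
	default:
		return "CGO_ENABLED not recorded in build info"
	}
}

func main() {
	value := ""
	// ReadBuildInfo returns the settings embedded in the running binary.
	if info, ok := debug.ReadBuildInfo(); ok {
		for _, s := range info.Settings {
			if s.Key == "CGO_ENABLED" {
				value = s.Value
			}
		}
	}
	fmt.Printf("%s/%s: %s\n", runtime.GOOS, runtime.GOARCH, describeCgo(value))
}
```

The same information can be read from an existing binary with go version -m, which prints the embedded build settings.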
I'm experiencing this issue too using Terraform (which is written in Go). It's a very bad bug: my whole network stack crashes and I'm forced to:
Check https://github.com/hashicorp/terraform/issues/31467#issuecomment-1245176316
It is worth getting in touch with the macOS network developers, because even if there is some issue in the Go implementation, the network stack shouldn't crash so dramatically that a reboot is the only option.
cc @golang/darwin
I'm using WSL and faced issues with this, so it's not restricted to Macs :(
Hi @pacorreia,
Given that everything so far has suggested a bad interaction between the pure Go resolver and macOS, I think it would help if you could reproduce the bug template in a comment here, with similar content to the original issue comment but describing the experience on your own system, to help recognize whether you've found another instance of the same problem or if you've found a different problem with similar symptoms.
@apparentlymart Thanks for your suggestion. Since this is being experienced through Terraform, I will open an issue with them following the guidelines of the Go project, and they can then get in touch with you if necessary.
What I've found that may be of interest is that building a binary with CGO_ENABLED=0 go build . works, while setting CGO_ENABLED=1 produces a binary that is unable to resolve DNS: it always goes to IPv6, regardless of whether the machine has it enabled or disabled.
If enabling CGo is what breaks it for you in WSL then that sounds like the opposite of this issue so far: this issue has been discussing a variation of the classic problem that the C library is basically required to do correct name resolution on macOS, and so disabling CGo seems to be one of the requirements to reproduce this on macOS.
(And, FWIW, the Terraform team has tried to correct this by fixing the build system bug that was causing the darwin_arm64 builds to have CGo disabled, so that all Darwin builds will have CGo enabled moving forward rather than just the amd64 arch as before. Hasn't really been long enough yet to determine if that actually fixed it, but in the bigger picture -- outside of Terraform's needs -- seems like disabling the pure Go resolver doesn't change the fact that the pure Go resolver is apparently doing something strange on macOS.)
After some extra debugging: yes, CGO_ENABLED=1 actually makes it run, while CGO_ENABLED=0 doesn't, and HashiCorp is still building the binaries with CGO_ENABLED=0.
Many thanks for your previous answer
I'm using an M1 Mac on Ventura 13.5 (22G74). The local DNS issue actually affects my nslookup command, but it didn't affect the Go program you provided, whether CGO is enabled or not.
$ go version
go version go1.20.4 darwin/arm64
$ go env
GO111MODULE=""
GOARCH="arm64"
GOBIN="/Users/ckpn/Mine/projects/golang/bin"
GOCACHE="/Users/ckpn/Library/Caches/go-build"
GOENV="/Users/ckpn/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/ckpn/Mine/projects/golang/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/ckpn/Mine/projects/golang"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/opt/homebrew/Cellar/go/1.20.4/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/opt/homebrew/Cellar/go/1.20.4/libexec/pkg/tool/darwin_arm64"
GOVCS=""
GOVERSION="go1.20.4"
GCCGO="gccgo"
AR="ar"
CC="cc"
CXX="c++"
CGO_ENABLED="0"
GOMOD="/Users/ckpn/Mine/projects/golang/test/dns/go.mod"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch arm64 -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/3g/rxdhvbj96lv68y35t2cvghkr0000gn/T/go-build2449011362=/tmp/go-build -gno-record-gcc-switches -fno-common"
I'm seeing the same behaviour: go version go1.22.2 darwin/arm64
What version of Go are you using (go version)?
This is also reproduced against go1.17.8, so doesn't appear to be new to go1.18.
Does this issue reproduce with the latest release?
Yes.
What operating system and processor architecture are you using (go env)?
macOS 12.3, Intel (amd64)
What did you do?
Making a DNS lookup against an IPv6 link-local address hangs and fails on macOS when using the native (non-CGO) resolver.
This appears to manifest within the non-CGO based DNS resolver, but I believe it's more generally applicable to the network stack itself, it's just easier to reproduce through DNS.
The simplest reproduction we could come up with was this compiled with CGO_ENABLED=0:
Paired with an /etc/resolv.conf of:
I've deduced this down to explicitly any link-local address (fe80::*) and not generically to ipv6. Other ipv6 addresses, both public and private network appear fine.
This behavior is especially concerning because it seems to cascade into larger macOS failures. All DNS seems to start failing on the whole system requiring ultimately a reboot to restore stability.
Given this, I believe this is also a bug within macOS that Go happens to be exploiting somehow.
I personally don't have a macOS machine to iterate on this, but we've deduced this both with customers running our CLI (https://github.com/planetscale/cli) as well as employees internally.
After a bunch of deducing, we've narrowed it down to the link-local address.
Again, when compiled with CGO_ENABLED=1, everything works correctly and is fine. The issue only occurs when compiled with CGO_ENABLED=0, triggering the pure Go DNS resolution path.
We originally thought this was related to arm64 (M1 Macs), but were finally able to reproduce it on Intel.
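The reproduction program itself is not shown above. As a rough sketch of the kind of lookup being described, and not the original code, the following asks for Go's built-in resolver with net.Resolver{PreferGo: true}, so the lookup should take the pure Go path whether or not the binary was built with cgo, and it bounds the lookup with a context timeout so a hang should surface as an error instead. The host name example.com, the 5 second timeout, and the helper name lookupPureGo are all placeholders chosen for illustration.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// defaultTimeout is an arbitrary bound for the lookup (assumption).
const defaultTimeout = 5 * time.Second

// lookupPureGo resolves host using Go's built-in DNS resolver and gives up
// once timeout has elapsed. Hypothetical helper, for illustration only.
func lookupPureGo(host string, timeout time.Duration) ([]string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	// PreferGo requests the pure Go resolver, scoped to this Resolver only.
	r := &net.Resolver{PreferGo: true}
	return r.LookupHost(ctx, host)
}

func main() {
	const host = "example.com" // placeholder name to resolve

	start := time.Now()
	addrs, err := lookupPureGo(host, defaultTimeout)
	fmt.Printf("lookup of %q took %s\n", host, time.Since(start).Round(time.Millisecond))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("addresses:", addrs)
}
```

Given the system-wide failures described in this issue, it is probably wise to run anything like this only on a machine that can be rebooted.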
What did you expect to see?
Fast DNS resolution
What did you see instead?
Program hangs and eventually times out, in extreme cases causing the entire OS to become unstable.
For additional context: https://github.com/planetscale/discussion/discussions/181