golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

net: pure Go DNS resolver hangs with ipv6 link-local address on macOS #52839

Open mattrobenolt opened 2 years ago

mattrobenolt commented 2 years ago

What version of Go are you using (go version)?

$ go version
go version go1.18 darwin/amd64

This is also reproduced against go1.17.8, so doesn't appear to be new to go1.18.

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

macOS 12.3, Intel (amd64)

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/xxx/Library/Caches/go-build"
GOENV="/Users/xxx/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/xxx/go/pkg/mod"
GOOS="darwin"
GOPATH="/Users/xxx/go"
GOPROXY="https://proxy.golang.org/,direct"
GOROOT="/usr/local/Cellar/go/1.18/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.18/libexec/pkg/tool/darwin_amd64"
GOVCS=""
GOVERSION="go1.18"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/pb/fk1_tmbn1nq32bgpd6njflcw0000gn/T/go-build3259567292=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

Making a DNS lookup against an IPv6 link-local nameserver address hangs and fails on macOS when using the native (non-CGO) resolver.

This manifests in the non-CGO DNS resolver, but I believe it applies more generally to the network stack itself. It's just easier to reproduce through DNS.

The simplest reproduction we could come up with was this compiled with CGO_ENABLED=0:

package main

import (
    "context"
    "fmt"
    "net"
    "os"
    "time"
)

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
    ips, err := net.DefaultResolver.LookupIP(ctx, "ip", "google.com")
    if err != nil {
        fmt.Fprintf(os.Stderr, "Could not get IPs: %v\n", err)
        os.Exit(1)
    }
    for _, ip := range ips {
        fmt.Printf("google.com. IN A %s\n", ip.String())
    }
}

Paired with an /etc/resolv.conf of:

nameserver fe80::a04e:cfff:fe2f:ad64%en0
nameserver 10.0.0.1

I've narrowed this down specifically to link-local addresses (fe80::*), not IPv6 in general. Other IPv6 addresses, both public and private, appear to work fine.

This behavior is especially concerning because it seems to cascade into larger macOS failures. All DNS on the whole system seems to start failing, ultimately requiring a reboot to restore stability.
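To check whether a machine has the kind of nameserver entry described above, something like the following sketch could be used. The helper name linkLocalNameservers is hypothetical, and the sample text mirrors the /etc/resolv.conf shown earlier. It relies only on net/netip from the standard library, whose ParseAddr accepts a %zone suffix.

```go
package main

import (
	"bufio"
	"fmt"
	"net/netip"
	"strings"
)

// linkLocalNameservers (hypothetical helper) returns the nameserver entries
// in resolv.conf-style text whose address is an IPv6 link-local unicast
// address (fe80::/10). Lines that are not "nameserver <addr>" are skipped.
func linkLocalNameservers(conf string) []string {
	var found []string
	sc := bufio.NewScanner(strings.NewReader(conf))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 2 || fields[0] != "nameserver" {
			continue
		}
		addr, err := netip.ParseAddr(fields[1]) // accepts "fe80::1%en0"
		if err != nil {
			continue
		}
		if addr.Is6() && addr.IsLinkLocalUnicast() {
			found = append(found, fields[1])
		}
	}
	return found
}

func main() {
	// Sample input copied from the resolv.conf in this report.
	conf := "nameserver fe80::a04e:cfff:fe2f:ad64%en0\nnameserver 10.0.0.1\n"
	fmt.Println(linkLocalNameservers(conf))
}
```

In practice the input would come from reading /etc/resolv.conf; the sample string keeps the sketch self-contained.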

Given this, I believe this is also a bug within macOS that Go happens to be exploiting somehow.

I personally don't have a macOS machine to iterate on this, but we've confirmed it both with customers running our CLI (https://github.com/planetscale/cli) and with employees internally.

After a lot of investigation, we've narrowed it down to the link-local address.

Again, when compiled with CGO_ENABLED=1, everything works correctly and is fine. The issue only occurs when compiled with CGO_ENABLED=0, triggering the pure Go DNS resolution path.

We originally thought this was related to arm64 (M1 Macs), but we were finally able to reproduce it on Intel.

What did you expect to see?

Fast DNS resolution

What did you see instead?

The program hangs and eventually times out. In extreme cases, the entire OS becomes unstable.

For additional context: https://github.com/planetscale/discussion/discussions/181

heschi commented 2 years ago

cc @ianlancetaylor @neild

dannyfallon commented 2 years ago

We reproduced this with Go 1.18.4 and Go 1.16.5 on M1 and Intel machines running OS X 12.3-12.5 with two different Go applications. All affected users had the fe80::* IPv6 DNS address in their nameservers. The behaviour was always what the OP described as the extreme circumstances - hung process that ignores signals, eventual network disconnection and then a frozen machine needing forced reboots.

Changing DNS servers on the network connection to drop the fe80::* address resolved it. As a mitigation we also started to build OS X binaries on OS X machines with Cgo enabled to force the use of the Cgo resolver. We lost several days debugging this ourselves and when I came to report this I found this issue already. Is there any update or ETA on when investigation might take place? This seems fairly critical.

apparentlymart commented 2 years ago

One possible way to confound a reproduction on darwin_amd64 is if a program is being built for both darwin_amd64 and darwin_arm64 on a build worker whose native toolchain is the darwin_amd64 one. In that situation the default for CGO_ENABLED differs between the two targets: darwin_arm64 is considered to be a cross-compile and therefore CGO_ENABLED defaults to off.

I wrote some more about this in a comment on the general issue about the pure Go resolver on macOS. The short version is that if you want to reproduce it reliably it's important to explicitly set CGO_ENABLED=0 or CGO_ENABLED=1 consistently across builds for both architectures, and not rely on the default. That will then ensure that both of your builds are using the same resolver implementation.

silvestriluca commented 2 years ago

I'm experiencing this issue too when using Terraform (which is written in Go). It's a very bad bug. My whole network stack crashes and I'm forced to:

Check https://github.com/hashicorp/terraform/issues/31467#issuecomment-1245176316

It would be worth getting in touch with the macOS network developers, because even if there is an issue in the Go implementation, the network stack shouldn't crash so dramatically that a reboot is the only option.

heschi commented 2 years ago

cc @golang/darwin

pacorreia commented 1 year ago

I'm using WSL and ran into this too, so it's not restricted to Macs :(

apparentlymart commented 1 year ago

Hi @pacorreia,

Given that everything so far has suggested a bad interaction between the pure Go resolver and macOS, I think it would help if you could reproduce the bug template in a comment here, with similar content to the original issue comment but describing the experience on your own system, to help recognize whether you've found another instance of the same problem or if you've found a different problem with similar symptoms.

pacorreia commented 1 year ago

@apparentlymart Thanks for your suggestion. Since I'm experiencing this through Terraform, I will open an issue with them following the Go project's guidelines, and they can get in touch with you if necessary.

What I've found that may be of interest is that building a binary with CGO_ENABLED=0 go build . works, while setting CGO_ENABLED=1 produces a binary that is unable to resolve DNS and always goes to IPv6, regardless of whether the machine has it enabled or disabled.

apparentlymart commented 1 year ago

If enabling CGo is what breaks it for you in WSL then that sounds like the opposite of this issue so far: this issue has been discussing a variation of the classic problem that the C library is basically required to do correct name resolution on macOS, and so disabling CGo seems to be one of the requirements to reproduce this on macOS.

(And, FWIW, the Terraform team has tried to correct this by fixing the build system bug that was causing the darwin_arm64 builds to have CGo disabled, so that all Darwin builds will have CGo enabled moving forward rather than just the amd64 arch as before. Hasn't really been long enough yet to determine if that actually fixed it, but in the bigger picture -- outside of Terraform's needs -- seems like disabling the pure Go resolver doesn't change the fact that the pure Go resolver is apparently doing something strange on macOS.)

pacorreia commented 1 year ago

After some extra debugging: yes, CGO_ENABLED=1 actually makes it run, while CGO_ENABLED=0 doesn't, and HashiCorp is still building the binaries with CGO_ENABLED=0.

Many thanks for your previous answer

zyxkad commented 1 year ago

I'm on an M1 Mac running Ventura 13.5 (22G74). The local DNS issue actually affects my nslookup command, but it didn't affect the Go program you provided, whether CGO is enabled or not.

$ go version
go version go1.20.4 darwin/arm64
$ go env
GO111MODULE=""
GOARCH="arm64"
GOBIN="/Users/ckpn/Mine/projects/golang/bin"
GOCACHE="/Users/ckpn/Library/Caches/go-build"
GOENV="/Users/ckpn/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/ckpn/Mine/projects/golang/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/ckpn/Mine/projects/golang"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/opt/homebrew/Cellar/go/1.20.4/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/opt/homebrew/Cellar/go/1.20.4/libexec/pkg/tool/darwin_arm64"
GOVCS=""
GOVERSION="go1.20.4"
GCCGO="gccgo"
AR="ar"
CC="cc"
CXX="c++"
CGO_ENABLED="0"
GOMOD="/Users/ckpn/Mine/projects/golang/test/dns/go.mod"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch arm64 -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/3g/rxdhvbj96lv68y35t2cvghkr0000gn/T/go-build2449011362=/tmp/go-build -gno-record-gcc-switches -fno-common"

lukasmalkmus commented 5 months ago

I'm seeing the same behaviour: go version go1.22.2 darwin/arm64!