golang / go


net: 512 byte DNS response size limit causes "cannot unmarshal DNS" error #51127

Closed AaronFriel closed 2 years ago

AaronFriel commented 2 years ago

So, you found this issue googling for "cannot unmarshal DNS"

There's good news: your issue has largely been fixed. The issue below was created initially because I discovered it in my network and operating system, but further discovery found that this issue has affected every major OS and users of VPNs, DNS providers written in Go, and more.

If you are a maintainer of code and someone has reported this issue: if you can update your build system to use Go 1.16.15 or 1.17.8, or Go 1.18, then you should see this go away and solve your users' issues.

If you are a user of a program and see this error, you need to ask the maintainer or creator of that package to do likewise. Unfortunately, there isn't a single set of instructions I can give for a workaround. If you're using a VPN, try using that program not on a VPN; that seems to be the most common user-reported scenario I've seen.


Original bug report:

What version of Go are you using (go version)?

$ go version
go version go1.17.6 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

Note: WSL2 on Windows. This is relevant, but it is not the only scenario in which the issue can occur; see below.

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/friel/.cache/go-build"
GOENV="/home/friel/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/friel/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/friel/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/friel/.local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/friel/.local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.17.6"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/friel/go/src/github.com/pulumi/pulumi-yaml/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build3112884807=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Use infrastructure as code tools to manage Azure, and/or attempt to execute net.LookupIP("management.azure.com").

Example program:

package main

import (
    "fmt"
    "net"
)

func main() {
    ips, err := net.LookupIP("management.azure.com")
    if err != nil {
        panic(err)
    }
    for _, ip := range ips {
        fmt.Println(ip)
    }
}

What did you expect to see?

I expected to see the current IP, 13.86.219.80, as shown by the last line of:

$ host management.azure.com
management.azure.com is an alias for management.privatelink.azure.com.
management.privatelink.azure.com is an alias for arm-frontdoor-prod.trafficmanager.net.
arm-frontdoor-prod.trafficmanager.net is an alias for westus.management.azure.com.
westus.management.azure.com is an alias for arm-frontdoor-westus.trafficmanager.net.
arm-frontdoor-westus.trafficmanager.net is an alias for westus.cs.management.azure.com.
westus.cs.management.azure.com is an alias for rpfd-prod-by-01.cloudapp.net.
rpfd-prod-by-01.cloudapp.net has address 13.86.219.80

What did you see instead?

$ go run resolve-test.go 
panic: lookup management.azure.com on 172.20.32.1:53: cannot unmarshal DNS message

goroutine 1 [running]:
main.main()
        /home/friel/c/resolve-test/resolve-test.go:11 +0xe8
exit status 2

Miscellany

It looks like this issue is widely affecting infrastructure as code tools such as Pulumi, Terraform, and others when they make API calls to Microsoft Azure on the Windows Subsystem for Linux 2, on Microsoft Windows.

This is a bit of a rock-and-a-hard-place situation. Microsoft is unlikely to update their DNS server to adhere to the pre-1999 DNS specification. The Go language team is in a position to be much more agile and could issue a point release that supports a larger buffer size; even going up to a single standard MTU of ~1500 bytes would resolve this issue in the near term.

As this problem primarily affects programs written in Go, in this author's estimation it seems unlikely a change in Windows' DNS server behavior could occur as quickly, even if the stars were to align on the need to change the implementation. Note that host, dig, nslookup, etc. all behave correctly.

Collected notes and root cause analysis:

DNS Flag Day 2020 had an explicit goal of ensuring that resolvers had a minimum accepted buffer size of 1232 bytes: https://dnsflagday.net/2020/#action-dns-resolver-operators
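
To illustrate why the receive buffer size matters, here is a loopback sketch. It is not the resolver's code; it only shows what a single read into a fixed-size buffer does with a larger UDP datagram. The helper `readDatagram` and the 586-byte payload are illustrative assumptions, and the silent truncation shown is the behavior I would expect on Linux and other Unix-like systems (other platforms may report an error instead).

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// readDatagram sends payloadSize zero bytes to a loopback UDP socket and
// returns how many bytes a single read into a bufSize-byte buffer receives.
func readDatagram(payloadSize, bufSize int) (int, error) {
	server, err := net.ListenPacket("udp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer server.Close()

	client, err := net.Dial("udp", server.LocalAddr().String())
	if err != nil {
		return 0, err
	}
	defer client.Close()

	if _, err := client.Write(make([]byte, payloadSize)); err != nil {
		return 0, err
	}

	// Avoid blocking forever if the datagram is lost.
	if err := server.SetReadDeadline(time.Now().Add(2 * time.Second)); err != nil {
		return 0, err
	}
	buf := make([]byte, bufSize)
	n, _, err := server.ReadFrom(buf)
	return n, err
}

func main() {
	for _, size := range []int{512, 1232} {
		n, err := readDatagram(586, size)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("buffer of %d bytes: received %d of 586 bytes\n", size, n)
	}
}
```

With the smaller buffer the tail of the datagram is dropped, so a parser that walks the message will run out of bytes before it reaches the final records.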

gopherbot commented 2 years ago

Change https://go.dev/cl/384076 mentions this issue: net/dns: Increase UDP response buffer to 1232 bytes

mvdan commented 2 years ago

cc @ianlancetaylor @neild

gutzi commented 2 years ago

Workaround: We were able to work around the problem by adding a DNS entry to the hosts file:

51.107.60.33 management.azure.com

When using WSL, the hosts file can be edited in Windows at %windir%\system32\drivers\etc\hosts; restart WSL afterwards. So at least we could use Terraform again.

AaronFriel commented 2 years ago

For what it's worth, there is no generally applicable workaround that fixes users' experience without other side effects and possible downsides.

That IP isn't the same IP I see, so I wonder if there's some geographic DNS response occurring.

seankhliao commented 2 years ago

previously #11070

seankhliao commented 2 years ago

Even from the linked site, the recommendation for the increased buffer size is for EDNS0 which is not implemented here (ref #6464). Equally important on their site is the support for TCP, and had WSL followed spec and returned a proper truncated response, it would have been retried gracefully.
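
The "proper truncated response" mentioned here refers to the TC bit in the DNS header. As a rough sketch (the function name `inspectResponse` and the sample message are hypothetical), a client can check both the TC bit and the raw message size from the fixed 12-byte header defined in RFC 1035:

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	headerLen       = 12     // fixed DNS header size
	classicUDPLimit = 512    // UDP message limit without EDNS(0)
	flagTC          = 0x0200 // TC (truncated) bit in the 16-bit flags word
)

// inspectResponse reports whether the TC bit is set and whether the message
// exceeds the classic 512-byte UDP limit.
func inspectResponse(msg []byte) (truncated, oversized bool, err error) {
	if len(msg) < headerLen {
		return false, false, errors.New("message shorter than the 12-byte DNS header")
	}
	flags := binary.BigEndian.Uint16(msg[2:4])
	return flags&flagTC != 0, len(msg) > classicUDPLimit, nil
}

func main() {
	// A made-up 586-byte response with QR, RD and RA set but TC clear: the
	// out-of-spec combination discussed in this thread.
	msg := make([]byte, 586)
	msg[2], msg[3] = 0x81, 0x80
	tc, big, err := inspectResponse(msg)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("truncated=%v oversized=%v\n", tc, big)
}
```

A spec-following server would instead send at most 512 bytes with TC set, which tells the client to retry the query over TCP.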

AaronFriel commented 2 years ago

@seankhliao

I would push back on the notion that this should be resolved elsewhere.

Go is the exception to behaving correctly: other userland programs such as dig(1), nslookup(1), host(1), as well as glibc API calls such as getaddrinfo(3) work. I can write Python, C#, Rust, C, etc, and those will work correctly in this networking environment.

Go is adhering strictly to an antiquated standard. EDNS0 has been a standard since 1999, and larger responses are neither a new specification nor the result of rapidly moving network standards or the ground shifting under Go. Strict adherence to 512-byte responses is not followed by other tools in the same ecosystem. Go ought to "be liberal in what it accepts", within reason, unless doing so would violate memory safety or other safety criteria of the software.

End-users are not in a position to solve their upstream DNS server's issues, nor are software maintainers. We don't have control over our end user's DNS servers.

This error isn't unique to the situation I described; it's just most acute right now for users in the specific scenario I documented. 112 issues have been reported on GitHub with the text "cannot unmarshal DNS", and a survey of those shows that they have occurred across all platforms and among extraordinarily widely used pieces of software across Mac, Windows, and *nix. Those issues show that various other VPN providers, ISPs, and routers have all behaved similarly. And going back to the earlier points, users don't have control over those things, and we shouldn't expect all Go software users to be software engineers or to be able to modify their DNS configuration.

Lastly, I strongly believe that software that works is superior to software that does not, and end-users of the software will not care what link in the chain is causing it not to work.

There is an opportunity to mitigate, in one place, an issue that end-users are facing. I think bringing Golang into alignment with the rest of the ecosystem will positively impact users.

mdempsky commented 2 years ago

Thanks for the report.

> Microsoft is unlikely to update their DNS server to adhere to the pre-1999 DNS specification.

Why not? It's been a while since I've read DNS RFCs, but my impression is still today that DNS servers are not allowed to send >512-byte responses unless the client explicitly indicates support for such using EDNS.

As such, I feel like emphasizing "pre-1999" is unfair. I think Microsoft should update their DNS server to adhere to the DNS specification. I'd prefer we don't add hacks to accommodate non-spec behavior.

However, #6464 remains open if someone wants to update Go's DNS client to use EDNS, and to support+advertise a larger buffer size. I think that's the standards-conforming way to address this issue, if folks aren't willing to wait on the issue being fixed in WSL2.

AaronFriel commented 2 years ago

Hey @mdempsky I would like this re-opened please. Any way we could get on a call to chat about this?

leitzler commented 2 years ago

Just for reference, there is a (currently open) issue over at WSL that should cover this: https://github.com/microsoft/WSL/issues/7642. I'd suggest adding your findings there as well.

AaronFriel commented 2 years ago

Understood, though I'd like to chat with someone on the Go language team about the scope & impact of this issue. It's affecting customers of major Go language-built software & has for about seven years. It's particularly acute because, I suspect, none of the players wants to take responsibility for fixing this.

End users do not care why their software is broken, but we have an opportunity here to address, at least partially, thousands of issues raised by users over the past 7 years. And if the Pareto principle is applicable here, I suspect those users knowledgeable enough and motivated enough to comment on GitHub are just a fraction of those impacted.

mdempsky commented 2 years ago

> Hey @mdempsky I would like this re-opened please. Any way we could get on a call to chat about this?

Why? What do you hope these requests would accomplish?

As stated, the Go DNS client is spec-compliant to the best of my knowledge, and a feature request issue (#6464) already exists that I believe would make it more accommodating to non-compliant DNS software like WSL2. It just needs someone to implement it. I'm happy to review CLs.

leitzler commented 2 years ago

With all due respect, Go is an open source project, and I think that your best bet to get a desired change through isn't via a private call with a maintainer.

AaronFriel commented 2 years ago

Other languages & libraries use larger buffers and accept larger responses in order to "be liberal in what they accept" to tolerate non-compliant implementations, and a concerted effort by a consortium of DNS implementations and stakeholders pushed for a larger acceptable buffer size in 2020, more than two decades after that specification was accepted.

And end users do not care why their software does not work. I think a phone call might be a better channel to have an empathetic conversation over the issues I've read & the litany of closed/unsolved issues reported against packages on GitHub, StackOverflow, and elsewhere.

Otherwise, I can keep replying, but I don't see any responses to my points on the merits so far. I would like to raise the bar from this text-based conversation to one that's more empathetic toward end-users.

I think we should try, here, to solve customer, end-user problems.

mdempsky commented 2 years ago

> I think we should try, here, to solve customer, end-user problems.

We've identified two ways to do that already: have WSL2 fix their DNS server (https://github.com/microsoft/WSL/issues/7642), or implement #6464.

AaronFriel commented 2 years ago

Would anything break by using a larger default buffer for responses? I think that's what glibc does, and as observed previously I think Go is an outlier here among languages & libraries in not tolerating a larger response.

ianlancetaylor commented 2 years ago

There is something I don't understand here. Apparently some DNS server is out of spec by sending packets greater than 512 bytes without setting the truncated bit. But it can't be the regular Microsoft server, or Go programs on all systems would be reporting problems, not just programs on WSL. Does WSL run a local name server? What is the nameserver entry in resolv.conf? What happens if you change it to 8.8.8.8 or 1.1.1.1?

CC @jstarks for WSL issue.

AaronFriel commented 2 years ago

@ianlancetaylor First, you're right, the WSL2 DNS server is out of spec. No question there.

Second, let's take a step back - this isn't a WSL2 specific issue. Fixing the acute issue users are facing in WSL2 is WSL2 specific, but I'd encourage you to read the many, many comments on GitHub issues. https://github.com/search?o=asc&q=%22cannot+unmarshal+DNS%22&s=created&type=Issues

Starting with these issues which predate WSL2.

I'm using a red circle to indicate that a user's problem was never solved, a yellow circle to indicate that a workaround was implemented to mitigate customer issues, but didn't root cause them, and a green circle when a project that is actually a DNS server solved the issue. I'm also using GitHub Markdown's list notation to provide partially unfurled data about the link destination via just pasting in URLs.

Consul

Confd

Docker

Kubernetes

Weave

rakyll/drive / odeke-em/drive

Mesos, again

Resolvable, a Docker DNS resolver

Goproxy

Moby / then Docker

freegeoip

heroku

clair

Docker for Mac

gorush application server

Docker for Mac

AaronFriel commented 2 years ago

I think that software that works is better than software that doesn't work, and if a partial mitigation before EDNS0 support lands in Go would have prevented these issues, shouldn't it have been done? How many frustrated users is too many?

That's just the first two pages of results from the GitHub issues. I'll continue tomorrow.

gopherbot commented 2 years ago

Change https://go.dev/cl/385035 mentions this issue: net: send EDNS(0) packet length in DNS query

ianlancetaylor commented 2 years ago

@AaronFriel Can you or someone else with WSL see if https://go.dev/cl/385035 fixes the problem? That CL uses EDNS(0) to advertise a permitted packet size of 1232 bytes.
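
For readers unfamiliar with how a packet size is advertised, the sketch below builds the EDNS(0) OPT pseudo-record described in RFC 6891 by hand. It is an illustration of the wire format only, not the code from the CL; the helper names `appendUint16` and `appendOPT` are hypothetical.

```go
package main

import "fmt"

const (
	typeOPT       = 41   // RR type of the EDNS(0) OPT pseudo-record
	maxDNSPayload = 1232 // advertised UDP payload size, as in the CL above
)

// appendUint16 appends v to b in network (big-endian) byte order.
func appendUint16(b []byte, v uint16) []byte {
	return append(b, byte(v>>8), byte(v))
}

// appendOPT appends an OPT pseudo-record advertising udpSize to msg. In a
// full query it belongs in the additional section, and the caller must also
// increment ARCOUNT in the header.
func appendOPT(msg []byte, udpSize uint16) []byte {
	msg = append(msg, 0)                // owner name: root
	msg = appendUint16(msg, typeOPT)    // TYPE
	msg = appendUint16(msg, udpSize)    // CLASS field carries the UDP payload size
	msg = append(msg, 0, 0, 0, 0)       // TTL field: extended RCODE, version 0, no flags
	msg = appendUint16(msg, 0)          // RDLENGTH: no options attached
	return msg
}

func main() {
	fmt.Printf("% x\n", appendOPT(nil, maxDNSPayload))
	// prints "00 00 29 04 d0 00 00 00 00 00 00"
}
```

The record is 11 bytes long, so advertising a larger buffer adds very little to each query.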

Although I have to say that if there are DNS servers out there that incorrectly send responses larger than 512 bytes in the absence of an EDNS(0) packet length, then I suspect that there are DNS servers out there that will simply ignore the EDNS(0) packet length and send whatever packet size they feel like. So I don't know how much this will actually help.

AaronFriel commented 2 years ago

@ianlancetaylor I can, with great enthusiasm, report that your CL causes the test case to pass in the issue.

🎉🎉🎉🎉🎉

It took me a bit to figure out how to check out the CL - I used the base64 encoded blob, not sure if that's the easiest way to do it - but I did build Go locally. And the result of running the test command is starkly different.

Go 1.17.6:

$ ~/.local/go/bin.1.17.6/go run resolve-test.go 
panic: lookup management.azure.com on 172.20.32.1:53: cannot unmarshal DNS message

goroutine 1 [running]:
main.main()
        /home/friel/c/tmp/resolve-test.go:11 +0xe8
exit status 2

With patch applied:

$ ~/c/gh/go/bin/go run resolve-test.go 
13.86.219.80

I rebuilt the Pulumi toolchain that a user reported this error on and which I was able to reproduce, and I can confirm that issue is mitigated as well.

I anticipate this would resolve issues for our friends and colleagues in the infra-as-code ecosystem, as well as anyone else using Go tooling to manage or authenticate with Azure, and likely many of the issues folks experienced with non-conforming DNS resolvers outside their control, whether part of a proxy, a VPN, their ISP's routers, or otherwise.

If this could be included in the next dot release of Go, I would be eternally grateful. 🙇

mateusz834 commented 2 years ago

@ianlancetaylor There is a special edns0 option in /etc/resolv.conf that indicates support for EDNS0. Maybe it should be used? I mean, if options edns0 is not set, then glibc does not send DNS packets with EDNS0.

First query without options edns0, second with. (Screenshot: Screenshot_20220211_081825)

AaronFriel commented 2 years ago

@mateusz834 I see the same, though glibc still correctly handles the oversize response:

(Alt text: the following image shows a Wireshark packet capture, depicting a DNS request from the local machine's IP to the WSL2 DNS server. The EDNS0 bit is not set in the options for the request. The response size from the DNS server is 586 bytes.)


mateusz834 commented 2 years ago

Is options edns0 set by WSL2 by default? I think that EDNS0 should be opt-in, like in glibc. But if WSL2 does not add options edns0 by default, then it still won't fix it.

ianlancetaylor commented 2 years ago

At this stage of DNS I don't see a reason to make EDNS(0) opt-in. It was always intended to be fully backward compatible. The edns0 option was added to glibc in 2007. I think it's safe to use by default today.

mateusz834 commented 2 years ago

OK. Also, I think an empty case for edns0 should be added in net/dnsconfig_unix.go so that the option does not cause a fallback to the cgo resolver.
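
To make the suggestion concrete, here is a simplified and entirely hypothetical sketch of resolv.conf option handling. It is not the code in net/dnsconfig_unix.go; `parseOptions` and its return value only model the idea that an unrecognized option triggers a fallback, while a recognized option with an empty case does not.

```go
package main

import (
	"fmt"
	"strings"
)

// parseOptions is a hypothetical stand-in for resolv.conf "options" parsing.
// It reports whether the line contains an option this sketch does not
// recognize, which here models the condition that would cause a fallback.
func parseOptions(line string) (unknown bool) {
	fields := strings.Fields(line)
	if len(fields) == 0 || fields[0] != "options" {
		return false
	}
	for _, opt := range fields[1:] {
		switch {
		case strings.HasPrefix(opt, "ndots:"),
			strings.HasPrefix(opt, "timeout:"),
			strings.HasPrefix(opt, "attempts:"):
			// Value parsing omitted in this sketch.
		case opt == "rotate":
			// Recognized; handling omitted in this sketch.
		case opt == "edns0":
			// Deliberately empty: recognized, nothing to configure.
		default:
			unknown = true
		}
	}
	return unknown
}

func main() {
	for _, line := range []string{"options edns0", "options edns0 trust-ad"} {
		fmt.Printf("%q -> unknown option present: %v\n", line, parseOptions(line))
	}
}
```

In this model, adding another empty case for a further option would be a one-line change.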

ianlancetaylor commented 2 years ago

Thanks, I'll do that as a follow-up CL if this one moves forward.

mateusz834 commented 2 years ago

And I would also consider doing the same with the trust-ad option. Golang does not use the ad flag anywhere, so it should be safe to silently ignore that option and not cause fallback to cgo.

mdempsky commented 2 years ago

For new feature requests you'd like to see considered/implemented, please file new issues. Thanks.

ianlancetaylor commented 2 years ago

Filed #51153 as a freeze request to include https://go.dev/cl/385035 in 1.18.

gopherbot commented 2 years ago

Change https://go.dev/cl/385374 mentions this issue: [release-branch.go1.17] net: send EDNS(0) packet length in DNS query

gopherbot commented 2 years ago

Change https://go.dev/cl/385375 mentions this issue: [release-branch.go1.16] net: send EDNS(0) packet length in DNS query

gopherbot commented 2 years ago

Change https://go.dev/cl/386015 mentions this issue: net: increase maximum accepted DNS packet to 1232 bytes

gopherbot commented 2 years ago

Change https://go.dev/cl/386016 mentions this issue: net: send EDNS(0) packet length in DNS query

gopherbot commented 2 years ago

Change https://go.dev/cl/386014 mentions this issue: Revert "net: send EDNS(0) packet length in DNS query"

gopherbot commented 2 years ago

Change https://go.dev/cl/386034 mentions this issue: [release-branch.go1.16] net: increase maximum accepted DNS packet to 1232 bytes

gopherbot commented 2 years ago

Change https://go.dev/cl/386035 mentions this issue: [release-branch.go1.17] net: increase maximum accepted DNS packet to 1232 bytes