Closed anacrolix closed 5 months ago
I believe it has something to do with DNS over UDP over IPv6. I can reproduce it by running:
go install github.com/anacrolix/publicip/cmd/publicip
publicip -6
While using the Mullvad VPN. I suspect even if it's an edge case and the packet is being intercepted, it shouldn't be panicking my application.
I believe it has something to do with DNS over UDP over IPv6. I can reproduce it by running:
go install github.com/anacrolix/publicip/cmd/publicip publicip -6
While using the Mullvad VPN. I suspect even if it's an edge case and the packet is being intercepted, it shouldn't be panicking my application.
I can run this command without error on linux/amd64, platform related issue?
panic: runtime error: slice bounds out of range [54:45] goroutine 65 [running]: internal/poll.(*FD).Write(0xc0001f6080, {0xc00015e002, 0x2d, 0x200}) /opt/homebrew/Cellar/go/1.20.3/libexec/src/internal/poll/fd_unix.go:383 +0x49c
The relevant block of code (in go1.20.3
) is here:
https://cs.opensource.google/go/go/+/refs/tags/go1.20.3:src/internal/poll/fd_unix.go;l=379-386;drc=a2baae6851a157d662dff7cc508659f66249698a
That would seem to imply that at that point nn
is 54 and max
is 45.
max
is len(p)
or nn + maxRW
, whichever is smaller. (maxRW
is 1 << 30
, so in this case it must be len(p)
.)Write
loop terminates when nn == len(p)
after a call to syscall.Write
with a slice of length max - nn
.nn == len(p)
, and increments nn
by the number of bytes reported by syscall.Write
.Unfortunately, the most plausible explanations both seem unlikely:
syscall.Write
returned an n
larger than len(p[nn:max])
,cgo
, or unsafe
, or a bug in runtime
or syscall
, or a kernel or libc bug?) corrupted some local variable in (*FD).Write
or syscall.Write
or syscall.write
.The latter possibility makes me think of #60449, but note that that is for amd64
whereas this report is for arm64
.
But the fact that this reproduces for you “[w]hile using the Mullvad VPN” makes me wonder if something about the VPN is causing the libc write
call to return an incorrect count. Perhaps (*FD).Write
should check for that explicitly and return an error for it?
(CC @ianlancetaylor, @golang/runtime)
Wait, no. In that stack trace goroutine 65
is running, not panicking. Maybe that goroutine stack is a red herring.
@anacrolix, can you post the complete goroutine dump from a failure?
Thank you for looking into this @bcmills. That's actually the only thing it outputs when it crashes. Here it is with
We actually have a similar issue on Intel Macs (go1.20.3, go1.20.5): panics while connecting to VPN using Cisco AnyConnect. arm64 macs running same go code compiled for Apple Silicon don't panic.
It seems, at some point syscall.Write
returns 33554436
.
panic: runtime error: slice bounds out of range [33554436:43]
goroutine 7738 [running]:
internal/poll.(*FD).Write(0xc000842100, {0xc0005f01e0, 0x2b, 0x1c5})
/usr/local/go/src/internal/poll/fd_unix.go:383 +0x4aa
net.(*netFD).Write(0xc000842100, {0xc0005f01e0?, 0xc000837890?, 0x100ed61e0?})
/usr/local/go/src/net/fd_posix.go:96 +0x29
net.(*conn).Write(0xc000506000, {0xc0005f01e0?, 0xc0008378f0?, 0xc0005f01e0?})
/usr/local/go/src/net/net.go:195 +0x45
crypto/tls.(*Conn).write(0xc0007dd500, {0xc0005f01e0?, 0x5?, 0x1c5?})
/usr/local/go/src/crypto/tls/conn.go:923 +0x10d
crypto/tls.(*Conn).writeRecordLocked(0xc0007dd500, 0x17, {0xc000516000, 0x15, 0x1000})
/usr/local/go/src/crypto/tls/conn.go:991 +0x354
crypto/tls.(*Conn).Write(0x0?, {0xc000516000, 0x15, 0x1000})
/usr/local/go/src/crypto/tls/conn.go:1186 +0x411
net/http.http2stickyErrWriter.Write({{0x101381f18?, 0xc0007dd500?}, 0xc0004a2460?, 0xc0006602c0?}, {0xc000516000, 0x15, 0x1000})
/usr/local/go/src/net/http/h2_bundle.go:7429 +0x149
bufio.(*Writer).Flush(0xc000596340)
/usr/local/go/src/bufio/bufio.go:628 +0x62
net/http.(*http2ClientConn).writeHeaders(0xc000660180, 0x9, 0x0, 0x4000, {0xc00066db00?, 0x0?, 0x2403?})
/usr/local/go/src/net/http/h2_bundle.go:8579 +0x195
net/http.(*http2clientStream).encodeAndWriteHeaders(0xc000acb380, 0xc0007f3900)
/usr/local/go/src/net/http/h2_bundle.go:8455 +0x38e
net/http.(*http2clientStream).writeRequest(0xc000acb380, 0xc0007f3900)
/usr/local/go/src/net/http/h2_bundle.go:8343 +0x528
net/http.(*http2clientStream).doRequest(0xc000ac34f0?, 0xc0005bc701?)
/usr/local/go/src/net/http/h2_bundle.go:8261 +0x1e
created by net/http.(*http2ClientConn).RoundTrip
/usr/local/go/src/net/http/h2_bundle.go:8190 +0x34a
What's interesting is that the left index is always the same 33554436
, if you convert it into hex you will get 0x02000004
, which corresponds to write syscall: (SYSCALL_CLASS_UNIX << SYSCALL_CLASS_SHIFT) + 4
Can this be a trampoline messing with memory layout? or some kind of alignment issue?
UPD: Apple Silicon Macs panic as well but with arbitrary left index.
UPD2: Managed to build minimal repro script, need to be run while VPN client is establishing connection:
package main
import (
"log"
"net/http"
"time"
)
func main() {
for {
time.Sleep(500 * time.Millisecond)
log.Println("hello")
resp, err := http.Get("https://google.com")
if err != nil {
continue
}
resp.Body.Close()
}
}
Disabling HTTP KeepAlives "fixes" the issue. It seems, VPN client rewrites some network routes and when Go tries to reuse a connection from a pool something in libc or kernel breaks and Write() syscall returns incorrect value.
Huh. At the very least (*poll.FD).Write
should probably validate that n < max - nn
and treat the connection as broken.
Even so, we should narrow down whether it is a Go bug or a macOS bug.
darwin/amd64
binary?Personally I don't think that the Go standard library should have to double-check that the write
system call behaves as expected.
It should be possible to use dtruss to verify that write
is returning an impossible length. That would be a useful way to determine whether it is indeed a kernel problem.
Personally I don't think that the Go standard library should have to double-check that the
write
system call behaves as expected.
I agree with that in principle, but I also think that if we have reason to believe that a particular system call may be broken, it benefits our users to make the problem easier to diagnose — and the run-time cost of an else if nn > len(p)
here should be negligible compared to the cost of the syscall.
(I don't think we need to try to rush a check into 1.21 or backport it to older releases, but I do think we should consider it for 1.22 so that if this happens for other users they will be able to figure out what's going on more easily.)
- Can you reproduce this same behavior with a program written in C?
I tried, but no luck so far.
When I run go reproducer under dtruss I can't see any write
calls returning more bytes than number of bytes passed in. So, I think it's not kernel but libc or go runtime bug.
- Does the Go reproducer also reproduce the bug in a
darwin/amd64
binary?
Will try next week when I get access to my Intel Mac.
Disabling ipv6 also "fixes" this issue.
I can run publicip -6 with dtruss and Mullvad if it helps
So, I think it's not kernel but libc or go runtime bug.
I'm inclined to suspect libc
because of the involvement of VPNs. (I wouldn't be surprised if the VPN layer works by injecting itself into libc at link time.)
I also wouldn't be terribly surprised if the reproducer requires multiple threads. If there is a synchronization bug in the VPN layer, the fact that the Go runtime may make its Write
calls from arbitrarily different threads may be important.
I couldn't determine much from the output. Without a VPN I see:
write(0x6, "\347\017\001\0", 0x2D) = 45 0
and with I see:
write(0x6, "\0", 0x2D) = 0 Err#-2
AFAICT, this is the UDP socket it's sending DNS queries from. Let me know if the full dtruss output is useful.
Thanks. Both of those system calls look OK (although error -2 is weird). What we suspect is a case where the system call returns a number that is larger than the third argument. If there are no such calls in dtruss then it could conceivably be a C library problem somehow, as suggested earlier. How is the VPN implemented? Does it somehow intercept calls to the C write
function?
Does the Go reproducer also reproduce the bug in a darwin/amd64 binary?
Just tried the go reproducer on darwin/amd64
, it reproduces the bug.
Behaviour is a bit different though: left index is always 33554436
.
if you convert it into hex you will get 0x02000004, which corresponds to write syscall: (SYSCALL_CLASS_UNIX << SYSCALL_CLASS_SHIFT) + 4
The VPN is Mullvad. I only know that it's using wireguard underneath.
I also get the same panic when using the same program under wine64 on Darwin. I'm inclined to think this is a Go resolver issue from the original stack trace I provided, and not Darwin specific.
@anacrolix, if this is a kernel (or perhaps even a libc) bug, wouldn't wine64 on Darwin be subject to that bug too?
Wine “translates Windows API calls into POSIX calls on-the-fly”, so presumably it would end up in the same POSIX write(2)
call under the hood, and if Wine itself doesn't check the result it could very easily write the bogus value obtained from libc
back through the lpNumberOfBytesWritten
argument of the translated WriteFile
call.
Sorry I forgot to follow up, I tested on Windows directly and there was no issue as far as I can recall.
Can't reproduce this bug on macOS Sonoma Beta 7 release, probably Apple fixed it.
I'm also on Sonoma and can't reproduce it there. Yes, my original app, and the publicip -6
reproduction above both work.
Can't reproduce this bug on macOS Sonoma Beta 7 release, probably Apple fixed it.
Apple "un-fixed" it in Sonoma Stable :( Most likely related to pf
bug fix @anacrolix mentioned.
Can't reproduce panic while running go repro script together with while true; do sudo pfctl -d; done
I'll try again with Sonoma stable and report back.
Yes, it crashes again on Sonoma stable (14.0 (23A344)
).
panic: runtime error: slice bounds out of range [48:45]
goroutine 19 [running]:
internal/poll.(*FD).Write(0x1400018a000, {0x140000e2002, 0x2d, 0x200})
/Users/anacrolix/src/go1.21/src/internal/poll/fd_unix.go:380 +0x3ac
net.(*netFD).Write(0x1400018a000, {0x140000e2002?, 0x140000e0048?, 0x104342a0c?})
/Users/anacrolix/src/go1.21/src/net/fd_posix.go:96 +0x28
net.(*conn).Write(0x1400018e000, {0x140000e2002?, 0x140000dffe8?, 0x1043372e4?})
/Users/anacrolix/src/go1.21/src/net/net.go:191 +0x34
net.dnsPacketRoundTrip({_, _}, _, {{{0x6d, 0x79, 0x69, 0x70, 0x2e, 0x6f, 0x70, ...}, ...}, ...}, ...)
/Users/anacrolix/src/go1.21/src/net/dnsclient_unix.go:102 +0x74
net.(*Resolver).exchange(_, {_, _}, {_, _}, {{{0x6d, 0x79, 0x69, 0x70, 0x2e, ...}, ...}, ...}, ...)
/Users/anacrolix/src/go1.21/src/net/dnsclient_unix.go:187 +0x394
net.(*Resolver).tryOneName(_, {_, _}, _, {_, _}, _)
/Users/anacrolix/src/go1.21/src/net/dnsclient_unix.go:277 +0x370
net.(*Resolver).goLookupIPCNAMEOrder.func3.1(0x1c?)
/Users/anacrolix/src/go1.21/src/net/dnsclient_unix.go:653 +0x68
created by net.(*Resolver).goLookupIPCNAMEOrder.func3 in goroutine 18
/Users/anacrolix/src/go1.21/src/net/dnsclient_unix.go:652 +0x15c
I'm having the same issue, running Sonoma 14.1.1
and Go 1.21.3
This happens when my device is cut off from the network with firewall rules, but DNS IPs are whitelisted and I try to hit a host.
Without going into too much detail I'm setting up some pfctl rules, starting with block all
and then allowing some select traffic to pass through.
This has been working fine for a while, just started to encounter this with either an updated Go version or updated Mac version I think?
I also see the exact same error as above: panic: runtime error: slice bounds out of range [33554436:69]
Change https://go.dev/cl/577955 mentions this issue: internal/poll: better panic for invalid write return value
I've improved the panic message so that it clearly indicates the problem. I don't think there is anything else to do here. If the write
system call reports that it wrote more bytes than we asked it to write, there is no reasonable way for us to continue. We have no idea how many bytes were actually written.
What version of Go are you using (
go version
)?Does this issue reproduce with the latest release?
It occurs with 1.21, 1.20, and earlier version too (untested)
What operating system and processor architecture are you using (
go env
)?go env
OutputWhat did you do?
Probably try to resolve an IP
What did you expect to see?
It resolve
What did you see instead?