golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

net/http: Segmentation violation when using an IPSec VPN Tunnel (Go 1.19.8/Go 1.20.4/Go 1.21.4) #64020

Status: Open · yznima opened this issue 1 year ago

yznima commented 1 year ago

This is continuation of https://github.com/golang/go/issues/61552 since that one is closed.

What version of Go are you using (go version)?

$ go version
go version go1.20.4 darwin/arm64

Does this issue reproduce with the latest release?

Yes. All versions from Go 1.19 onward (inclusive) reproduce the issue.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE="on"
GOARCH="arm64"
GOBIN=""
GOCACHE="{HOME}/Library/Caches/go-build"
GOENV="{HOME}/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="{HOME}/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/nima.yahyazadeh/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/opt/homebrew/opt/go/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/opt/homebrew/opt/go/libexec/pkg/tool/darwin_arm64"
GOVCS=""
GOVERSION="go1.20.4"
GCCGO="gccgo"
AR="ar"
CC="cc"
CXX="c++"
CGO_ENABLED="1"
GOMOD="{XXX}/go.mod"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/s2/3dxjlr5n0d910kqqgq_lcp1w0000gp/T/go-build3602641218=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I compiled the client and server code below on my M1 Mac and uploaded the binaries to a Linux server.

For the client, I compiled using GOOS=linux GOARCH=amd64 go build -o client

package main

import (
    "bytes"
    "context"
    "flag"
    "fmt"
    "io"
    "io/ioutil"
    "net/http"
    "strings"
    "time"
)

var hostsFlag = flag.String("hosts", "localhost", "")
var portFlag = flag.Int("port", 3000, "")

func main() {
    flag.Parse()

    hosts := strings.Split(*hostsFlag, ",")
    ports := []int{*portFlag}

    <-time.After(2 * time.Second)

    for _, h := range hosts {
        for _, p := range ports {
            client := http.DefaultClient
            for i := 0; i < 100; i++ {
                go func(host string, port int) {
                    n := 0
                    for {
                        n++
                        if n%10 == 0 {
                            println(host, n)
                        }
                        write(client, host, port)
                        writeCancel(client, host, port)
                    }
                }(h, p)
            }
        }
    }

    select {}
}

func write(client *http.Client, host string, port int) {
    req, err := http.NewRequest(http.MethodPost, fmt.Sprintf("http://%s:%d/hello", host, port), bytes.NewReader([]byte(`{}`)))
    if err != nil {
        panic(err)
    }

    resp, err := client.Do(req)
    defer CloseSafe(resp)
    if err != nil {
        println(err.Error())
        return
    }

    handleError(resp)
}

func writeCancel(client *http.Client, host string, port int) {
    ctx, cancel := context.WithTimeout(context.Background(), time.Second)
    req, err := http.NewRequestWithContext(ctx, http.MethodPost, fmt.Sprintf("http://%s:%d/hello", host, port), bytes.NewReader([]byte(`{}`)))
    if err != nil {
        panic(err)
    }

    go cancel()

    resp, err := client.Do(req)
    defer CloseSafe(resp)
}

func handleError(resp *http.Response) {
    if resp.StatusCode != http.StatusOK {
        _, err := io.ReadAll(resp.Body)
        if err != nil {
            panic(err)
        }
    }
}

func CloseSafe(resp *http.Response) {
    if resp == nil || resp.Body == nil {
        return
    }

    _, _ = io.Copy(ioutil.Discard, resp.Body)
    _ = resp.Body.Close()
}

For the server, I compiled using GOOS=linux GOARCH=amd64 go build -o server

package main

import (
    "fmt"
    "net/http"
)

func main() {
    err := http.ListenAndServe("0.0.0.0:3000", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        http.Error(w, http.StatusText(http.StatusUnauthorized), http.StatusUnauthorized)
    }))

    fmt.Println(err)
}

I ran the server on 3 different EC2 instances, in Docker.

FROM scratch

COPY server .

ENTRYPOINT ["./server"]

I used the following commands to build and run it.

docker build . -t test:latest
docker run --rm --network host test:latest

When I ran the server on the three machines using host networking and then executed the client, the client panicked with the following stack traces:

fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x80 addr=0x0 pc=0x44bb66]

goroutine 305 [running]:
runtime.throw({0x6d35fe?, 0xc000217428?})
        /opt/homebrew/opt/go/libexec/src/runtime/panic.go:1047 +0x5d fp=0xc000217368 sp=0xc000217338 pc=0x43439d
runtime.sigpanic()
        /opt/homebrew/opt/go/libexec/src/runtime/signal_unix.go:821 +0x3e9 fp=0xc0002173c8 sp=0xc000217368 pc=0x44a429
runtime.(*waitq).enqueue(...)
        /opt/homebrew/opt/go/libexec/src/runtime/chan.go:766
runtime.selectgo(0xc000217750, 0xc000217558, 0x6c828e?, 0x0, 0xc000217540?, 0x1)
        /opt/homebrew/opt/go/libexec/src/runtime/select.go:316 +0x6c6 fp=0xc000217508 sp=0xc0002173c8 pc=0x4467e6
net/http.(*persistConn).roundTrip(0xc000a7b9e0, 0xc001027e40)
        /opt/homebrew/opt/go/libexec/src/net/http/transport.go:2638 +0x994 fp=0xc0002177c0 sp=0xc000217508 pc=0x6461b4
net/http.(*Transport).roundTrip(0x8cdda0, 0xc000223500)
        /opt/homebrew/opt/go/libexec/src/net/http/transport.go:603 +0x7fa fp=0xc0002179e8 sp=0xc0002177c0 pc=0x63a2da
net/http.(*Transport).RoundTrip(0x26?, 0x73aea0?)
        /opt/homebrew/opt/go/libexec/src/net/http/roundtrip.go:17 +0x19 fp=0xc000217a08 sp=0xc0002179e8 pc=0x6247d9
net/http.send(0xc000223500, {0x73aea0, 0x8cdda0}, {0x8?, 0x6bac80?, 0x0?})
        /opt/homebrew/opt/go/libexec/src/net/http/client.go:252 +0x5f7 fp=0xc000217c00 sp=0xc000217a08 pc=0x5edef7
net/http.(*Client).send(0x8d3d40, 0xc000223500, {0xc001095520?, 0x6?, 0x0?})
        /opt/homebrew/opt/go/libexec/src/net/http/client.go:176 +0x9b fp=0xc000217c78 sp=0xc000217c00 pc=0x5ed77b
net/http.(*Client).do(0x8d3d40, 0xc000223500)
        /opt/homebrew/opt/go/libexec/src/net/http/client.go:716 +0x8fb fp=0xc000217e80 sp=0xc000217c78 pc=0x5ef99b
net/http.(*Client).Do(...)
        /opt/homebrew/opt/go/libexec/src/net/http/client.go:582
main.write(0x73b140?, {0x6ca23e?, 0xc000217f98?}, 0x1?, {0x1?, 0x0?})
....
fatal error: runtime: sudog with non-nil next

goroutine 140 [running]:
runtime.throw({0x6cf467?, 0x4e636c?})
        /opt/homebrew/opt/go/libexec/src/runtime/panic.go:1047 +0x5d fp=0xc00016f1b8 sp=0xc00016f188 pc=0x43439d
runtime.releaseSudog(0xc00016f2e0?)
        /opt/homebrew/opt/go/libexec/src/runtime/proc.go:443 +0x316 fp=0xc00016f228 sp=0xc00016f1b8 pc=0x4377f6
runtime.semacquire1(0xc0011169a8, 0xa0?, 0x1, 0x0, 0x40?)
        /opt/homebrew/opt/go/libexec/src/runtime/sema.go:168 +0x2b9 fp=0xc00016f290 sp=0xc00016f228 pc=0x447a79
internal/poll.runtime_Semacquire(0x7f75faa8b418?)
        /opt/homebrew/opt/go/libexec/src/runtime/sema.go:67 +0x27 fp=0xc00016f2c8 sp=0xc00016f290 pc=0x461227
internal/poll.(*FD).Close(0xc001116980)
        /opt/homebrew/opt/go/libexec/src/internal/poll/fd_unix.go:116 +0x6d fp=0xc00016f2f0 sp=0xc00016f2c8 pc=0x4ad58d
net.(*netFD).Close(0xc001116980)
        /opt/homebrew/opt/go/libexec/src/net/fd_posix.go:37 +0x38 fp=0xc00016f320 sp=0xc00016f2f0 pc=0x4dde18
net.(*conn).Close(0xc0005e0340)
        /opt/homebrew/opt/go/libexec/src/net/net.go:207 +0x45 fp=0xc00016f350 sp=0xc00016f320 pc=0x4e8fc5
net.(*TCPConn).Close(0x445f93?)
        <autogenerated>:1 +0x25 fp=0xc00016f368 sp=0xc00016f350 pc=0x4f47a5
net/http.(*persistConn).closeLocked(0xc000b9e120, {0x73ac60, 0xc00008ca40})
        /opt/homebrew/opt/go/libexec/src/net/http/transport.go:2732 +0x130 fp=0xc00016f3f0 sp=0xc00016f368 pc=0x646bb0
net/http.(*persistConn).cancelRequest(0x4112e4?, {0x73ac60?, 0xc00008c170?})
        /opt/homebrew/opt/go/libexec/src/net/http/transport.go:1991 +0xe5 fp=0xc00016f450 sp=0xc00016f3f0 pc=0x642f65
net/http.(*persistConn).cancelRequest-fm({0x73ac60?, 0xc00008c170?})
        <autogenerated>:1 +0x39 fp=0xc00016f478 sp=0xc00016f450 pc=0x651f19
net/http.(*Transport).cancelRequest(0x8cdda0, {0xc00016f528?}, {0x73ac60, 0xc00008c170})
        /opt/homebrew/opt/go/libexec/src/net/http/transport.go:810 +0xee fp=0xc00016f4d8 sp=0xc00016f478 pc=0x63b36e
net/http.(*persistConn).roundTrip(0xc000b9e120, 0xc000cc2380)
        /opt/homebrew/opt/go/libexec/src/net/http/transport.go:2684 +0xbac fp=0xc00016f790 sp=0xc00016f4d8 pc=0x6463cc
net/http.(*Transport).roundTrip(0x8cdda0, 0xc0007d1a00)
        /opt/homebrew/opt/go/libexec/src/net/http/transport.go:603 +0x7fa fp=0xc00016f9b8 sp=0xc00016f790 pc=0x63a2da
net/http.(*Transport).RoundTrip(0xc00016f9f8?, 0x73aea0?)
        /opt/homebrew/opt/go/libexec/src/net/http/roundtrip.go:17 +0x19 fp=0xc00016f9d8 sp=0xc00016f9b8 pc=0x6247d9
net/http.send(0xc0007d1a00, {0x73aea0, 0x8cdda0}, {0x8?, 0x6bac80?, 0x0?})
        /opt/homebrew/opt/go/libexec/src/net/http/client.go:252 +0x5f7 fp=0xc00016fbd0 sp=0xc00016f9d8 pc=0x5edef7
net/http.(*Client).send(0x8d3d40, 0xc0007d1a00, {0xc000ded2b0?, 0x6?, 0x0?})
        /opt/homebrew/opt/go/libexec/src/net/http/client.go:176 +0x9b fp=0xc00016fc48 sp=0xc00016fbd0 pc=0x5ed77b
net/http.(*Client).do(0x8d3d40, 0xc0007d1a00)
        /opt/homebrew/opt/go/libexec/src/net/http/client.go:716 +0x8fb fp=0xc00016fe50 sp=0xc00016fc48 pc=0x5ef99b
net/http.(*Client).Do(...)
        /opt/homebrew/opt/go/libexec/src/net/http/client.go:582
main.writeCancel(0x73b140?, {0x6ca22a, 0x14}, 0x1?, {0x1?, 0x0?})
....

What did you expect to see?

I expected not to see a panic.

What did you see instead?

I saw a panic from within the standard library. This issue occurred when one of the servers was down.

randall77 commented 1 year ago

This is strange.

runtime.(*waitq).enqueue(...) /opt/homebrew/opt/go/libexec/src/runtime/chan.go:766

At that line, the pointer dereference can't possibly be nil.

This looks like memory corruption. Have you tried running with the race detector? In particular, the go cancel() line looks strange. Isn't the request still using that context? Perhaps you mean defer cancel()?

yznima commented 1 year ago

@randall77

Isn't the request still using that context? Perhaps you mean defer cancel()?

No, that is on purpose, to simulate a context cancellation racing the request.

At that line, the pointer dereference can't possibly be nil.

That's what I thought. I was able to capture a core dump of a similar process that was panicking with the same error. The sudog points to an invalid address, so it is definitely memory corruption. It also seems like the sudog struct is getting released or overwritten. In fact, looking at the two stack traces, one is trying to release the sudog while the other is trying to add it to the waitq; I suspect those two calls are racing with each other. I've tried running it with -race, but that doesn't reproduce the issue. I don't know why.

See the core dump

[Screenshot (2023-11-08): core dump showing the sudog pointing to an invalid address]
yznima commented 1 year ago

Adding the last GC and SCHED trace lines printed before the panic:

SCHED

SCHED 24409ms: gomaxprocs=2 idleprocs=0 threads=13 spinningthreads=1 needspinning=1 idlethreads=9 runqueue=0 [0 161]
fatal error: unexpected signal during runtime execution
SCHED 24409ms: gomaxprocs=2 idleprocs=0 threads=13 spinningthreads=1 needspinning=1 idlethreads=9 runqueue=0 gcwaiting=false nmidlelocked=0 stopwait=0 sysmonwait=false
  P0: status=1 schedtick=96440 syscalltick=121539 m=0 runqsize=0 gfreecnt=61 timerslen=21
  P1: status=1 schedtick=100650 syscalltick=126085 m=5 runqsize=163 gfreecnt=52 timerslen=9
  M12: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=true lockedg=nil
  M11: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=false lockedg=nil
  M10: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=true lockedg=nil
  M9: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=true lockedg=nil
  M8: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=true lockedg=nil
  M7: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=true lockedg=nil
  M6: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=true lockedg=nil
  M5: p=1 curg=194 mallocing=1 throwing=2 preemptoff= locks=3 dying=1 spinning=false blocked=false lockedg=nil
  M4: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=true lockedg=nil
  M3: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=true lockedg=nil
  M2: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=true lockedg=nil
  M1: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=false lockedg=nil
  M0: p=0 curg=166125 mallocing=0 throwing=0 preemptoff= locks=3 dying=0 spinning=false blocked=false lockedg=nil

  ...
  G194: status=2(sync.Mutex.Lock) m=5 lockedm=nil
  G166125: status=2(sync.Mutex.Lock) m=0 lockedm=nil
  ...

GC

gc 96 @24.262s 6%: 0.082+33+0.003 ms clock, 0.16+14/10/0+0.006 ms cpu, 8->8->3 MB, 8 MB goal, 1 MB stacks, 0 MB globals, 2 P
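For reference, traces like the ones above come from the runtime's GODEBUG tracing; a sketch of how they can be enabled when launching the client (host names are hypothetical):

```shell
# schedtrace=1000 emits a SCHED summary line every 1000ms; gctrace=1 emits
# one line per GC cycle. Both are written to stderr.
GODEBUG='schedtrace=1000,gctrace=1' ./client -hosts host1,host2,host3 -port 3000
```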
mknyszek commented 11 months ago

Sorry for the lack of follow-up here; it got into our triage queue last week, but we didn't get to it. We looked at it this week, but probably no one will follow up until next week because of the US holiday.

I tried to reproduce at tip-of-tree and with go1.21.0 (just what I had lying around) with some slight modifications (3 servers all running on localhost on 3 different ports), but I haven't been able to yet. I'll leave it running for a while. How quickly does this reproduce for you?

This is mysterious enough that maybe we want to take some shots in the dark:

I also noticed that quite a few people have given this issue a thumbs-up; could anyone who did so briefly follow-up? Are you also affected? Is this happening often? Is there any data you can share about the execution environment?

Thanks.

yznima commented 11 months ago

Hey @mknyszek, I've gathered some of the information you were looking for. I'll get back to you about the Kernel Version.

How quickly does this reproduce for you?

If I run the program about 10 times, stopping each run after 1 minute, I'm guaranteed to see it at least once.

What version of the Linux kernel are you running?

This is happening on CentOS 7.9.

Can you reproduce if you build your programs from a Linux machine? What about the same machine you're testing on?

I rebuilt on the same machine using the latest Go version and still reproduced the issue.

Does it reproduce if GODEBUG=asyncpreemptoff=1 is set?

Yes, it still reproduces the issue.

mknyszek commented 11 months ago

Thanks. Does this fail on other Linux machines? Perhaps with different Linux distros and/or different Linux kernel versions? I haven't been able to reproduce so far.

yznima commented 11 months ago

@mknyszek I've been able to reproduce it on other Linux machines, as long as they ran the same OS. I haven't been able to reproduce it on any other distro or kernel version; other than CentOS, I've only tested on Amazon Linux 2 (AL2).

yznima commented 10 months ago

Hi @mknyszek,

Happy New Year 🎉. I apologize for the delay in responding, but I haven't forgotten about this issue, nor have I stopped working on it. Today I was able to put together much more detailed instructions and information that point more closely to why this issue occurs.

In summary, the problem seems to arise specifically when there is an IPSec tunnel between two nodes. I have consistently and quickly reproduced this bug when the connection between the two nodes is secured using an IPSec tunnel. Below are detailed instructions on how to reproduce this issue. Please note that my testing has been in an AWS environment, so I've tailored the instructions to align closely with AWS. Feel free to make adjustments to suit your specific environment.

Create two EC2 instances

Setup the VPN Tunnel

sudo yum install -y libreswan # Using Libreswan version 3.25
sudo systemctl enable ipsec --now

# Connection definition (in an ipsec.conf include, e.g. /etc/ipsec.d/*.conf):
conn instance-2
    type=transport
    left=<INSTANCE2_IP>
    right=<INSTANCE1_IP>
    authby=secret
    auto=start

# Pre-shared key (in a secrets file, e.g. /etc/ipsec.d/*.secrets):
<INSTANCE2_IP> <INSTANCE1_IP> : PSK "<PRE_SHARED_KEY>"

# Load and bring up the connections:
ipsec auto --add instance-1
/sbin/service ipsec restart
/usr/sbin/ipsec auto --up instance-1
ipsec auto --add instance-2
/sbin/service ipsec restart
/usr/sbin/ipsec auto --up instance-2

You should see a log line such as the following, indicating the VPN tunnel is established:

STATE_V2_IPSEC_I: IPsec SA established transport mode
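You can also confirm the SAs directly; `ipsec status` and `ipsec whack --trafficstatus` are standard Libreswan commands, though the exact output format varies by version:

```shell
# List loaded connections and established SAs:
sudo ipsec status | grep -i established

# Show per-SA traffic counters (non-empty output means ESP traffic is flowing):
sudo ipsec whack --trafficstatus
```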

One quick note: for the IPSec tunnel, your security group should be configured as shown below to allow the tunnel to be created.

[Screenshot (2024-01-10): security group rules required for the IPSec tunnel]

Run the executables
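A hypothetical sketch of the invocation, reusing the binaries and flags from the reproducer above (IP placeholders as in the tunnel setup):

```shell
# On each server instance (listens on 0.0.0.0:3000):
./server &

# On the client instance, pointing at the servers through the tunnel:
./client -hosts <INSTANCE1_IP>,<INSTANCE2_IP> -port 3000
```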

You should be able to see the issue reproduce within a minute.

Hope this helps. Looking forward to hearing back from you.

yznima commented 9 months ago

@mknyszek I wanted to add a new piece of information I've discovered. It might be useful in your investigation, and it is a note for anyone else who runs into this in the future: using strongSwan instead of Libreswan resolves the issue. In addition, the issue doesn't reproduce with the AL2 AMI.