yznima opened this issue 1 year ago
This is strange.
runtime.(*waitq).enqueue(...) /opt/homebrew/opt/go/libexec/src/runtime/chan.go:766
At that line, the pointer dereference can't possibly be nil.
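For reference, the enqueue at that line is roughly the following: a simplified paraphrase of runtime.(*waitq).enqueue from runtime/chan.go, with a stub sudog type standing in for the real struct (the real sudog has many more fields, and exact line numbers differ across Go versions). Note that sgp is always non-nil here and q.last is nil-checked before it is dereferenced.

package chanpara // illustrative paraphrase only, not the real runtime package

// sudog is a stub standing in for runtime.sudog.
type sudog struct {
	prev, next *sudog
}

// waitq mirrors runtime.waitq: an intrusive doubly linked list of sudogs.
type waitq struct {
	first *sudog
	last  *sudog
}

// enqueue appends sgp to the tail of the wait queue.
func (q *waitq) enqueue(sgp *sudog) {
	sgp.next = nil
	x := q.last
	if x == nil { // empty queue: sgp becomes both head and tail
		sgp.prev = nil
		q.first = sgp
		q.last = sgp
		return
	}
	sgp.prev = x // non-empty queue: link sgp after the current tail
	x.next = sgp
	q.last = sgp
}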
This looks like memory corruption. Have you tried running with the race detector?
In particular, the go cancel() line looks strange. Isn't the request still using that context? Perhaps you mean defer cancel()?
@randall77
Isn't the request still using that context? Perhaps you mean defer cancel()?
No, that is on purpose, to simulate a context cancellation.
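For concreteness, the kind of pattern being discussed looks roughly like this (a minimal sketch, not the reporter's actual client code; the URL, timeout, and bare loop are placeholder assumptions):

package main

import (
	"context"
	"log"
	"net/http"
	"time"
)

func main() {
	for {
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		// Hypothetical target address; the real client talks to remote servers.
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://127.0.0.1:8080", nil)
		if err != nil {
			log.Fatal(err)
		}
		// Cancel on a separate goroutine instead of defer cancel(), so the
		// cancellation deliberately races with the in-flight request.
		go cancel()
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			log.Print(err)
			continue
		}
		resp.Body.Close()
	}
}

Using defer cancel() would release the context only after the surrounding function returns; calling cancel on its own goroutine makes the cancellation race the in-flight request, which is the scenario being simulated.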
At that line, the pointer dereference can't possibly be nil.
That's what I thought. I was able to capture a core dump of a similar process that was panicking with the same error. The sudog points to an invalid address, so this is definitely memory corruption. It also looks like the sudog struct is being released or overwritten: comparing the two stack traces, one is trying to release the sudog while the other is trying to add it to the waitq, and I suspect those two calls are racing with each other. I've tried running with -race, but that doesn't reproduce the issue; I don't know why.
See the core dump
Adding the last GC and SCHED trace lines emitted before the panic:
SCHED
SCHED 24409ms: gomaxprocs=2 idleprocs=0 threads=13 spinningthreads=1 needspinning=1 idlethreads=9 runqueue=0 [0 161]
fatal error: unexpected signal during runtime execution
SCHED 24409ms: gomaxprocs=2 idleprocs=0 threads=13 spinningthreads=1 needspinning=1 idlethreads=9 runqueue=0 gcwaiting=false nmidlelocked=0 stopwait=0 sysmonwait=false
P0: status=1 schedtick=96440 syscalltick=121539 m=0 runqsize=0 gfreecnt=61 timerslen=21
P1: status=1 schedtick=100650 syscalltick=126085 m=5 runqsize=163 gfreecnt=52 timerslen=9
M12: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=true lockedg=nil
M11: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=false lockedg=nil
M10: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=true lockedg=nil
M9: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=true lockedg=nil
M8: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=true lockedg=nil
M7: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=true lockedg=nil
M6: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=true lockedg=nil
M5: p=1 curg=194 mallocing=1 throwing=2 preemptoff= locks=3 dying=1 spinning=false blocked=false lockedg=nil
M4: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=true lockedg=nil
M3: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=true lockedg=nil
M2: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=true lockedg=nil
M1: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=false lockedg=nil
M0: p=0 curg=166125 mallocing=0 throwing=0 preemptoff= locks=3 dying=0 spinning=false blocked=false lockedg=nil
...
G194: status=2(sync.Mutex.Lock) m=5 lockedm=nil
G166125: status=2(sync.Mutex.Lock) m=0 lockedm=nil
...
GC
gc 96 @24.262s 6%: 0.082+33+0.003 ms clock, 0.16+14/10/0+0.006 ms cpu, 8->8->3 MB, 8 MB goal, 1 MB stacks, 0 MB globals, 2 P
Sorry for the lack of follow-up here; it got into our triage queue last week, but we didn't get to it. We looked at it this week, but probably no one will follow up until next week because of the US holiday.
I tried to reproduce at tip-of-tree and with go1.21.0 (just what I had lying around) with some slight modifications (3 servers all running on localhost on 3 different ports), but I haven't been able to yet. I'll leave it running for a while. How quickly does this reproduce for you?
This is mysterious enough that maybe we want to take some shots in the dark:
What version of the Linux kernel are you running?
Can you reproduce if you build your programs from a Linux machine? What about the same machine you're testing on?
Does it reproduce if GODEBUG=asyncpreemptoff=1 is set?
I also noticed that quite a few people have given this issue a thumbs-up; could anyone who did so briefly follow up? Are you also affected? Is this happening often? Is there any data you can share about the execution environment?
Thanks.
Hey @mknyszek, I've gathered some of the information you were looking for. I'll get back to you about the kernel version.
How quickly does this reproduce for you?
If I run the program about 10 times, stopping each run after 1 minute, I'm guaranteed to see it at least once.
What version of the Linux kernel are you running?
This is happening on CentOS 7.9.
Can you reproduce if you build your programs from a Linux machine? What about the same machine you're testing on?
I rebuilt on the same machine using the latest Go version and still reproduced the issue.
Does it reproduce if GODEBUG=asyncpreemptoff=1 is set?
Yes, it still reproduces the issue.
Thanks. Does this fail on other Linux machines? Perhaps with different Linux distros and/or different Linux kernel versions? I haven't been able to reproduce so far.
@mknyszek I've been able to reproduce it on other Linux machines as long as they were running the same OS. I haven't been able to reproduce it on any other distro or kernel version; other than that, I've only tested it on Amazon AL2.
Hi @mknyszek,
Happy New Year 🎉. I apologize for the delay in responding, but I haven't forgotten about this issue, nor have I stopped working on it. Today I was able to put together much more detailed instructions and information that point more closely to why this issue occurs.
In summary, the problem seems to arise specifically when there is an IPSec tunnel between two nodes. I have consistently and quickly reproduced this bug when the connection between the two nodes is secured using an IPSec tunnel. Below are detailed instructions on how to reproduce this issue. Please note that my testing has been in an AWS environment, so I've tailored the instructions to align closely with AWS. Feel free to make adjustments to suit your specific environment.
sudo yum install -y libreswan # Using Libreswan version 3.25
sudo systemctl enable ipsec --now
Create a pre-shared key using the following command
openssl rand -base64 128 # Put the output on a single line when using it in the files below
On instance 1, create the following files with the content shown
conn instance-2
type=transport
left=<INSTANCE2_IP>
right=<INSTANCE1_IP>
authby=secret
auto=start
<INSTANCE2_IP> <INSTANCE1_IP> : PSK "<PRE_SHARED_KEY>"
ipsec auto --add instance-1
/sbin/service ipsec restart
/usr/sbin/ipsec auto --up instance-1
ipsec auto --add instance-2
/sbin/service ipsec restart
/usr/sbin/ipsec auto --up instance-2
You should see a log line such as the following, indicating the VPN tunnel has been created
STATE_V2_IPSEC_I: IPsec SA established transport mode
One quick note: your security group should be configured as follows to allow the IPSec tunnel to be created.
./server
./client --hosts <INSTANCE2_IP>
You should be able to see the issue reproduce within a minute.
Hope this helps. Looking forward to hearing back from you.
@mknyszek I wanted to add a new piece of information I've discovered. It might be useful in your investigation, and it's also a note for anyone else who runs into this in the future: using StrongSWAN instead of LibreSWAN resolves the issue. In addition, using the AL2 AMI doesn't reproduce the issue.
This is a continuation of https://github.com/golang/go/issues/61552, since that one is closed.
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Yes. All versions from Go 1.19 onward, inclusive, reproduce the issue.
What operating system and processor architecture are you using (go env)?
What did you do?
I compiled the client code and server code as follows on my M1 Mac and uploaded them to a Linux server.
For the client I compiled using
GOOS=linux GOARCH=amd64 go build -o client
For the server I compiled using
GOOS=linux GOARCH=amd64 go build -o server
I ran the server on 3 different EC2 instances, in Docker.
I used the following command to run it.
When I ran the server on the three different machines using host networking and then executed the client, I saw the client panic with the following stack traces.
What did you expect to see?
I expected not to see a panic.
What did you see instead?
I saw a panic from within the standard library. This issue occurred when one of the servers was down.