Closed: @Kamilcuk closed this issue 8 months ago.
Hi @Kamilcuk! I've been able to repro this with a lot of attempts but don't have an obvious root cause. I know we touched the Alloc Exec API a good bit to support Actions in Nomad 1.7.0, so presumably there's something we broke there. We'll investigate and report back.
Hi, I prepared the following reproducing script, called ./reproducible.sh:
#!/bin/bash
set -xeuo pipefail
if [[ ! -d nomad-tools ]]; then
  # Setup the repo
  git clone https://github.com/Kamilcuk/nomad-tools.git
  cd nomad-tools
  git checkout 44a47aee1f4330d1e075b46285a88ebbd7303bd2
  pip install -e '.[test]'
  cd ..
fi
if ! nomad status test | grep -i running; then
  # Run the job if not running.
  nomad job run - <<'EOF'
job "test" {
  type = "batch"
  meta {
    uuid = uuidv4()
  }
  group "test" {
    task "test" {
      driver = "raw_exec"
      config {
        command = "sh"
        args = ["-xc", "sleep 60"]
      }
    }
  }
}
EOF
  while ! nomad status test | grep -i running; do sleep 1; done
fi
if ((1)); then
  # Run the test
  cd ./nomad-tools
  ./integration_tests.sh -k test_nomad_cp_complete
  cd ..
fi
Execute nomad agent -dev (version 1.7.3) in one terminal, and execute ./reproducible.sh in a second. Python3, git, and pip are needed.
I am consistently getting errors like 2024-01-18T11:43:05.885+0100 [ERROR] http: http: panic serving 127.0.0.1:40300: concurrent write to websocket connection in the logs. I was able to reproduce with Nomad 1.7.3 on my Arch Linux 6.7.0-zen3-1-zen and on WSL2 Ubuntu.
I can see the same sometimes (1.7.3 running on Ubuntu 22.04). Three nomad servers federated to another three and one client attached to each of those clusters (just a testbed).
I can see the same using v1.5.11 running in Debian 11.
I have this in my journalctl:
Jan 29 14:33:06 ip-172-16-1-119 nomad[1421528]: panic: concurrent write to websocket connection
Jan 29 14:33:06 ip-172-16-1-119 nomad[1421528]: goroutine 6432 [running]:
Jan 29 14:33:06 ip-172-16-1-119 nomad[1421528]: github.com/gorilla/websocket.(*messageWriter).flushFrame(0xc001542e88, 0x1, {0xc001da8110?, 0x7fc7d80c5a50?>
Jan 29 14:33:06 ip-172-16-1-119 nomad[1421528]: github.com/gorilla/websocket@v1.5.0/conn.go:617 +0x4b8
Jan 29 14:33:06 ip-172-16-1-119 nomad[1421528]: github.com/gorilla/websocket.(*Conn).WriteMessage(0xc004488f20, 0x4d0f750?, {0xc001da8110, 0x2, 0x2})
Jan 29 14:33:06 ip-172-16-1-119 nomad[1421528]: github.com/gorilla/websocket@v1.5.0/conn.go:770 +0x127
Jan 29 14:33:06 ip-172-16-1-119 nomad[1421528]: github.com/hashicorp/nomad/command/agent.(*HTTPServer).execStreamImpl.func2()
Jan 29 14:33:06 ip-172-16-1-119 nomad[1421528]: github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:605 +0x48b
Jan 29 14:33:06 ip-172-16-1-119 nomad[1421528]: created by github.com/hashicorp/nomad/command/agent.(*HTTPServer).execStreamImpl in goroutine 6446
Jan 29 14:33:06 ip-172-16-1-119 nomad[1421528]: github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:590 +0x3a6
Jan 29 14:33:06 ip-172-16-1-119 systemd[1]: nomad.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jan 29 14:33:06 ip-172-16-1-119 systemd[1]: nomad.service: Failed with result 'exit-code'.
This error happens specifically when using nomad alloc exec. nomad alloc exec executes some queries first and then, right after them, connects via the websocket. If I connect directly using only the websocket, even executing a lot of them, I am not able to reproduce this bug. Bottom line: this suggests the issue is about the integration of some other endpoints with the exec endpoint.
Hey folks, sorry about the delay on this. On the surface it looks like this was introduced in https://github.com/hashicorp/nomad/pull/19172 which shipped in Nomad 1.7.0 (with backports to 1.6.4 and 1.5.11), and there's definitely a bug in that PR, which I'll explain below. But that bug unfortunately doesn't explain the panic.
The relevant blocks of code are alloc_endpoint.go#L589-L655 and alloc_endpoint.go#L670-L688.
In #19172, we added a check for whether the error returned from decoding from the websocket was one of several benign "close errors". The trouble is that this check incorrectly assumed that any error other than those with valid websocket close codes was of type HTTPCodedError.
But that's not the cause of the panic! When I hit this error:
websocket: close 1006 (abnormal closure): unexpected EOF
while running a build with Go's data race detection on, I see the following data race reported:
Which means these two writes are happening at the same time: alloc_endpoint.go#L608 and alloc_endpoint.go#L628. The write on line 652 shouldn't be happening until we send on the errCh, at which point the WriteMessage call on line 608 should already have completed.
So that's puzzling. I've got a fairly straightforward fix for the error-handling bug. What I'm going to try next is moving the WriteMessage on line 652 up into the same goroutine as all the other writes. If we can still hit the bug in that case, there's likely a bug in the upstream library that we may need to code around. Will pick that up tomorrow.
Draft PR with the fix is here: https://github.com/hashicorp/nomad/pull/19932 but I'm working up a test for it before I mark that ready for review.
Nomad version
Operating system and Environment details
Arch Linux.
Issue
I am executing a lot of nomad job exec API commands to test some stuff, and Nomad logs some panics. When running a non--dev instance, the Nomad process sometimes terminates (!!!).
Reproduction steps
Run nomad 1.7.2 agent -dev.
Execute a lot of:
Expected Result
There should be no exceptions in logs.
Actual Result
There are exceptions in logs and occasionally Nomad process terminates.
Reproducible on 1.7.0, 1.7.1, 1.7.2.
Not reproducible on 1.6.3.
Job file (if appropriate)
Nomad logs
https://pastebin.com/5vhAPCB1
https://pastebin.com/ydJLgzHp