hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/
Other
14.87k stars 1.95k forks source link

panic serving: concurrent write to websocket connection #19506

Closed Kamilcuk closed 8 months ago

Kamilcuk commented 10 months ago

Nomad version

$ /usr/bin/nomad --version
Nomad v1.7.2
BuildDate 2023-12-13T19:59:42Z
Revision 64e3dca9274b493e38a49fda3a70fd31d0485b91

Operating system and Environment details

Archlinux.

Issue

I am executing a lot of nomad job exec API commands to test some stuff and Nomad logs some panics:

    2023-12-16T22:56:25.020+0100 [DEBUG] http: request complete: method=GET path="/v1/client/allocation/9ececd9c-a005-40e5-b2ef-38b4c2454424/exec?command=%5B%22sh%22%2C%22-c%22%
2C%22%5Cn++++++++++++find+%5C%22%242%5C%22+-maxdepth+1+-mindepth+1+-type+f+-name+%5C%22%243%2A%5C%22%5Cn++++++++++++find+%5C%22%242%5C%22+-maxdepth+1+-mindepth+1+-type+d+-name+%
5C%22%243%2A%5C%22+%7C%5Cn++++++++++++++++while+IFS%3D+read+-r+line%3B+do+echo+%5C%22%24line%2F%5C%22%3B+done%5Cn++++++++++++%22%2C%22--%22%2C%22%2Fusr%2Fbi%22%2C%22%2Fusr%2F%22
%2C%22bi%22%5D&namespace=default&region=global&task=test_nomad_cp_complete&tty=false" duration=9.321342ms                                                                        
    2023-12-16T22:56:25.020+0100 [ERROR] http: http: panic serving 127.0.0.1:50746: concurrent write to websocket connection                                                     
goroutine 3830 [running]:                                                                                                                                                        
net/http.(*conn).serve.func1()                                                                                                                                                   
        net/http/server.go:1868 +0xb9                                                                                                                                            
panic({0x28e3c80?, 0x37baf50?})                                                                                                                                                  
        runtime/panic.go:920 +0x270                                                                                                                                              
github.com/gorilla/websocket.(*messageWriter).flushFrame(0xc001227428, 0x1, {0xc00170e6c0?, 0x7fe1d8062338?, 0x7fe220b35a68?})                                                   
        github.com/gorilla/websocket@v1.5.0/conn.go:617 +0x4b8                                                                                                                   
github.com/gorilla/websocket.(*Conn).WriteMessage(0xc0010bcb00, 0xc000ac8900?, {0xc00170e6c0, 0x3a, 0x3a})                                                                       
        github.com/gorilla/websocket@v1.5.0/conn.go:770 +0x127                                                                                                                   
github.com/hashicorp/nomad/command/agent.(*HTTPServer).execStreamImpl(0xc000b4a0e0, 0xc0010bcb00, 0xc00118db00)                                                                  
        github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:652 +0x5bd                                                                                                    
github.com/hashicorp/nomad/command/agent.(*HTTPServer).allocExec(0xc000b4a0e0, {0xc0014346da, 0x24}, {0x37d6e70, 0xc0009b67e0}, 0xc000efbb00)                                    
        github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:519 +0x685                                                                                                    
github.com/hashicorp/nomad/command/agent.(*HTTPServer).ClientAllocRequest(0xc000b4a0e0, {0x37d6e70, 0xc0009b67e0}, 0xc000efbb00)                                                 
        github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:227 +0x15d                                                                                                    
github.com/hashicorp/nomad/command/agent.(*HTTPServer).registerHandlers.(*HTTPServer).wrap.func45({0x37d6e70, 0xc0009b67e0}, 0xc000efbb00)                                       
        github.com/hashicorp/nomad/command/agent/http.go:715 +0x168                                                                                                              
net/http.HandlerFunc.ServeHTTP(0xc0009be540?, {0x37d6e70?, 0xc0009b67e0?}, 0xc000efbb00?)                                                                                        
        net/http/server.go:2136 +0x29                                                                                                                                            
github.com/hashicorp/nomad/command/agent.wrapCORSWithAllowedMethods.(*Cors).Handler.func1({0x37d6e70, 0xc0009b67e0}, 0xc000efbb00)                                               
        github.com/rs/cors@v1.8.3/cors.go:236 +0x184                                                                                                                             
net/http.HandlerFunc.ServeHTTP(0xc000c7df80?, {0x37d6e70?, 0xc0009b67e0?}, 0xc000d6b9e8?)                                                                                        
        net/http/server.go:2136 +0x29                                                                                                                                            
net/http.(*ServeMux).ServeHTTP(0xc000c8ea10?, {0x37d6e70, 0xc0009b67e0}, 0xc000efbb00)                                                                                           
        net/http/server.go:2514 +0x142                                                                                                                                           
github.com/hashicorp/nomad/command/agent.NewHTTPServers.CompressHandler.CompressHandlerLevel.func3({0x37d6e70?, 0xc0009b67e0}, 0xc000efbb00)
        github.com/gorilla/handlers@v1.5.1/compress.go:105 +0x58d
net/http.HandlerFunc.ServeHTTP(0x51f50c0?, {0x37d6e70?, 0xc0009b67e0?}, 0xc000d6bb50?)
        net/http/server.go:2136 +0x29
net/http.serverHandler.ServeHTTP({0xc000c7df20?}, {0x37d6e70?, 0xc0009b67e0?}, 0x6?)
        net/http/server.go:2938 +0x8e
net/http.(*conn).serve(0xc00100cab0, {0x37e1438, 0xc00070d860})
        net/http/server.go:2009 +0x5f4
created by net/http.(*Server).Serve in goroutine 264
        net/http/server.go:3086 +0x5cb

When running not -dev instance, sometimes Nomad process terminates (!!!):

    2023-12-16T23:51:19.381+0100 [INFO]  client: task exec session ended: exec_id=7f5f46fc-22ef-af79-ce50-e3a442c02aed
    2023-12-16T23:51:19.381+0100 [ERROR] http: request failed: method=GET path="/v1/client/allocation/e6e0b6a7-1dbe-01b6-4c7a-a7a3e3299dc4/exec?command=%5B%22sh%22%2C%22-c%22%2C%22%5Cn++++++++++++find+%5C%22%242%5C%22+-maxdepth+1+-mindepth+1+-type+f+-name+%5C%22%243%2A%5C%22%5Cn++++++++++++find+%5C%22%242%5C%22+-maxdepth+1+-mindepth+1+-type+d+-name+%5C%22%243%2A%5C%22+%7C%5Cn++++++++++++++++while+IFS%3D+read+-r+line%3B+do+echo+%5C%22%24line%2F%5C%22%3B+done%5Cn++++++++++++%22%2C%22--%22%2C%22.%2Fh%22%2C%22.%2F%22%2C%22h%22%5D&namespace=default&region=global&task=test_nomad_cp_complete&tty=false" error="websocket: close 1006 (abnormal closure): unexpected EOF" code=500
panic: concurrent write to websocket connection

goroutine 1809 [running]:
github.com/gorilla/websocket.(*messageWriter).flushFrame(0xc000e82e88, 0x1, {0xc000df8568?, 0xdc899b?, 0x7f5cf22aa5b8?})
        github.com/gorilla/websocket@v1.5.0/conn.go:617 +0x4b8
github.com/gorilla/websocket.(*Conn).WriteMessage(0xc000ff5a20, 0x4f15760?, {0xc000df8568, 0x2, 0x2})
        github.com/gorilla/websocket@v1.5.0/conn.go:770 +0x127
github.com/hashicorp/nomad/command/agent.(*HTTPServer).execStreamImpl.func2()
        github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:608 +0x48b
created by github.com/hashicorp/nomad/command/agent.(*HTTPServer).execStreamImpl in goroutine 1806
        github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:593 +0x3a6

Reproduction steps

Run nomad 1.7.2 agent -dev.

Execute a lot of:

nomad alloc exec -t=false -i=false -task=test_nomad_cp_complete e6e0b6a7-1dbe-01b6-4c7a-a7a3e3299dc4 sh -c '                                                                   
            find "$2" -maxdepth 1 -mindepth 1 -type f -name "$3*"                                                                                                                
            find "$2" -maxdepth 1 -mindepth 1 -type d -name "$3*" |                                                                                                              
                while IFS= read -r line; do echo "$line/"; done                                                                                                                  
            ' -- ./h ./h

Expected Result

There should be no exceptions in logs.

Actual Result

There are exceptions in logs and occasionally Nomad process terminates.

Reproducible on 1.7.0, 1.7.1, 1.7.2.

Not reproducible on 1.6.3.

Job file (if appropriate)

variable "script" {
  type = string
  default = ""
}
job "test-test_nomad_cp_complete" {
  type = "batch"
  meta {
    uuid = uuidv4()
  }
  group "test_nomad_cp_complete" {
    reschedule {
      attempts = 0
    }
    restart {
      attempts = 0
    }
    task "test_nomad_cp_complete" {
      driver = "raw_exec"
      config {
        command = "sh"
        args = ["-xc", <<-EOF
          exec sleep 60
          ${var.script}
        EOF
        ]
      }
    }
  }
}

Nomad logs

https://pastebin.com/5vhAPCB1

https://pastebin.com/ydJLgzHp

tgross commented 9 months ago

Hi @Kamilcuk! I've been able to repro this with a lot of attempts but don't have an obvious root cause. I know we touched the Alloc Exec API a good bit to support Actions in Nomad 1.7.0, so presumably there's something we broke there. We'll investigate and report back.

Kamilcuk commented 9 months ago

Hi, I prepared the following reproducing script called ./reproducible.sh:

#!/bin/bash
set -xeuo pipefail
if [[ ! -d nomad-tools ]]; then
    # Setup the repo
    git clone https://github.com/Kamilcuk/nomad-tools.git
    cd nomad-tools
    git checkout 44a47aee1f4330d1e075b46285a88ebbd7303bd2
    pip install -e '.[test]'
    cd ..
fi
if ! nomad status test | grep -i running; then
    # Run the job if not running.
    nomad job run - <<'EOF'
    job "test" {
      type = "batch"
      meta {
        uuid = uuidv4()
      }
      group "test" {
        task "test" {
          driver = "raw_exec"
          config {
            command = "sh"
            args = ["-xc", "sleep 60"]
          }
        }
      }
    }
EOF
    while ! nomad status test | grep -i running; do sleep 1; done
fi
if ((1)); then
    # Run the test
    cd ./nomad-tools
    ./integration_tests.sh -k test_nomad_cp_complete
    cd ..
fi

Execute nomad agent -dev version 1.7.3 on one terminal, and execute ./reproducible.sh on a second. Python3 and git and pip is needed.

I am consistently getting 2024-01-18T11:43:05.885+0100 [ERROR] http: http: panic serving 127.0.0.1:40300: concurrent write to websocket connectio errors in logs. I was able to reproduce with nomad 1.7.3 on my Arch Linux 6.7.0-zen3-1-zen and on WSL2 ubuntu.

bra-fsn commented 8 months ago

I can see the same sometimes (1.7.3 running on Ubuntu 22.04). Three nomad servers federated to another three and one client attached to each of those clusters (just a testbed).

fqborges commented 8 months ago

I can see the same using v1.5.11 running in Debian 11.

I have this on my journalctl

Jan 29 14:33:06 ip-172-16-1-119 nomad[1421528]: panic: concurrent write to websocket connection
Jan 29 14:33:06 ip-172-16-1-119 nomad[1421528]: goroutine 6432 [running]:
Jan 29 14:33:06 ip-172-16-1-119 nomad[1421528]: github.com/gorilla/websocket.(*messageWriter).flushFrame(0xc001542e88, 0x1, {0xc001da8110?, 0x7fc7d80c5a50?>
Jan 29 14:33:06 ip-172-16-1-119 nomad[1421528]:         github.com/gorilla/websocket@v1.5.0/conn.go:617 +0x4b8
Jan 29 14:33:06 ip-172-16-1-119 nomad[1421528]: github.com/gorilla/websocket.(*Conn).WriteMessage(0xc004488f20, 0x4d0f750?, {0xc001da8110, 0x2, 0x2})
Jan 29 14:33:06 ip-172-16-1-119 nomad[1421528]:         github.com/gorilla/websocket@v1.5.0/conn.go:770 +0x127
Jan 29 14:33:06 ip-172-16-1-119 nomad[1421528]: github.com/hashicorp/nomad/command/agent.(*HTTPServer).execStreamImpl.func2()
Jan 29 14:33:06 ip-172-16-1-119 nomad[1421528]:         github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:605 +0x48b
Jan 29 14:33:06 ip-172-16-1-119 nomad[1421528]: created by github.com/hashicorp/nomad/command/agent.(*HTTPServer).execStreamImpl in goroutine 6446
Jan 29 14:33:06 ip-172-16-1-119 nomad[1421528]:         github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:590 +0x3a6
Jan 29 14:33:06 ip-172-16-1-119 systemd[1]: nomad.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jan 29 14:33:06 ip-172-16-1-119 systemd[1]: nomad.service: Failed with result 'exit-code'.
Kamilcuk commented 8 months ago

This error specifically when using nomad alloc exec. nomad alloc exec executes some queries first and then right right after them connects via the websocket. If I connect straight using only the websocket, even executing a lot of them, I am not able to reproduce this bug. Bottom line, this suggests, this issue is about integration of some other endpoints with the exec endpoint.

tgross commented 8 months ago

Hey folks, sorry about the delay on this. On the surface it looks like this was introduced in https://github.com/hashicorp/nomad/pull/19172 which shipped in Nomad 1.7.0 (with backports to 1.6.4 and 1.5.11), and there's definitely a bug in that PR, which I'll explain below. But that bug unfortunately doesn't explain the panic.

The relevant blocks of code are alloc_endpoint.go#L589-L655 and alloc_endpoint.go#L670-L688.

In #19172, we added a check if the error returned from decoding from the websocket was one of several benign "close errors". The trouble is that this check incorrectly assumed that errors other than those with valid websocket message error codes were of type HTTPCodedError.

But that's not the cause of the panic! When I hit this error:

websocket: close 1006 (abnormal closure): unexpected EOF"

while running a build with Go's data race detection on, I see the following data race reported:

data race ``` ================== WARNING: DATA RACE Read at 0x00c00019d258 by goroutine 6211: github.com/gorilla/websocket.(*Conn).beginMessage() github.com/gorilla/websocket@v1.5.0/conn.go:479 +0x56 github.com/gorilla/websocket.(*Conn).WriteMessage() github.com/gorilla/websocket@v1.5.0/conn.go:764 +0xee github.com/hashicorp/nomad/command/agent.(*HTTPServer).execStreamImpl() github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:652 +0x975 github.com/hashicorp/nomad/command/agent.(*HTTPServer).allocExec() github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:519 +0xa08 github.com/hashicorp/nomad/command/agent.(*HTTPServer).ClientAllocRequest() github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:227 +0x2ab github.com/hashicorp/nomad/command/agent.(*HTTPServer).ClientAllocRequest-fm() :1 +0x5d github.com/hashicorp/nomad/command/agent.(*HTTPServer).registerHandlers.(*HTTPServer).wrap.func45() github.com/hashicorp/nomad/command/agent/http.go:716 +0x1e7 net/http.HandlerFunc.ServeHTTP() net/http/server.go:2136 +0x47 github.com/hashicorp/nomad/command/agent.wrapCORSWithAllowedMethods.(*Cors).Handler.func1() github.com/rs/cors@v1.8.3/cors.go:236 +0x2f1 net/http.HandlerFunc.ServeHTTP() net/http/server.go:2136 +0x47 net/http.(*ServeMux).ServeHTTP() net/http/server.go:2514 +0xbc github.com/hashicorp/nomad/command/agent.NewHTTPServers.CompressHandler.CompressHandlerLevel.func3() github.com/gorilla/handlers@v1.5.1/compress.go:105 +0x738 net/http.HandlerFunc.ServeHTTP() net/http/server.go:2136 +0x47 net/http.serverHandler.ServeHTTP() net/http/server.go:2938 +0x2a1 net/http.(*conn).serve() net/http/server.go:2009 +0xc24 net/http.(*Server).Serve.func3() net/http/server.go:3086 +0x4f Previous write at 0x00c00019d258 by goroutine 6214: github.com/gorilla/websocket.(*messageWriter).endMessage() github.com/gorilla/websocket@v1.5.0/conn.go:546 +0xce github.com/gorilla/websocket.(*messageWriter).flushFrame() github.com/gorilla/websocket@v1.5.0/conn.go:633 +0xb64 github.com/gorilla/websocket.(*Conn).WriteMessage() github.com/gorilla/websocket@v1.5.0/conn.go:770 +0x1c4 github.com/hashicorp/nomad/command/agent.(*HTTPServer).execStreamImpl.func2() github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:608 +0x768 github.com/hashicorp/go-msgpack/codec.(*decReaderSwitch).readn1() github.com/hashicorp/go-msgpack@v1.1.5/codec/decode.go:2224 +0x47 github.com/hashicorp/go-msgpack/codec.(*msgpackDecDriver).readNextBd() github.com/hashicorp/go-msgpack@v1.1.5/codec/msgpack.go:742 +0x6b github.com/hashicorp/go-msgpack/codec.(*msgpackDecDriver).TryDecodeAsNil() github.com/hashicorp/go-msgpack@v1.1.5/codec/msgpack.go:791 +0x55 github.com/hashicorp/go-msgpack/codec.(*Decoder).mustDecode() github.com/hashicorp/go-msgpack@v1.1.5/codec/decode.go:2536 +0x4b github.com/hashicorp/go-msgpack/codec.(*Decoder).Decode() github.com/hashicorp/go-msgpack@v1.1.5/codec/decode.go:2519 +0xe5 github.com/hashicorp/nomad/command/agent.(*HTTPServer).execStreamImpl.func2() github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:606 +0x324 Goroutine 6211 (running) created at: net/http.(*Server).Serve() net/http/server.go:3086 +0x80c github.com/hashicorp/nomad/command/agent.NewHTTPServers.func1() github.com/hashicorp/nomad/command/agent/http.go:191 +0xc4 Goroutine 6214 (finished) created at: github.com/hashicorp/nomad/command/agent.(*HTTPServer).execStreamImpl() github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:593 +0x66a github.com/hashicorp/nomad/command/agent.(*HTTPServer).allocExec() github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:519 +0xa08 github.com/hashicorp/nomad/command/agent.(*HTTPServer).ClientAllocRequest() github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:227 +0x2ab github.com/hashicorp/nomad/command/agent.(*HTTPServer).ClientAllocRequest-fm() :1 +0x5d github.com/hashicorp/nomad/command/agent.(*HTTPServer).registerHandlers.(*HTTPServer).wrap.func45() github.com/hashicorp/nomad/command/agent/http.go:716 +0x1e7 net/http.HandlerFunc.ServeHTTP() net/http/server.go:2136 +0x47 github.com/hashicorp/nomad/command/agent.wrapCORSWithAllowedMethods.(*Cors).Handler.func1() github.com/rs/cors@v1.8.3/cors.go:236 +0x2f1 net/http.HandlerFunc.ServeHTTP() net/http/server.go:2136 +0x47 net/http.(*ServeMux).ServeHTTP() net/http/server.go:2514 +0xbc github.com/hashicorp/nomad/command/agent.NewHTTPServers.CompressHandler.CompressHandlerLevel.func3() github.com/gorilla/handlers@v1.5.1/compress.go:105 +0x738 net/http.HandlerFunc.ServeHTTP() net/http/server.go:2136 +0x47 net/http.serverHandler.ServeHTTP() net/http/server.go:2938 +0x2a1 net/http.(*conn).serve() net/http/server.go:2009 +0xc24 net/http.(*Server).Serve.func3() net/http/server.go:3086 +0x4f ================== ```

Which means these two writes are happening at the same time: alloc_endpoint.go#L608 and alloc_endpoint.go#L628. The write on line 652 shouldn't be happening until we send on the errCh, at which point the WriteMessage call on line 608 should already have completed.

So that's puzzling. I've got a fairly straightforward fix for the error handling bug. What I'm going to try to do next is move the WriteMessage on line 652 up into the same goroutine as all the other writes. If we can hit the bug even in that case, there's likely a bug in the upstream library that we may need to code around. Will pick that up tomorrow.

tgross commented 8 months ago

Draft PR with the fix is here: https://github.com/hashicorp/nomad/pull/19932 but I'm working up a test for it before I mark that ready for review.

tgross commented 8 months ago

19932 has been merged and will ship in the next version of Nomad 1.7.x, with backports to 1.6.x and 1.5.x