hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Response from nomad server fails #24348

Open a-bangk opened 3 weeks ago

a-bangk commented 3 weeks ago

Nomad version

Nomad v1.9.0 BuildDate 2024-10-10T07:13:43Z Revision 7ad36851ec02f875e0814775ecf1df0229f0a615

Operating system and Environment details

Host: Windows Server 2019 Datacenter
Client: Windows Server 2019 Standard

Issue

The client forcibly closes the connection from the host in one of our environments.

Reproduction steps

Client.conf file

data_dir = "C:/nomad/data"

bind_addr = "0.0.0.0"

datacenter = "example_client"

client {
    enabled = true
    servers = ["our-nomad-server:4647"]
    gc_disk_usage_threshold = 95
    artifact {
        decompression_file_count_limit = 0
    }
}

server {
    enabled = false
}

plugin "raw_exec" {
    config {
        enabled = true
    }
}

Start Nomad on the client.

Expected Result

The client shows up in the Nomad host's clients list.

Actual Result

The connection is shut down.

Nomad Server logs (if appropriate)

2024-11-01T15:12:30.471+0100 [ERROR] nomad.rpc: failed to read first RPC byte: error="read tcp LOCAL_HOST_IP:4647->EXTERNAL_CLIENT_IP:43594: wsarecv: An existing connection was forcibly closed by the remote host."
2024-11-01T15:12:48.867+0100 [ERROR] nomad.rpc: multiplex_v2 conn accept failed: error="read tcp LOCAL_HOST_IP:4647->EXTERNAL_CLIENT_IP:31314: wsarecv: An existing connection was forcibly closed by the remote host."

Nomad Client logs (if appropriate)

    2024-11-01T15:20:48.660+0100 [ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.Register server=EXTERNAL_HOST_IP:4647
    2024-11-01T15:20:48.661+0100 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: EOF" rpc=Node.Register server=EXTERNAL_HOST_IP:4647
    2024-11-01T15:20:48.662+0100 [ERROR] client: error registering: error="rpc error: EOF"

tgross commented 2 weeks ago

Hi @a-bangk! The output you're seeing there is what I'd expect to see in the event that network connectivity was lost when the client has made an initial connection and the server is trying to figure out what kind of connection it is (TLS vs non-TLS, and Raft vs RPC). The client should retry after 15s. You may need to take a look at your network environment or TLS configuration.
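
To illustrate the "figure out what kind of connection it is" step described above, here is a minimal Python sketch of a server classifying a connection by its first byte. It is not Nomad's implementation: the `KINDS` table and its byte values are hypothetical, as is the function name `classify_first_byte`. The point is only that if the peer resets or closes the connection before that first byte arrives, the server has nothing to classify, which is the situation the "failed to read first RPC byte" log line reports.

```python
import socket

# Hypothetical one-byte tags, chosen for illustration only.
# They are NOT taken from Nomad's source code.
KINDS = {0x01: "rpc", 0x02: "raft", 0x03: "tls"}


def classify_first_byte(conn: socket.socket) -> str:
    """Read a single byte from conn and map it to a connection kind."""
    try:
        first = conn.recv(1)
    except ConnectionResetError:
        # The peer (or something in between) reset the connection
        # before any data arrived.
        return "reset"
    if not first:
        # Orderly close with no data sent.
        return "closed"
    return KINDS.get(first[0], "unknown")
```

A result of "reset" or "closed" in a sketch like this corresponds to the connection dying before the protocol has even started, which points at the network path or TLS setup rather than at the RPC itself.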

a-bangk commented 2 weeks ago

Hi @tgross, thanks for looking at it. You’re correct: the client keeps retrying on different ports. We have Nomad connecting for 3 out of 4 customers (all Windows Server), but we are stumped as to what is different about the one where the connection fails. Windows Firewall allows the traffic through, so now I’ll dig into the TLS config. Further pointers to isolate the block would be greatly appreciated.

tgross commented 2 weeks ago

@a-bangk having only one node fail but fail reliably sounds like a reachability issue for that node. But I'm not much of a Windows networking administrator, so I don't have much advice for you on that front.
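
One way to start narrowing down a reachability issue like this is a plain TCP probe run from the failing client. The following is a minimal sketch using only Python's standard library; the function name `probe_tcp` and the 3-second default timeout are assumptions made for illustration. The host and port in the usage note come from the `servers` entry in the client config above.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt a TCP connection and report how the attempt ended."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Something answered, but nothing is listening on that port.
        return "refused"
    except socket.timeout:
        # No answer at all, often a firewall silently dropping packets.
        return "timeout"
    except OSError as exc:
        # DNS failure, unreachable network, reset during connect, etc.
        return f"error: {exc}"
```

Running `probe_tcp("our-nomad-server", 4647)` on the affected client and on one of the working clients would show whether the two differ at the TCP level. An "open" result does not rule out a problem, since the logs above show connections that are accepted and then reset, so a device on the path that terminates the session after the handshake would still pass this check.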