hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

node critical. Synced check and then report HTTP request failed: Get /dev/null: unsupported protocol scheme #17809

Open sdvdxl opened 1 year ago

sdvdxl commented 1 year ago

Overview of the Issue

Reproduction Steps

  1. Initialize a Docker Swarm cluster.

  2. Create a Docker stack with the following docker-compose file:

```yaml
version: '3.8'

services:
  consul:
    hostname: consul
    image: "harbor.hekr.me/iotos/consul:1.15.3"
    deploy:
      replicas: 1
      placement:
        max_replicas_per_node: 1
      constraints: [node.role == manager]
    ports:
      - "8500:8500"
      - "8300:8300"
      - "8301:8301"
      - "8302:8302"
      - "8600:8600"
    volumes:
      - consulData:/consul/data
    networks:
      iot-os-network:
        #ipv4_address: 172.20.0.2
    command: agent -server -bootstrap-expect 1 -ui -bind '{{ GetPrivateInterfaces | include "network" "172.20.0.0/24" | attr "address" }}' -client=0.0.0.0

networks:
  iot-os-network:
    ipam:
      config:
        - subnet: 172.20.0.0/24

volumes:
  consulData:
  mongoData:
  redisData:
  minioData:
  clickhouseData:
  logsData:
  driversData:
  confData:
  mysqlData:
  zookeeperData:
  ibosData:
```
  3. Let the stack run for several days.

  4. The logs then show:

    Synced check "2R9qN31gaZdi9fySX8RiWD4ujhS" 2023/06/13 16:10:12 [WARN] agent: Check "2R9qN31gaZdi9fySX8RiWD4ujhS" HTTP request failed: Get /dev/null: unsupported protocol scheme ""

  5. To recover, the check has to be deregistered manually with `curl -X PUT http://127.0.0.1:8500/v1/agent/check/deregister/2R9qN31gaZdi9fySX8RiWD4ujhS` (see the sketch after this list).
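A minimal sketch of that recovery step, assuming the agent's HTTP API is reachable on 127.0.0.1:8500 and that the check ID is the one from the log above (it will differ in other environments):

```bash
# Inspect the checks currently registered on the local agent, to confirm the
# orphaned check and its failing output are present.
curl -s http://127.0.0.1:8500/v1/agent/checks

# Deregister the failing check by its ID; the agent stops executing it immediately.
curl -s -X PUT \
  http://127.0.0.1:8500/v1/agent/check/deregister/2R9qN31gaZdi9fySX8RiWD4ujhS
```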

Consul info for both Client and Server

Client info

```
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 1
    services = 2
build:
    prerelease =
    revision = 7ce982ce
    version = 1.15.3
    version_metadata =
consul:
    acl = disabled
    bootstrap = true
    known_datacenters = 1
    leader = true
    leader_addr = 172.20.0.38:8300
    server = true
raft:
    applied_index = 2792157
    commit_index = 2792157
    fsm_pending = 0
    last_contact = 0
    last_log_index = 2792157
    last_log_term = 6
    last_snapshot_index = 2785461
    last_snapshot_term = 6
    latest_configuration = [{Suffrage:Voter ID:0ebd7757-8fe9-9bae-b624-2e21a087c6c2 Address:172.20.0.38:8300}]
    latest_configuration_index = 0
    num_peers = 0
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Leader
    term = 6
runtime:
    arch = amd64
    cpu_count = 8
    goroutines = 157
    max_procs = 8
    os = linux
    version = go1.20.4
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 1
    event_time = 6
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 1
    members = 1
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 1
    members = 1
    query_queue = 0
    query_time = 1
```

Client agent HCL config
Server info

```
agent -server -bootstrap-expect 1 -ui -bind '{{ GetPrivateInterfaces | include "network" "172.20.0.0/24" | attr "address" }}' -client=0.0.0.0
```

Operating system and Environment details

docker info

```
Client:
 Context: default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Docker Buildx (Docker Inc., v0.7.1-docker)
  scan: Docker Scan (Docker Inc., v0.12.0)

Server:
 Containers: 42
  Running: 15
  Paused: 0
  Stopped: 27
 Images: 35
 Server Version: 20.10.12
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: error
  NodeID: m739hilgnx2hjv9a9jylyjisi
  Is Manager: true
  Node Address: 211.66.32.176
  Manager Addresses:
   211.66.32.176:2377
 Runtimes: runc io.containerd.runc.v2 io.containerd.runtime.v1.linux
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7b11cfaabd73bb80907dd23182b9347b4245eb5d
 runc version: b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 3.10.0-1062.el7.x86_64
 Operating System: CentOS Linux 7 (Core)
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 62.74GiB
 Name: gzic-lsjnglpt-2
 ID: N6FK:ZIYH:XWFU:FFQE:SZZI:GQGM:RSB5:HAIY:XHVZ:SFTY:H3SW:TJKD
 Docker Root Dir: /data/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Registry Mirrors:
  https://xxxxx.mirror.aliyuncs.com/
  https://xxxx.mirror.swr.myhuaweicloud.com/
 Live Restore Enabled: false
```

os info

```
Linux gzic-lsjnglpt-2 3.10.0-1062.el7.x86_64 #1 SMP Wed Aug 7 18:08:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
```

Log Fragments

(screenshots of the failing check and agent logs were attached to the original issue)

huikang commented 1 year ago

@sdvdxl, thanks for reporting. I noticed that the following command recovers from the issue:

> To recover, the check has to be deregistered manually: `curl -X PUT http://127.0.0.1:8500/v1/agent/check/deregister/2R9qN31gaZdi9fySX8RiWD4ujhS`

Could you help clarify how the check 2R9qN31gaZdi9fySX8RiWD4ujhS is defined? The screenshot shows its Node value is "consul", but its ServiceName is "".
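For reference, the agent's record of such a check can be dumped from the local checks endpoint; a minimal sketch, assuming `jq` is installed and the HTTP API is on 127.0.0.1:8500 (the check ID is the one from this report):

```bash
# Print the agent's record of the suspicious check, including its Name,
# ServiceID/ServiceName, Status and last Output (the /dev/null error here).
curl -s http://127.0.0.1:8500/v1/agent/checks \
  | jq '."2R9qN31gaZdi9fySX8RiWD4ujhS"'
```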

sdvdxl commented 1 year ago

I don't know where it came from; the only service I register myself is iot-xx. After this check is deregistered, a similar one may appear again after a while, also without a ServiceName, reporting the same error.

phil-lavin commented 1 year ago

We have just seen this failure across over 100 nodes. There's a failing health check across all of them called 2Rxye2uPfKB1LhGyfsmDR4n3Rdy. We don't know where this came from - it just appeared today. It isn't present on non-failing nodes. fwiw, all of the failing nodes run Nomad.


De-registering the check on affected nodes recovers them: `curl --request PUT "http://${CONSUL_HTTP_ADDR}/v1/agent/check/deregister/2Rxye2uPfKB1LhGyfsmDR4n3Rdy"`
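A rough sketch of sweeping that across many agents, assuming each node exposes the Consul HTTP API on port 8500 and that `nodes.txt` (a hypothetical file) lists one node address per line; the check ID is the one we observed and will differ elsewhere:

```bash
#!/usr/bin/env bash
# Hypothetical sweep: deregister the rogue check on every node listed in nodes.txt.
set -euo pipefail

CHECK_ID="2Rxye2uPfKB1LhGyfsmDR4n3Rdy"

while read -r node; do
  # Only send the deregister call if this agent actually has the check registered.
  if curl -sf "http://${node}:8500/v1/agent/checks" | grep -q "${CHECK_ID}"; then
    curl -sf -X PUT "http://${node}:8500/v1/agent/check/deregister/${CHECK_ID}"
    echo "deregistered ${CHECK_ID} on ${node}"
  fi
done < nodes.txt
```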

phil-lavin commented 1 year ago

We are starting to think this is the result of a 'security' scanner looking for CVE-2022-29153, very probably the Nuclei scanner: https://github.com/projectdiscovery/nuclei-templates/pull/6488. The signature of the bad check that gets created is exactly consistent with the above-mentioned PR.

Issue raised on the nuclei-templates repo: https://github.com/projectdiscovery/nuclei-templates/issues/7595

phil-lavin commented 1 year ago

Confirmed with our security folks that this was a Nuclei scan being conducted against our infrastructure from a box inside the network. If others are seeing this erroneous /dev/null check, ensure you don't have Nuclei running inside your network, and ensure that your Consul agents are not directly accessible from the public Internet, as this may be the result of a malicious third party scanning your infrastructure.
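As a rough way to spot affected agents, the bogus check can be picked out by the /dev/null error in its output; a sketch assuming `jq` is installed and the local agent's HTTP API is on 127.0.0.1:8500:

```bash
# List any checks whose last output mentions the /dev/null error left behind by
# the scanner, so they can be reviewed and deregistered.
curl -s http://127.0.0.1:8500/v1/agent/checks \
  | jq 'to_entries
        | map(select(.value.Output | test("/dev/null")))
        | map({CheckID: .key, ServiceName: .value.ServiceName, Output: .value.Output})'
```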

Nuclei have pushed a fix to make the test more sane and also mark it as intrusive: https://github.com/projectdiscovery/nuclei-templates/pull/7597