hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Consul DNS query is returning the nodes that are in maintenance mode #3268

Open phanidileep opened 7 years ago

phanidileep commented 7 years ago

If you have a question, please direct it to the consul mailing list if it hasn't been addressed in either the FAQ or in one of the Consul Guides.

When filing a bug, please include the following:

consul version for both Client and Server

Client: 0.7.4 Server: 0.7.4

consul info for both Client and Server

Client:

agent:
        check_monitors = 0
        check_ttls = 0
        checks = 1
        services = 0
build:
        prerelease =
        revision = '1c442cb
        version = 0.7.4
consul:
        known_servers = 5
        server = false
runtime:
        arch = amd64
        cpu_count = 72
        goroutines = 36
        max_procs = 72
        os = linux
        version = go1.7.5
serf_lan:
        encrypted = false
        event_queue = 0
        event_time = 4291
        failed = 34
        health_score = 0
        intent_queue = 0
        left = 2
        member_time = 192908
        members = 94
        query_queue = 0
        query_time = 5876

Server:

Same build as Client

Operating system and Environment details

Linux 3.10.0-514.10.2.el7.x86_64

Description of the Issue (and unexpected/desired result)

Based on the document https://www.consul.io/docs/commands/maint.html, nodes that are set in maintenance mode should not appear in DNS queries, but this does not seem to be working as expected.

I am still able to ping the node after putting it in maintenance mode.

Reproduction steps

sh-4.2# consul maint -enable
sh-4.2# consul maint
Node:
Name: toolkit-d08wh
Reason: Maintenance mode is enabled for this node, but no reason was provided. This is a default message.

sh-4.2# ping toolkit-d08wh.node.dc1.com
PING toolkit-d08wh.node.dc1.com (10.0.33.23) 56(84) bytes of data.
64 bytes from toolkit-d08wh (10.0.33.23): icmp_seq=1 ttl=64 time=0.017 ms
64 bytes from toolkit-d08wh (10.0.33.23): icmp_seq=2 ttl=64 time=0.021 ms
64 bytes from toolkit-d08wh (10.0.33.23): icmp_seq=3 ttl=64 time=0.019 ms
64 bytes from toolkit-d08wh (10.0.33.23): icmp_seq=4 ttl=64 time=0.027 ms

Appreciate your time and suggestions.

hbmelachuru10 commented 7 years ago

+1

kamaradclimber commented 7 years ago

We see this issue in 0.8.5 as well.

I think it goes deeper than the DNS query. For instance:

curl -v -XPUT http://localhost:8500/v1/agent/service/maintenance/consul-agent-http?enable=true

returns 200 OK

curl -v http://localhost:8500/v1/agent/checks correctly returns the health check related to maintenance:

{"_service_maintenance:consul-agent-http":
{"Node":"consul01-par.central.criteo.preprod","CheckID":"_service_maintenance:consul-agent-http","Name":"Service Maintenance Mode","Status":"critical","Notes":"Maintenance mode is enabled for this service, but no reason was provided. This is a default message.","Output":"","ServiceID":"consul-agent-http","ServiceName":"consul-agent-http","ServiceTags":[],"CreateIndex":0,"ModifyIndex":0}}

but curl -v http://localhost:8500/v1/health/checks/consul-agent-http returns 200 OK with body [] (in this scenario consul-agent-http has no health check defined)

I would have expected to see the same health check, called _service_maintenance:consul-agent-http.

The same behavior can be reproduced with any service that has a health check defined.
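
The mismatch described above can be checked mechanically by comparing the two responses. The following is a minimal sketch, not part of Consul: `missing_from_catalog` is a hypothetical helper name, and the sample payloads are trimmed copies of the responses quoted in this comment. It relies on /v1/agent/checks returning an object keyed by check ID and /v1/health/checks/<service> returning a list of check objects.

```python
def missing_from_catalog(agent_checks, catalog_checks):
    """Return check IDs that the agent reports but the health endpoint lacks.

    agent_checks:   parsed body of /v1/agent/checks (dict keyed by check ID)
    catalog_checks: parsed body of /v1/health/checks/<service> (list of checks)
    """
    catalog_ids = {check["CheckID"] for check in catalog_checks}
    return sorted(cid for cid in agent_checks if cid not in catalog_ids)


# Trimmed sample data taken from the responses quoted above.
agent_checks = {
    "_service_maintenance:consul-agent-http": {
        "CheckID": "_service_maintenance:consul-agent-http",
        "Status": "critical",
        "ServiceName": "consul-agent-http",
    }
}
catalog_checks = []  # body returned by /v1/health/checks/consul-agent-http

print(missing_from_catalog(agent_checks, catalog_checks))
```

With the data from this report, the maintenance check is the one listed as missing.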

slackpad commented 7 years ago

Hi @phanidileep, in your example you are doing a node query (toolkit-d08wh.node.dc1.com), which isn't affected by maintenance mode. We should make the documentation clearer, but maintenance mode prevents that node from being returned in any service queries, since that's where the health check filtering is applied. If toolkit-d08wh.node.dc1.com was running an instance of the foo service, then toolkit-d08wh.node.dc1.com would never show up in a query for foo.service.dc1.com. If you just ask for a node directly, then it will be returned regardless of its health status.
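
The difference between the two query types can be modeled in a few lines. This is only an illustrative sketch, not Consul's implementation: the `Instance` type, both lookup functions, and the second node (`other-node`, 10.0.33.24) are hypothetical. It assumes that a service lookup drops any instance with a critical check, that node maintenance shows up as a critical check with the ID `_node_maintenance`, and that a node lookup never consults checks.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Hypothetical record: one service instance running on one node."""
    node: str
    address: str
    service: str
    # Maps check ID -> status ("passing", "warning", or "critical").
    checks: dict = field(default_factory=dict)


def service_lookup(catalog, service):
    """Model of <service>.service.<domain>: skip instances with a critical check."""
    return [
        inst.address
        for inst in catalog
        if inst.service == service and "critical" not in inst.checks.values()
    ]


def node_lookup(catalog, node):
    """Model of <node>.node.<domain>: health checks are not consulted."""
    return sorted({inst.address for inst in catalog if inst.node == node})


catalog = [
    # Node in maintenance mode: carries a critical _node_maintenance check.
    Instance("toolkit-d08wh", "10.0.33.23", "foo",
             {"serfHealth": "passing", "_node_maintenance": "critical"}),
    # Hypothetical healthy node running the same service.
    Instance("other-node", "10.0.33.24", "foo", {"serfHealth": "passing"}),
]

print(service_lookup(catalog, "foo"))          # maintenance node is filtered out
print(node_lookup(catalog, "toolkit-d08wh"))   # maintenance node is still returned
```

In this model the node under maintenance disappears from the `foo` service results but is still resolvable by name, which matches the ping output in the original report.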

phanidileep commented 7 years ago

@slackpad Thanks for the clarification. Can you share the list of statuses in Consul? I will have to check how the maintenance mode status is handled in telemetry, e.g. https://github.com/influxdata/telegraf/tree/master/plugins/inputs/consul

kamaradclimber commented 6 years ago

@slackpad I can still see this issue on consul 0.9.3. Is this expected?

slackpad commented 6 years ago

@kamaradclimber I think this is a documentation issue, but not an actual code issue. If you ask for a node directly it doesn't factor in the health (or maintenance status) of the node. That is only considered when you are looking up a service over DNS.

rockpapergoat commented 6 years ago

i see this same behavior via the catalog endpoint, too.

enable maintenance for a service or node, then searching the catalog for the service includes the node in results. is this intended? is the dns endpoint the only one that doesn't include nodes or services in maintenance mode? i'm seeing this with consul 1.0.2 agents and accessing the various APIs via the diplomat gem and curl.

EDIT: this could probably use some clarification on the docs for the catalog and other endpoints. maybe just explicitly state in docs for each endpoint whether they respect health status. i didn't realize just the dns and health endpoints reflect health state.

though, just a quick test of two lookups for a service that isn't failing shows this, which also feels wrong:

curl -s "$CONSUL_HTTP_ADDR/v1/health/service/foo?passing=true"| jq '.[] | .Checks[0].Status'
"passing"
"passing"
curl -s "$CONSUL_HTTP_ADDR/v1/health/service/foo?passing=false"| jq '.[] | .Checks[0].Status'
"passing"
"passing"

this message seems to indicate that including the "passing" parameter implies either results for nodes/services with non-critical statuses or no defined check in any state. that is also a little confusing.
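
one thing worth noting about the jq expression above: `.Checks[0].Status` only shows the first check of each entry, so an instance can print "passing" while another of its checks is critical. a rough sketch of looking at every check instead is below. the helper names and the sample entries are made up for illustration, and treating an entry with no checks as passing (and an unknown status as critical) is an assumption, not documented behavior.

```python
# Severity ranking used to pick the worst status of an entry.
SEVERITY = {"passing": 0, "warning": 1, "critical": 2}


def instance_status(entry):
    """Worst status across all checks of one /v1/health/service entry."""
    statuses = [check["Status"] for check in entry.get("Checks", [])]
    if not statuses:
        return "passing"  # assumption: no checks is treated as passing
    # Assumption: statuses not listed in SEVERITY rank as critical.
    return max(statuses, key=lambda s: SEVERITY.get(s, 2))


def non_passing(entries):
    """Service IDs of entries whose worst check status is not passing."""
    return [e["Service"]["ID"] for e in entries if instance_status(e) != "passing"]


# Made-up sample shaped like an unfiltered /v1/health/service/foo response.
entries = [
    {"Service": {"ID": "foo-1"},
     "Checks": [{"CheckID": "serfHealth", "Status": "passing"},
                {"CheckID": "_service_maintenance:foo-1", "Status": "critical"}]},
    {"Service": {"ID": "foo-2"},
     "Checks": [{"CheckID": "serfHealth", "Status": "passing"}]},
]

print([e["Checks"][0]["Status"] for e in entries])  # what .Checks[0].Status shows
print(non_passing(entries))                         # what checking every status shows
```

in the sample, both entries look fine when only the first check is inspected, but foo-1 is reported once all of its checks are considered.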