hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Consul watch fires for no reason #5044

Open dbabits opened 5 years ago

dbabits commented 5 years ago

Overview of the Issue:

# this is run against every agent in the cluster (3 server agents + 15 client agents)
cat <<EOF | curl -s --fail --request PUT --data @- http://$host:8500/v1/agent/service/register
    {
        "id": "NodeManager",
        "name": "NodeManager",
        "check": {
            "id": "NodeManager",
            "name": "NodeManager",
            "http": "http://localhost:8042",
            "tls_skip_verify": true,
            "method": "GET",
            "interval": "60s",
            "status": "passing",
            "timeout": "5s",
        }
    }
EOF
cat <<EOF > $CONSUL_CONF_DIR/watches.json
    {
        "watches": [
            {
            "type": "checks",
            "state":"critical",
            "handler_type": "script",
            "args": ["/path/to/watch/script"]
            }
        ]
    }
EOF
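
To confirm on an agent that the service and its check were registered, the local checks endpoint can be queried (assuming the default HTTP port and jq installed):

# list all checks known to the local agent, including the NodeManager check
curl -s http://localhost:8500/v1/agent/checks | jq .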

These are the times watch script fired, grouped by the hour:

sqlite> select node, count(*),group_concat(datetime(ts,'unixepoch','localtime'),', ') as watch_fired from consul_svc_handler where service in ('NodeManager') and node in ('d341369-010.foo.com') GROUP BY (ts/(60*60));
node                   count(*)    watch_fired
---------------------  ----------  ---------------------------------------------------------------------------------------------
d341369-010.foo.com  5           2018-11-30 14:00:11, 2018-11-30 14:00:16, 2018-11-30 14:01:08, 2018-11-30 14:01:16, 2018-11-30 14:01:29
d341369-010.foo.com  2           2018-11-30 16:00:26, 2018-11-30 16:01:26
d341369-010.foo.com  1           2018-11-30 17:25:08
d341369-010.foo.com  2           2018-11-30 18:01:01, 2018-11-30 18:02:00
d341369-010.foo.com  2           2018-11-30 22:00:30, 2018-11-30 22:01:30
d341369-010.foo.com  2           2018-12-01 00:00:24, 2018-12-01 00:01:24
d341369-010.foo.com  2           2018-12-01 03:00:08, 2018-12-01 03:01:08
d341369-010.foo.com  2           2018-12-01 10:00:20, 2018-12-01 10:01:18
d341369-010.foo.com  2           2018-12-01 16:00:21, 2018-12-01 16:01:16
d341369-010.foo.com  6           2018-12-01 18:00:10, 2018-12-01 18:00:10, 2018-12-01 18:00:24, 2018-12-01 18:00:24, 2018-12-01 18:01:10, 2018-12-01 18:01:24
d341369-010.foo.com  10          2018-12-02 02:00:18, 2018-12-02 02:00:18, 2018-12-02 02:00:18, 2018-12-02 02:04:37, 2018-12-02 02:15:46, 2018-12-02 02:16:22, 2018-12-02 02:16:53, 2018-12-02 02:16:53, 2018-12-02 02:16:54, 2018-12-02 02:16:54

Reproduction Steps

Steps to reproduce this issue:

  1. Use a cluster with 15 client nodes and 3 server nodes.
  2. Run the curl registration shown above, checking a port that returns HTTP 401 or simply is not open.
  3. Observe the frequency of watch firings.

Consul info for both Client and Server

Client info
```
> consul info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease =
	revision = 0bddfa23
	version = 1.4.0
consul:
	acl = disabled
	known_servers = 3
	server = false
runtime:
	arch = amd64
	cpu_count = 8
	goroutines = 58
	max_procs = 8
	os = linux
	version = go1.11.1
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 2
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 21
	members = 18
	query_queue = 0
	query_time = 1
```
Server info
```
> consul info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease =
	revision = 0bddfa23
	version = 1.4.0
consul:
	acl = disabled
	bootstrap = false
	known_datacenters = 1
	leader = false
	leader_addr = 10.205.107.241:8300
	server = true
raft:
	applied_index = 54009
	commit_index = 54009
	fsm_pending = 0
	last_contact = 61.504669ms
	last_log_index = 54009
	last_log_term = 2
	last_snapshot_index = 49164
	last_snapshot_term = 2
	latest_configuration = [{Suffrage:Voter ID:8ee38079-0101-bb7c-a399-241bf5021d3a Address:10.205.107.241:8300} {Suffrage:Voter ID:0ce7581a-25bd-f565-d423-83aa19d1cf74 Address:10.52.81.5:8300} {Suffrage:Voter ID:53ee0a39-fa7c-d9c3-a181-d087a76b85a4 Address:10.52.45.230:8300}]
	latest_configuration_index = 68
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 2
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 166
	max_procs = 2
	os = linux
	version = go1.11.1
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 2
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 21
	members = 18
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 5
	members = 3
	query_queue = 0
	query_time = 1
```

Operating system and Environment details

uname -a Linux d129668-005.foo.com 2.6.32-431.66.1.el6.x86_64 #1 SMP Fri Oct 2 13:15:53 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux

Log Fragments

As you can see here, the watch fires 4 times in the same second, but then does not fire when it should:

> egrep 'watch handler|is now critical' /local/data/scratch/logs/consul_debug/1543737048/consul.log
2018/12/02 02:16:53 [DEBUG] agent: watch handler '[/local/data/scratch/logs/new_monitor.sh -c consul_svc_handler -o d341369-006.foo.com]' output: is_sourced=false
2018/12/02 02:16:53 [DEBUG] agent: watch handler '[/local/data/scratch/logs/new_monitor.sh -c consul_svc_handler -o d341369-006.foo.com]' output: is_sourced=false
2018/12/02 02:16:53 [DEBUG] agent: watch handler '[/local/data/scratch/logs/new_monitor.sh -c consul_svc_handler -o d341369-006.foo.com]' output: is_sourced=false
2018/12/02 02:16:54 [DEBUG] agent: watch handler '[/local/data/scratch/logs/new_monitor.sh -c consul_svc_handler -o d341369-006.foo.com]' output: is_sourced=false
2018/12/02 02:17:14 [WARN] agent: Check "service:NodeManager" is now critical
2018/12/02 02:18:14 [WARN] agent: Check "service:NodeManager" is now critical
2018/12/02 02:19:14 [WARN] agent: Check "service:NodeManager" is now critical
2018/12/02 02:20:14 [WARN] agent: Check "service:NodeManager" is now critical
2018/12/02 02:21:14 [WARN] agent: Check "service:NodeManager" is now critical
2018/12/02 02:22:14 [WARN] agent: Check "service:NodeManager" is now critical
2018/12/02 02:23:14 [WARN] agent: Check "service:NodeManager" is now critical
2018/12/02 02:24:14 [WARN] agent: Check "service:NodeManager" is now critical
2018/12/02 02:25:14 [WARN] agent: Check "service:NodeManager" is now critical
2018/12/02 02:26:14 [WARN] agent: Check "service:NodeManager" is now critical
2018/12/02 02:27:14 [WARN] agent: Check "service:NodeManager" is now critical
2018/12/02 02:28:14 [WARN] agent: Check "service:NodeManager" is now critical
2018/12/02 02:29:14 [WARN] agent: Check "service:NodeManager" is now critical
2018/12/02 02:30:14 [WARN] agent: Check "service:NodeManager" is now critical
2018/12/02 02:31:14 [WARN] agent: Check "service:NodeManager" is now critical
2018/12/02 02:32:14 [WARN] agent: Check "service:NodeManager" is now critical
2018/12/02 02:33:14 [WARN] agent: Check "service:NodeManager" is now critical
2018/12/02 02:34:14 [WARN] agent: Check "service:NodeManager" is now critical
2018/12/02 02:35:14 [WARN] agent: Check "service:NodeManager" is now critical
2018/12/02 02:36:14 [WARN] agent: Check "service:NodeManager" is now critical
2018/12/02 02:37:14 [WARN] agent: Check "service:NodeManager" is now critical
2018/12/02 02:38:14 [WARN] agent: Check "service:NodeManager" is now critical
2018/12/02 02:39:14 [WARN] agent: Check "service:NodeManager" is now critical
2018/12/02 02:40:14 [WARN] agent: Check "service:NodeManager" is now critical
2018/12/02 02:41:14 [WARN] agent: Check "service:NodeManager" is now critical
2018/12/02 02:42:14 [WARN] agent: Check "service:NodeManager" is now critical
2018/12/02 02:43:14 [WARN] agent: Check "service:NodeManager" is now critical
2018/12/02 02:44:14 [WARN] agent: Check "service:NodeManager" is now critical
2018/12/02 02:45:14 [WARN] agent: Check "service:NodeManager" is now critical
2018/12/02 02:46:14 [WARN] agent: Check "service:NodeManager" is now critical
2018/12/02 02:47:14 [WARN] agent: Check "service:NodeManager" is now critical
2018/12/02 02:48:14 [WARN] agent: Check "service:NodeManager" is now critical
2018/12/02 02:49:14 [WARN] agent: Check "service:NodeManager" is now critical
2018/12/02 02:50:14 [WARN] agent: Check "service:NodeManager" is now critical
2018/12/02 02:51:14 [WARN] agent: Check "service:NodeManager" is now critical
pierresouchay commented 5 years ago

@dbabits Why would it fire every 60s? AFAIK it should fire every time the state changes instead.

dbabits commented 5 years ago

@pierresouchay, thanks. Did I misread the docs? I thought the watch would fire every interval for as long as the state is critical. I just checked the docs again, and it's not very obvious.

But anyway - why does it fire that often then? The watch fired 36 times in the SQL output I showed, 4 of which show up in the log snippet. The state of the service never changed - it's always critical.

pierresouchay commented 5 years ago

@dbabits it is more complicated than that: you can be sure that when the service status flaps, you'll get the notification. When the output of a check changes, the output is not propagated unless the state changes as well. However, if the output is modified each time, even for a non-flapping service, the output will eventually be propagated at anti-entropy time (which depends on the size of the cluster).

So, to summarize: if the output of a critical service does not change, you'll get the notification once; otherwise, you may get other notifications regularly.

You might have a look at https://github.com/hashicorp/consul/issues/4960 for more info

pierresouchay commented 5 years ago

Try using the script in this comment: https://github.com/hashicorp/consul/issues/4960#issuecomment-440401193

In any case, I don't think you are using the right approach if you need to be called every minute while your check is critical; if it is only for changes, it might work, however.

dbabits commented 5 years ago

I am OK having the watch run for changes only, so let's focus on this. I am running your debugging script now and will let you know the results. Thx

pierresouchay commented 5 years ago

Yes, this is normal: a service changes its state if any of its checks is modified OR if any agent check (e.g. a check linked to the agent) changes. The reason for this: if the agent is down, your health checks cannot be updated.

All of this is normal and expected. The logs of the script are consistent: updates every minute or when needed, which proves you are using Consul 1.3.0+.

To me, the output is normal, I would say: "not a bug" :)

pierresouchay commented 5 years ago

@dbabits The index is actually not changing: X-Consul-Index: 223 is always 223 -> no change. As you might see, each call is separated by around 10 minutes - which is exactly the maximum time a blocking query will block on Consul (+ 1/16th of 10 minutes of jitter).

So, basically, it fired only once as a change and never changed after that. I am not very familiar with the watch handler since I am not using it, but in the cluster the script shows there are no actual changes.

pierresouchay commented 5 years ago

@dbabits Ok, you are called too often, but you might use the CONSUL_INDEX environment variable to see if something actually did change (and in that case it will not - it will always have the same value, 223); see https://www.consul.io/docs/agent/watches.html#executable

dbabits commented 5 years ago

@pierresouchay So you suggest a workaround where I would keep track of CONSUL_INDEX myself and discard erroneous calls? I will try, but I'm not sure it will work, because your script shows that the index does not change, like you noticed ("223 is always 223 -> no change"). Yet, calls are made.

pierresouchay commented 5 years ago

@dbabits Watch notifies when a change occurs or every X minutes (basically when the blocking query ends), i.e. 10 minutes.

As long as there is no change, you might use the Consul index to discriminate changes, for instance: INSERT INTO db(idx, state) VALUES ("myservice-" + CONSUL_INDEX, the state)

Thus, it would not insert the duplicated index.
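
For illustration, a minimal de-duplicating handler along these lines (a sketch; the state file path is an assumption, and CONSUL_INDEX is the variable documented for script handlers):

#!/usr/bin/env bash
# Skip the handler's real work when the index has not moved since the last invocation.
STATE_FILE=/var/tmp/nodemanager_watch.index   # assumed location

last_index=$(cat "$STATE_FILE" 2>/dev/null || echo "")
if [ -n "$CONSUL_INDEX" ] && [ "$CONSUL_INDEX" = "$last_index" ]; then
    cat > /dev/null   # drain the JSON payload from stdin
    exit 0            # same index as last time: treat as a duplicate notification
fi
echo "$CONSUL_INDEX" > "$STATE_FILE"

cat > /dev/null       # payload not needed here; real work (email, restart, ...) goes below
# send_alert.sh ...   # hypothetical alerting hook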

dbabits commented 5 years ago

@pierresouchay , Do you agree that this is a bug in either design or implementation, and offer a workaround, or do you believe everything is as it should be?

pierresouchay commented 5 years ago

@dbabits Well, it is not really a bug, since watches are defined as: you are notified when there are changes, but you might also be notified when no changes happened (which is the case here, as the Consul-Index does not change).

In that case, you are notified every 10 minutes even when nothing changes, so, basically, it is not really a bug.

It would be possible to enhance "watch" so that you are notified mostly when there are real changes (I mean, not every 10 minutes), but even then you could not be sure to be notified ONLY when changes occur. An example: when the Consul agent restarts, it cannot know what index you received last time, so in case of an agent restart you will be notified anyway; being able to deal with that is a requirement in any case.

pierresouchay commented 5 years ago

OK, I finally understood: I did lots of optimizations on services, but you are basically watching all check changes. So basically, any time a change occurs on any kind of check, your script will be fired...

So, most probably, that's because you are watching checks, not services.

{
        "watches": [
            {
            "type": "service",
            "service":"YOUR_SERVICE",
            "handler_type": "script",
            "args": ["your_script"]
            }
        ]
}

will probably have a result closer to what you expect... but if it continues, there is probably a bug

dbabits commented 5 years ago

How can I use this definition? It doesn't have a URL or port to check. My goal is: when an HTTP (or TCP) call fails - invoke a handler. Once.


banks commented 5 years ago

Finally, how to achieve what I'm actually trying - when a check fails - invoke a workflow once.

Hey, I think you already reached this conclusion, but the blocking query implementation - which is what watching is based on - specifically doesn't make the guarantee you want, as it's very expensive to do in general:

A critical note is that the return of a blocking request is no guarantee of a change. It is possible that the timeout was reached or that there was an idempotent write that does not affect the result of the query.

https://www.consul.io/api/index.html#blocking-queries

That said, the watch handler does at least check that the index changed before firing - if you are seeing the same index each time then it is a bug, but seeing an unchanged body is technically not a bug currently.

There are alternative mechanisms we could implement that would make this cleaner and we are considering them for the medium term - like checking a hash of the actual payload to see if it changed. We don't have firm plans for this as we were considering it as a totally new blocking mechanism right through Consul, and that is a big change.

It would be simple, though, to add the same thing just in the watch handler here. However, note that in general you still can't rely on exactly-once delivery for correctness, because that is virtually impossible in distributed systems without a lot more complication (for example, if the watch command is restarted it might redeliver the same result again even with a hash-based check). This is why we've stuck with the current behaviour being "good enough" and leave it up to the consuming system if you need stronger guarantees - i.e. you could write the hash of the result to disk each time and then just skip firing if it matches, which is more reliable in general.
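
As a rough sketch of that hash-on-disk filtering (the state file path is an assumption, not an existing Consul feature):

#!/usr/bin/env bash
# Suppress handler work when the watch payload is byte-for-byte identical to the last one.
STATE_FILE=/var/tmp/nodemanager_watch.sha256   # assumed location

payload=$(cat)                                  # full JSON view delivered on stdin
new_hash=$(printf '%s' "$payload" | sha256sum | awk '{print $1}')
old_hash=$(cat "$STATE_FILE" 2>/dev/null || true)

if [ "$new_hash" = "$old_hash" ]; then
    exit 0                                      # identical view: skip the duplicate notification
fi
echo "$new_hash" > "$STATE_FILE"
# notify / restart / etc. here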

Hope this is useful.

dbabits commented 5 years ago

@banks Thanks, this is very useful. A couple of notes:

  1. I am seeing the same index each time, and the watch is firing, per this comment

  2. I believe checking a hash of the payload may not work, at least in the case of an HTTP check, because headers and/or the body may include a timestamp, for example

  3. And, I understand the technical difficulties you described. Without going into the semantics of what is a bug and what is a feature, leading users to believe the software can do what it was never designed to do is wrong. Consul led me down the wrong path with statements like this:

    The watch command provides a mechanism to watch for changes in a particular data view 
    .. and to invoke a process with the latest values of the view
    Watches are a way of specifying a view of data (e.g. .. health checks) which is
    monitored for updates. When an update is detected, an external handler is invoked. 
    .. 
    As an example, you could watch the status of health checks and notify an external system 
    when a check is critical.
    This means that an update is only triggered when a check transitions from "passing" to "failing" 
    or vice versa. (comparison vs Nagios)

    So, I tried to use watch to implement Consul 101: when something becomes unhealthy - send me an email. Once. I got spammed with emails. You are saying the behaviour is "good enough" - for what use case? The idea that the user should implement all kinds of checks to see if anything has changed, whether that email was already sent, etc., is understandable, but then it's unclear what value Consul brings here, because it's easier to write a loop with a curl;diff;sleep inside and call it a day.

Consul may have many useful features. It seems to me that watch is not one of them, but I could be wrong of course. A real-life use case example would be helpful.

banks commented 5 years ago

In that case, you are notified when nothing changes every 10 minutes, so, basically, it is not really a bug.

I think we have a few confusing things here. Pierre is talking about the HTTP API blocking queries where returning every 10 mins is not a bug. I.e. his script showing you no index changing is not a bug because that is just hitting the API and telling you what gets returned after each request.

Consul's built-in watch command and watch handlers in the config, such as you are using here @dbabits, should not fire if the index doesn't change. See https://github.com/hashicorp/consul/blob/60e2a38067a6e5ab44e02a88b8efbdb69d2a8004/watch/plan.go#L99-L109

We not only check whether the index is unchanged, we also check whether the value is identical. If either is true, it just continues the loop without invoking the handler.

Can you confirm if you see the same index when using Consul watches, not Pierre's script?

My point was that in general exactly-once delivery is very expensive and rarely required (this is a well-known distributed systems thing), so we don't guarantee that. But we certainly shouldn't just trigger constantly with the same index - that is more about cases where there are idempotent updates etc.

Hope that's clearer. We have no intention to be misleading - watch should work "right" except in normally rare edge cases. I'm just pointing out that we intentionally don't pay the cost of tracking lots more state to provide "exactly once" because it's extremely hard and most cases don't need that strength of guarantee, provided they are tolerant of the occasional time when they are notified with no change.

Bug?

The fact that you see the thing trigger 4 times in a second points to something like a batch of updates from an agent doing anti-entropy syncs making multiple updates that all affect the index you are watching on but not the content - this is the "idempotent updates" thing.

Given that watch should be checking not only the index but also whether the actual payload is equal, that should prevent what you see.

I wonder if reflect.DeepEqual started failing since Go added monotonic times - any time.Time value in the response would likely never compare equal any more.

If you have a way to log the actual HTTP payload you see from your watch and show us a few examples of "repeat" payloads, that would be really useful - note that Pierre's script is helpful for looking at the underlying data, but not for debugging the watch package, which I think is the real issue here.
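
One way to capture those payloads for comparison (a sketch - the directory and file naming are assumptions):

#!/usr/bin/env bash
# Capture-only watch handler: write each payload Consul delivers on stdin
# to its own timestamped file so consecutive deliveries can be diffed later.
OUT_DIR=/local/data/scratch/logs/watch_payloads   # assumed location
mkdir -p "$OUT_DIR"
cat > "$OUT_DIR/payload.$(date +%s.%N).json"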

dbabits commented 5 years ago

@banks Hi Paul, Here's some output that I believe demonstrates the problem. The first one is from yesterday (Note the Date in the header), the second is from today. As you see, the index never changed in a day on any of the nodes.

> time curl -s --dump-header /dev/stderr "http://d341369-004:8500/v1/health/service/NodeManager?wait=3s&index=59723&pretty=true" | jq -r '.[].Checks[] |[.Node,.Name,.Status,.ModifyIndex]|@tsv'
HTTP/1.1 200 OK
Content-Type: application/json
Vary: Accept-Encoding
X-Consul-Effective-Consistency: leader
X-Consul-Index: 59723
X-Consul-Knownleader: true
X-Consul-Lastcontact: 0
Date: Thu, 06 Dec 2018 22:17:44 GMT
Transfer-Encoding: chunked

d341369-004   Serf Health Status      passing 19
d341369-004   NodeManager     critical        59720
d341369-005   Serf Health Status      passing 30
d341369-005   NodeManager     critical        43
d341369-006   Serf Health Status      passing 20
d341369-006   NodeManager     critical        49
d341369-007   Serf Health Status      passing 31
d341369-007   NodeManager     critical        52
d341369-008   Serf Health Status      passing 34
d341369-008   NodeManager     critical        59665
d341369-009   Serf Health Status      passing 22
d341369-009   NodeManager     critical        26
d341369-010   Serf Health Status      passing 61
d341369-010   NodeManager     critical        17
d341369-011   Serf Health Status      passing 60
d341369-011   NodeManager     critical        12
d341369-012   Serf Health Status      passing 21
d341369-012   NodeManager     critical        59723

real    0m3.219s

> time curl -s --dump-header /dev/stderr "http://d341369-004:8500/v1/health/service/NodeManager?wait=3s&index=59723&pretty=true" | jq -r '.[].Checks[] |[.Node,.Name,.Status,.ModifyIndex]|@tsv'
HTTP/1.1 200 OK
Content-Type: application/json
Vary: Accept-Encoding
X-Consul-Effective-Consistency: leader
X-Consul-Index: 59723
X-Consul-Knownleader: true
X-Consul-Lastcontact: 0
Date: Fri, 07 Dec 2018 15:48:17 GMT
Transfer-Encoding: chunked

d341369-004  Serf Health Status      passing 19
d341369-004  NodeManager     critical        59720
d341369-005  Serf Health Status      passing 30
d341369-005  NodeManager     critical        43
d341369-006  Serf Health Status      passing 20
d341369-006  NodeManager     critical        49
d341369-007  Serf Health Status      passing 31
d341369-007  NodeManager     critical        52
d341369-008  Serf Health Status      passing 34
d341369-008  NodeManager     critical        59665
d341369-009  Serf Health Status      passing 22
d341369-009  NodeManager     critical        26
d341369-010  Serf Health Status      passing 61
d341369-010  NodeManager     critical        17
d341369-011  Serf Health Status      passing 60
d341369-011  NodeManager     critical        12
d341369-012  Serf Health Status      passing 21
d341369-012  NodeManager     critical        59723

real    0m3.195s

Yet, the watch handler fired this many times, just on one of the nodes, since midnight. It seems that the total number of bytes changes: note "Captured 4096 of 23990 bytes", "of 25362 bytes", etc. Consul truncates the output of the HTTP call to 4K, and they should always be the same, as shown previously.

grep 'watch handler' /local/data/scratch/logs/new_monitor.sh.start_consul_client.log
2018/12/07 02:00:22 [DEBUG] agent: watch handler '[/local/data/scratch/logs/new_monitor.sh -c consul_svc_handler -o d341369-004]' output: Captured 4096 of 23990 bytes
2018/12/07 02:00:22 [DEBUG] agent: watch handler '[/local/data/scratch/logs/new_monitor.sh -c consul_svc_handler -o d341369-004]' output: Captured 4096 of 25362 bytes
2018/12/07 02:01:05 [DEBUG] agent: watch handler '[/local/data/scratch/logs/new_monitor.sh -c consul_svc_handler -o d341369-004]' output: Captured 4096 of 26048 bytes
2018/12/07 02:01:14 [DEBUG] agent: watch handler '[/local/data/scratch/logs/new_monitor.sh -c consul_svc_handler -o d341369-004]' output: Captured 4096 of 25362 bytes
2018/12/07 02:01:16 [DEBUG] agent: watch handler '[/local/data/scratch/logs/new_monitor.sh -c consul_svc_handler -o d341369-004]' output: Captured 4096 of 24676 bytes
2018/12/07 02:01:21 [DEBUG] agent: watch handler '[/local/data/scratch/logs/new_monitor.sh -c consul_svc_handler -o d341369-004]' output: Captured 4096 of 23990 bytes
2018/12/07 02:02:05 [DEBUG] agent: watch handler '[/local/data/scratch/logs/new_monitor.sh -c consul_svc_handler -o d341369-004]' output: Captured 4096 of 23304 bytes
2018/12/07 04:00:22 [DEBUG] agent: watch handler '[/local/data/scratch/logs/new_monitor.sh -c consul_svc_handler -o d341369-004]' output: Captured 4096 of 23990 bytes
2018/12/07 04:01:22 [DEBUG] agent: watch handler '[/local/data/scratch/logs/new_monitor.sh -c consul_svc_handler -o d341369-004]' output: Captured 4096 of 23304 bytes
2018/12/07 06:00:35 [DEBUG] agent: watch handler '[/local/data/scratch/logs/new_monitor.sh -c consul_svc_handler -o d341369-004]' output: Captured 4096 of 23990 bytes
2018/12/07 06:00:35 [DEBUG] agent: watch handler '[/local/data/scratch/logs/new_monitor.sh -c consul_svc_handler -o d341369-004]' output: Captured 4096 of 24676 bytes
2018/12/07 06:01:11 [DEBUG] agent: watch handler '[/local/data/scratch/logs/new_monitor.sh -c consul_svc_handler -o d341369-004]' output: Captured 4096 of 23990 bytes
2018/12/07 06:01:14 [DEBUG] agent: watch handler '[/local/data/scratch/logs/new_monitor.sh -c consul_svc_handler -o d341369-004]' output: Captured 4096 of 23304 bytes
2018/12/07 09:00:10 [DEBUG] agent: watch handler '[/local/data/scratch/logs/new_monitor.sh -c consul_svc_handler -o d341369-004]' output: Captured 4096 of 45912 bytes
2018/12/07 09:01:07 [DEBUG] agent: watch handler '[/local/data/scratch/logs/new_monitor.sh -c consul_svc_handler -o d341369-004]' output: Captured 4096 of 23304 bytes
[dchdfs@d341369-004:/local/data/hadoop/hadoop/logs]curl -s -ivL http://d341369-004:8042 2>&1|wc
    102     350    3796
banks commented 5 years ago

Sorry I misspoke.

I assumed your watch config was using an HTTP handler but I see it's a script call. What I was asking for was the actual output payload (the full payload) that is being delivered to your script on subsequent calls.

What you pasted there is the output of two direct queries to the API with wait=3s; nothing changed in that response between the calls, so they both blocked for 3 seconds before returning the full result (we always return the full result after the wait time even if it's not changed). So yes, I'd expect those to be identical! Also note that because you extracted only some fields, that may have masked other changes - a change to any field in the whole response is a reason to fire the watch, whether or not it's the health status changing, so your handler needs to be able to filter out non-health-status changes it doesn't care about anyway. If new instances come or go, or if tags are altered, these will all cause a legitimate delivery. This is an example of why trying to make it "exactly once" is usually pointless: most contexts like this only care about a subset of the data in the response, but we don't know which subset, so the handler needs to filter updates it doesn't care about anyway.

Can we restrict to talking about the actual payloads you see in your watch handler? Hitting the API directly seems to be confusing the discussion - watch handling is a lot more sophisticated than just hitting the raw API.

What would be useful would be to see the raw JSON being sent over STDIN to your watch script on a few consecutive calls. If that payload isn't changing then it's a bug, but I suspect it is, and the subtleties here are causing some confusion.

FWIW I think output: Captured 4096 of 23990 bytes in those logs is just telling you that the output of your watch handler script is changing - that should not affect the Consul state or delivery of events at all.

dbabits commented 5 years ago

@banks Got it. I just modified the handler to dump the full stdin into a new file every time it's invoked, so I can then diff those files. I will reply again when I have some results. Now, while we are waiting: if I replaced my HTTP check with a script check like this:

{
  "check": {
    "args": ["curl", "--fail", "http://localhost:8042"],
    "interval": "60s",
    "timeout": "5s"
  }
}

Would it make any difference? I could even ignore the output with >/dev/null 2>&1 for good measure

dbabits commented 5 years ago

@banks Paul, I was not able to generate a good diff today, and I will be out without access to a computer for a week. Please keep this thread open until then. I will be able to respond to GitHub by email, though.

Thanks

dbabits commented 5 years ago

@banks @pearkes Here's the requested diff between two stdins passed by Consul to the watch handler (I save each stdin into a new file).

-rw-r--r-- 1 user1 user1 21453 Dec 17 08:00 /local/data/scratch/logs/consul_svc_handler.1545051609.log
-rw-r--r-- 1 user1 user1 20936 Dec 17 08:01 /local/data/scratch/logs/consul_svc_handler.1545051668.log

Some things to note:

banks commented 5 years ago

The watch handler code ignores parts of JSON that refer to other nodes

You mean your code ignores other nodes?

But Consul watch doesn't know that you only care about a single node if you are making a request that includes output from many nodes - we have to let you know about any update to that entire response since we don't have any idea which bit specifically you care about (and no way to specify partial watches currently).

So yes if you are watching health for a whole service then the watch will fire for every change to any instance or node that service runs on - it would be broken if it didn't!

Maybe I misunderstand though?

The important thing is that NodeManagers should never come up in a watch handler because that particular check is always failing

Hmm no the health HTTP API returns everything including failing nodes by default so you can use your own logic. You have to specify ?passing=true to filter it in the API.

For a watch handler I don't recall if there is any way to specify that, or if we just expect you to filter out non-passing ones if you didn't care. Note that even if we did filter them, you'd still get an update sent through every time they got added/removed from the result, since that changes the result!

Does that clear anything up?

banks commented 5 years ago

Note that there is a bug fixed by #5126 that might have caused the agent to sync services every 2 minutes even without a change. That might explain why you see logs for that more often than expected. I don't think it changes the behavior you've described more recently which seems to be working as expected to me.

dbabits commented 5 years ago

@banks To clarify: I understand that watch handler may get invoked when something happens on some other node, because that's how Consul works.

Yes, the script handler ignores information about other nodes.

E.g. The handlers on nodes A and B get invoked and receive on stdin: [ { Node=A, Status=critical} {Node=B, Status=good} ]

So, the handler script on A ignores information about B, and on B it ignores A - I coded it that way (comparing against the hostname).

The question is: if something happened on node B, why is A included in that stdin, if its status has not changed?

pierresouchay commented 5 years ago

@dbabits I think there is no real reason; it's simply because it works that way: when something changes, it gives you the full status of the service.

It might be useful if, for instance, you compute the healthiness of your service and decide to perform an action only when 50% of the nodes are down.

dbabits commented 5 years ago

Please advise how to implement a simple use case: when a service becomes unhealthy - send me an email. Thx.


pierresouchay commented 5 years ago

@dbabits What is a service? When an instance of a service is down? Do you want a change each time a service is marked down? What about when it goes up again? Do you want to be notified immediately?

For instance, in our case, we implement this with https://prometheus.io/docs/alerting/alertmanager/ -> the user configures their threshold (for instance, "I don't want more than 10% of instances down for more than 10 minutes").

dbabits commented 5 years ago

I run Hadoop services. When the status of a service changes (I use an HTTP check; it could also be a curl call or a TCP check), a watch handler needs to be invoked, and its stdin should have the name of the service/host whose status changed.

When nothing changes in the status of a service, the watch handler must not be invoked, or at least it must not have that node/service in its stdin (I can filter it myself).

If nothing has changed, don't tell me that it has changed!


banks commented 5 years ago

When the status of a service changes (I use an HTTP check; it could also be a curl call or a TCP check), a watch handler needs to be invoked, and its stdin should have the name of the service/host whose status changed.

When nothing changes in the status of a service, the watch handler must not be invoked, or at least it must not have that node/service in its stdin (I can filter it myself).

Unfortunately that's just not how Consul watches work. The difference is that you are watching the output of a specific query (show me all instances of this service with their checks). When anything in that output changes the watch fires and provides the full output again.

You can build what you want from this primitive but you do need to manage some state and filtering yourself.

Typically this endpoint is used for service discovery - when the set of healthy instances changes, a load balancer or client wants to have the full set of good ones returned so it can just replace its list with the new list and not have to maintain state etc. Almost all uses of watch we've seen want it to work this way!

What you want is not the same thing - it's totally possible but you'll need to do the extra filtering yourself.
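
As an illustration of that filtering, a handler could reduce the service-type watch payload (the same shape as the /v1/health/service output shown earlier) to just the local node's check status before deciding to act - a sketch, with the service name, state file and jq dependency as assumptions:

#!/usr/bin/env bash
# Keep only this host's NodeManager check from the payload and alert on transitions only.
STATE_FILE=/var/tmp/nodemanager.status   # assumed location
ME=$(hostname -s)

status=$(jq -r --arg node "$ME" --arg svc "NodeManager" \
    '.[].Checks[] | select(.Node == $node and .ServiceName == $svc) | .Status')
prev=$(cat "$STATE_FILE" 2>/dev/null || true)

if [ -n "$status" ] && [ "$status" != "$prev" ]; then
    echo "$status" > "$STATE_FILE"
    # transition for this node only: send the single email / kick off the workflow here
fi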

What you want is more like a "feed of health check failures". It's a valid ask, just a different feature to anything that exists in Consul right now (and not one we've had before to my knowledge)! Hope that helps.

Do you still have reason to believe there is a bug here? It seems we've closed in on the expectations being different to the design rather than an actual bug?

pierresouchay commented 5 years ago

@dbabits BTW, we did something very close to what you want to do (and it is tunable) with consul-timeline, a template for consul-templaterb: https://github.com/criteo/consul-templaterb/tree/master/samples/consul-ui

I recently made a video explaining it: https://youtu.be/zLzrLGLLl4Q

You might either use it directly or take inspiration from it (but it basically does exactly what you want).

Example of rendering on one of our services:

[Screenshot: consul-timeline rendering, 2019-01-10 14:29:51]

dbabits commented 5 years ago

@banks -

It seems we've closed in on the expectations being different to the design rather than an actual bug?

I agree with that, but I think this is worse than a bug. My expectations were born out of Consul's advertisements; more details are in this comment, point 3: https://github.com/hashicorp/consul/issues/5044#issuecomment-444627157

Perhaps, if you agree, the docs could be adjusted.

@pierresouchay - thank you for the links and the screenshot - It's a very nice-looking UI! I watched the video and looked through the source code. The video mostly goes over UI features, and so far I couldn't figure out what I need to do with Consul configuration to achieve what I want. I'm not looking for a UI at the moment, just for alert-on-failure functionality

For example, looking at your screenshot: at 14:29:25 the ads055 instance goes from critical -> passing; at 14:29:40 it goes from passing -> critical.

My problem has been that at both of these timestamps Consul would also send me info about ads049, ads042, 012, etc., whether they are passing or critical. So, how do you paint your UI correctly at each timestamp, if every time Consul dumps everything on you, not just what happened now? Or do you update your UI with the status of all 22000 services every time?

pierresouchay commented 5 years ago

@dbabits the UI is refreshed using Ajax, but basically, consul-templaterb generates a history of diffs of all changes to services in the cluster. The source is here: https://github.com/criteo/consul-templaterb/blob/master/samples/consul-ui/timeline.json.erb - so it generates a very big JSON file (in our case, we configure it to keep the last 100k changes) with all changes in the cluster sorted sequentially.

The UI is auto-refreshed with Ajax and loads the JSON file every 30s.

But basically, it does exactly what you want to achieve with your alerts

dbabits commented 5 years ago

@pierresouchay But you are not using watch functionality, are you? Precisely because it cannot tell you what changed. You had to write a big state management program and do your own diffing to figure it out. What am I missing?

pierresouchay commented 5 years ago

Not watch, but blocking queries (which give you a result when something changes on a service) - just another way of implementing it.
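
For reference, the blocking-query approach boils down to a long-poll loop like this (a sketch; the service name, wait time and file paths are assumptions):

#!/usr/bin/env bash
# Long-poll /v1/health/service and diff consecutive snapshots so only real changes trigger work.
SERVICE=NodeManager
URL="http://localhost:8500/v1/health/service/${SERVICE}"
PREV=/var/tmp/${SERVICE}.prev.json   # assumed location
index=0

while true; do
    headers=$(mktemp); body=$(mktemp)
    # Returns early when the watched view changes, or after the wait time otherwise.
    curl -s --dump-header "$headers" -o "$body" "${URL}?index=${index}&wait=5m"
    index=$(awk 'tolower($1) == "x-consul-index:" {print $2}' "$headers" | tr -d '\r')

    if [ -f "$PREV" ] && ! diff -q "$PREV" "$body" >/dev/null; then
        echo "health view for ${SERVICE} changed"   # alerting / restart hook would go here
    fi
    mv "$body" "$PREV"; rm -f "$headers"
done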

dbabits commented 5 years ago

@pierresouchay So your Consul agent config has neither a 'check' nor a 'watches' section - is this correct? (Maybe you could share the relevant configs.)

And, it seems that you run your blocking queries and timeline.json.erb from one central place, not on each of 22000 nodes - is this correct?

pierresouchay commented 5 years ago

@pierresouchay So your Consul agent config has neither a 'check' nor a 'watches' section - is this correct? (Maybe you could share the relevant configs.)

Yes, no need

And, it seems that you run your blocking queries and timeline.json.erb from one central place, not on each of 22000 nodes - is this correct?

Yes again, and it can even be on a remote machine Y

dbabits commented 5 years ago

@pierresouchay Ok, I think you use Consul as a sort of database that keeps current state of your cluster. You poll it centrally, and compare with the previous state that you store outside of Consul.

The check and watch combo, on the other hand, pushes stuff to you, but unfortunately not just the changes, so you couldn't use them. That works fine for your webpage. In my case, I also need to do restarts, kills, and other workflows that must run locally on each of the 22000 nodes. Doing it centrally will not scale and would require ssh to each of the 22000 hosts, so for me it would not work.

I could of course do what you do centrally on each of the 22000 hosts, but I think it just highlights that Consul is not the tool for the job

pierresouchay commented 5 years ago

No, Consul is not a database in our case.

You want to check the status of your local node only; http://localhost:8500/v1/agent/checks does it. One simple diff every minute on curl's result would do the job.
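
That local-agent polling could look roughly like this, run from cron every minute (a sketch; the snapshot path and jq dependency are assumptions):

#!/usr/bin/env bash
# Compare the local agent's check state against the previous snapshot.
PREV=/var/tmp/agent_checks.prev.json   # assumed location
CUR=$(mktemp)

curl -s http://localhost:8500/v1/agent/checks | jq -S . > "$CUR"   # sort keys for a stable diff
if [ -f "$PREV" ] && ! diff -q "$PREV" "$CUR" >/dev/null; then
    echo "local check state changed"   # email / restart hook here
fi
mv "$CUR" "$PREV"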

You are probably trying to use the wrong tool: systemd, mesos, nomad or kubernetes is probably what you are looking for.

dbabits commented 5 years ago

You want to check the status of your local node only; http://localhost:8500/v1/agent/checks does it. One simple diff every minute on curl's result would do the job.

Agree. I can also just curl my Hadoop service and diff the result; what's the use for Consul here?

You are probably trying to use the wrong tool: systemd, mesos, nomad or kubernetes is probably what you are looking for.

Monit is the right tool, this is how it can be configured:

     check program NodeManager with path "curl --fail http://$(hostname):8042"
       if status != 0 then exec "start_node_manager.sh"

Unfortunately it lacks the distributed nature of Consul with membership information, distributed state, etc. and that's how I got into this.

banks commented 5 years ago

Sorry to hear the docs were misleading on this, @dbabits. I agree we can improve them.

To help us do that if you could share some more info on the following that would be great:

The watch command provides a mechanism to watch for changes in a particular data view .. and to invoke a process with the latest values of the view

The "view" here is the result of your query. For a "service" type watch query the result is "all the nodes that serve this service". So the watch handler is notifying you of any change to that view just as it says. Can you elaborate on which part was confusing for you? Maybe the fact you can't watch one specific service instance?

Watches are a way of specifying a view of data (e.g. .. health checks) which is monitored for updates. When an update is detected, an external handler is invoked. .. As an example, you could watch the status of health checks and notify an external system when a check is critical.

This certainly doesn't explain in gory detail all the subtleties of filtering the response to what you care about and ignoring updates that don't actually change that subset of the response. I think it would be clearer if we didn't just use "health checks" as an example, since that's vague, and instead said "all the health checks for all instances of a service", which is the actual granularity you can watch right now. Would that help?

This means that an update is only triggered when a check transitions from "passing" to "failing" or vice versa. (comparison vs Nagios)

For context for others, this is from the comparison-with-Nagios page. Again, it's not wrong, but it's talking about how Consul reports check failures from the local agent up to the servers, not about watches. The local agent only has to send data to the servers when the state changes, rather than streaming the result of every check, which is why it's more scalable. But that is not related to how you can consume checks from external tools directly.

Would it help to clarify that?

Agree. I can also just curl my Hadoop service and diff the result, what's the use for Consul here ?

There isn't one - as you rightly said, tools like monit are perfect if all you want to do is monitor a service on the same host; that is not really Consul's goal.

Unfortunately it lacks the distributed nature of Consul with membership information, distributed state, etc. and that's how I got into this.

Can you elaborate on what you need from Consul? If all you want is same-host health checks then you are right Consul buys you nothing, but if you need more like global service discovery with health checks then that is exactly the case Consul can help with!


I think what you actually want is just a slightly more fine-grained watch - the current implementation could do what you want if you actually could specify a service-instance watch type - currently you can only watch the state of a whole service (i.e. all instances of it in the cluster).

That's a valid feature request although it's not one we've heard a lot of and it's possible to do equivalent already with some extra filtering work as we've described here so it's hard to know if it will get prioritized for development resources soon.

dbabits commented 5 years ago

@banks Thanks for the follow-up.

I'm sure we all agree that users should not get confused by the docs and misunderstand the capabilities of a given product. Even if the user is dumb, the docs should help.

I think Consul documents the various APIs and commands well, but it's easy to not see the forest for the trees. I think what would be useful is a Recipes section for various use cases. Interestingly, I see that the Serf docs have such a section, but Consul doesn't. My 2c.

I could volunteer myself to write a Recipe for Watch, but honestly - I still don't understand what it could be useful for. @pierresouchay showed his use case, but then it turned out he's not using Watch

Can you elaborate on what you need from Consul? If all you want is same-host health checks then you are right Consul buys you nothing, but if you need more like global service discovery with health checks then that is exactly the case Consul can help with!

Does it answer your questions? Btw, if there's a way to talk - I'd be more than happy, many things would probably clear up that way.

banks commented 5 years ago

Thanks, that's fantastic context.

I understand the issue now - it's one case where the current blocking queries (which watch is only a very thin layer on top of) just can't do super well. I have medium-term plans for an alternative, but it's more or less a major fundamental change to Consul's design and architecture! It's possible to do what you want with some state management and diffs, but it's not super efficient to do so, so it's somewhat left up to the user if they need that. At least that way the potential high cost is obvious and not hidden by Consul! When we get around to the changes to have delta streaming rather than the current long poll reloading the whole result, the story here gets way better!

On Thu, Jan 17, 2019 at 4:55 PM Dmitry Babitsky notifications@github.com wrote:


Can you elaborate on what you need from Consul? If all you want is same-host health checks then you are right Consul buys you nothing, but if you need more like global service discovery with health checks then that is exactly the case Consul can help with!


I need to implement a system that would receive signals when my daemons become unhealthy, and take some action in response. The same-host part is not strictly true.

Example: If I have 22000 hosts with Consul agents on them, and my one daemon on 1 host becomes unhealthy - let all 22000 watch handlers receive a message about that one state change. Do not include 21999 others that haven't changed!

Put another way - if you look at what @pierresouchay https://github.com/pierresouchay had to do here: https://github.com/criteo/consul-templaterb/blob/master/samples/consul-ui/timeline.json.erb (search for diff) - that exactly what I don't want to have to do! I believe it should be Consul's job.

Back to how best to document it all, today I would show that snippet from Monit #5044 (comment) https://github.com/hashicorp/consul/issues/5044#issuecomment-453635399, and say that if that's what you're looking for - Consul can't do it, don't even try. Tomorrow, I would hopefully be able to say: Consul can do it too, but it also has all the other functionality that allows you to view global state of your cluster.

Does it answer your questions? Btw, if there's a way to talk - I'd be more than happy, many things would probably clear up that way.


stale[bot] commented 4 years ago

Hey there, We wanted to check in on this request since it has been inactive for at least 60 days. If you think this is still an important issue in the latest version of Consul or its documentation please reply with a comment here which will cause it to stay open for investigation. If there is still no activity on this issue for 30 more days, we will go ahead and close it.

Feel free to check out the community forum as well! Thank you!

dbabits commented 4 years ago

Yes, it's still an important issue, and it caused us to abandon the Consul effort. The last version we tried was Consul 1.4.0. Since this issue is apparently by design, I would not expect it to fix itself unless the owners agreed it should be changed. The second issue was the documentation, which I believe was misleading and led us down the wrong path.


stale[bot] commented 4 years ago

Hey there, We wanted to check in on this request since it has been inactive for at least 60 days. If you think this is still an important issue in the latest version of Consul or its documentation please reply with a comment here which will cause it to stay open for investigation. If there is still no activity on this issue for 30 more days, we will go ahead and close it.

Feel free to check out the community forum as well! Thank you!

dbabits commented 4 years ago

yes, the problem is still there