hashicorp / consul-template

Template rendering, notifier, and supervisor for @HashiCorp Consul and Vault data.
https://www.hashicorp.com/
Mozilla Public License 2.0

consul-template flooding consul with dedup HTTP requests #677

Closed ghost closed 8 years ago

ghost commented 8 years ago

This issue follows on from #608, where we found consul-template (v0.15.0 and prior versions) using most of the CPU on our Consul masters (which also happen to run consul-template with dedup).

The issue appears to be tied to dedup, because if we stop consul-template on all other Consul servers, consul-template behaves as expected (renders the templates, then waits on dependencies).

However, when we have one consul-template on each Consul master rendering the same template with dedup enabled, consul-template gets into a state where it doesn't wait on dependencies but keeps re-rendering the template files (or at least reloading the dependencies) in a tight loop, which causes CPU spikes from consul-template and on the elected Consul leader. This tends to start between 15 minutes and an hour after I've done a rolling restart of consul-template on each Consul master to stabilize the situation.

ghost commented 8 years ago

The cause of the high CPU load appears to be consul-template flooding Consul with HTTP requests for the dedup keys (approx. 100 HTTP requests/sec for 17 templates with dedup enabled).

sudo tcpdump -i lo -An port 8500 2>&1 | grep -o 'GET /v1/kv/consul-template/dedup/.*'

Here is just under a second of HTTP requests from tcpdump:

GET /v1/kv/consul-template/dedup/nomad/f7e9a06293055f69a6f8ec95e328c1a8/data?index=251334749&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/b02c27da9e5f3b38d542559412020630/data?index=251334750&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/a762600947c0f938e1729264124a0277/data?index=251334751&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/a0c1cc924ebff900567cee614d94dc59/data?index=251334752&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/bdfc82c35f0e1232b24208ee068c865f/data?index=251334754&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/a7d38398d4e420e1917db6a0aa374ef6/data?index=251334755&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/a662b8011cf7a18d5251173e30d31138/data?index=251334756&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/9e735d522b6a964aaed9e414c826d856/data?index=251334757&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/6571bcf24e99161c979591aa95f5022b/data?index=251334758&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/c7d1437c64451c91afbcc7e3c83451c9/data?index=251334759&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/b02c27da9e5f3b38d542559412020630/data?index=251334760&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/38e455bec052b0a2f14e4e2c863deb19/data?index=251334762&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/9a18225a60accddaa034b419512a80e2/data?index=251334763&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/a762600947c0f938e1729264124a0277/data?index=251334764&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/db5fdb701bbb6668adc72c121eaab4c9/data?index=251334765&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/bdfc82c35f0e1232b24208ee068c865f/data?index=251334767&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/a0c1cc924ebff900567cee614d94dc59/data?index=251334768&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/a662b8011cf7a18d5251173e30d31138/data?index=251334769&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/a7d38398d4e420e1917db6a0aa374ef6/data?index=251334770&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/9e735d522b6a964aaed9e414c826d856/data?index=251334771&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/1ddc6ad57e42a1ecf5f770e11b1065db/data?index=251334772&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/6571bcf24e99161c979591aa95f5022b/data?index=251334773&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/c7d1437c64451c91afbcc7e3c83451c9/data?index=251334776&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/9a18225a60accddaa034b419512a80e2/data?index=251334775&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/a0c1cc924ebff900567cee614d94dc59/data?index=251334778&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/2a385e1afa7dfa9788d90cc9d03773d3/data?index=251334777&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/38e455bec052b0a2f14e4e2c863deb19/data?index=251334780&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/bdfc82c35f0e1232b24208ee068c865f/data?index=251334781&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/a662b8011cf7a18d5251173e30d31138/data?index=251334782&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/6571bcf24e99161c979591aa95f5022b/data?index=251334783&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/b02c27da9e5f3b38d542559412020630/data?index=251334785&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/9a18225a60accddaa034b419512a80e2/data?index=251334787&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/c7d1437c64451c91afbcc7e3c83451c9/data?index=251334786&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/2a385e1afa7dfa9788d90cc9d03773d3/data?index=251334788&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/db5fdb701bbb6668adc72c121eaab4c9/data?index=251334789&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/a762600947c0f938e1729264124a0277/data?index=251334790&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/a0c1cc924ebff900567cee614d94dc59/data?index=251334793&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/a662b8011cf7a18d5251173e30d31138/data?index=251334792&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/1c06af396f256f69de4acadfab04a437/data?index=251334794&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/a7d38398d4e420e1917db6a0aa374ef6/data?index=251334795&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/9e735d522b6a964aaed9e414c826d856/data?index=251334796&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/1ddc6ad57e42a1ecf5f770e11b1065db/data?index=251334797&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/c7d1437c64451c91afbcc7e3c83451c9/data?index=251334799&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/f7e9a06293055f69a6f8ec95e328c1a8/data?index=251334802&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/6571bcf24e99161c979591aa95f5022b/data?index=251334800&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/db5fdb701bbb6668adc72c121eaab4c9/data?index=251334803&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/a762600947c0f938e1729264124a0277/data?index=251334804&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/a7d38398d4e420e1917db6a0aa374ef6/data?index=251334805&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/1c06af396f256f69de4acadfab04a437/data?index=251334806&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/1ddc6ad57e42a1ecf5f770e11b1065db/data?index=251334808&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/9e735d522b6a964aaed9e414c826d856/data?index=251334809&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/c7d1437c64451c91afbcc7e3c83451c9/data?index=251334810&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/9a18225a60accddaa034b419512a80e2/data?index=251334813&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/2a385e1afa7dfa9788d90cc9d03773d3/data?index=251334811&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/38e455bec052b0a2f14e4e2c863deb19/data?index=251334814&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/6571bcf24e99161c979591aa95f5022b/data?index=251334815&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/f7e9a06293055f69a6f8ec95e328c1a8/data?index=251334817&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/b02c27da9e5f3b38d542559412020630/data?index=251334818&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/a762600947c0f938e1729264124a0277/data?index=251334819&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/db5fdb701bbb6668adc72c121eaab4c9/data?index=251334820&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/a0c1cc924ebff900567cee614d94dc59/data?index=251334822&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/1c06af396f256f69de4acadfab04a437/data?index=251334823&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/9e735d522b6a964aaed9e414c826d856/data?index=251334825&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/1ddc6ad57e42a1ecf5f770e11b1065db/data?index=251334824&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/c7d1437c64451c91afbcc7e3c83451c9/data?index=251334826&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/6571bcf24e99161c979591aa95f5022b/data?index=251334827&stale=&wait=60000ms HTTP/1.1
GET /v1/kv/consul-template/dedup/nomad/b02c27da9e5f3b38d542559412020630/data?index=251334829&stale=&wait=60000ms HTTP/1.1

and over the course of a minute, the following numbers of HTTP requests were made for each dedup ID:

 1c06af396f256f69de4acadfab04a437: 350
 1ddc6ad57e42a1ecf5f770e11b1065db: 567
 2a385e1afa7dfa9788d90cc9d03773d3: 424
 38e455bec052b0a2f14e4e2c863deb19: 418
 6571bcf24e99161c979591aa95f5022b: 566
 9a18225a60accddaa034b419512a80e2: 556
 9e735d522b6a964aaed9e414c826d856: 439
 a0c1cc924ebff900567cee614d94dc59: 549
 a662b8011cf7a18d5251173e30d31138: 330
 a762600947c0f938e1729264124a0277: 423
 a7d38398d4e420e1917db6a0aa374ef6: 565
 b02c27da9e5f3b38d542559412020630: 566
 bdfc82c35f0e1232b24208ee068c865f: 426
 c7d1437c64451c91afbcc7e3c83451c9: 422
 db5fdb701bbb6668adc72c121eaab4c9: 565
 f7e9a06293055f69a6f8ec95e328c1a8: 425

I'm going to spend some time looking into the dedup code to see what's causing this flooding.
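
For reference, each of those GETs is a Consul blocking query: the index, wait and stale parameters in the captured URLs correspond to WaitIndex, WaitTime and AllowStale on the Go API client, and such a call should only return early when the watched key is actually modified. A minimal sketch of a single query of this shape, assuming the standard github.com/hashicorp/consul/api client (the key path is copied from the capture above; this is illustrative, not consul-template's own code):

package main

import (
    "log"
    "time"

    "github.com/hashicorp/consul/api"
)

func main() {
    client, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    // Key path copied from the tcpdump capture above.
    key := "consul-template/dedup/nomad/f7e9a06293055f69a6f8ec95e328c1a8/data"

    // First call: no WaitIndex, so it returns immediately with the current index.
    _, meta, err := client.KV().Get(key, &api.QueryOptions{AllowStale: true})
    if err != nil {
        log.Fatal(err)
    }

    // Second call: a blocking query. index=<N>, wait=60000ms and stale= in the
    // captured URLs map to WaitIndex, WaitTime and AllowStale here. This call
    // should not return for up to 60s unless the key is modified in the meantime,
    // so ~100 of these per second means either the wait index is not being
    // carried forward or the keys really are changing that fast.
    start := time.Now()
    pair, meta, err := client.KV().Get(key, &api.QueryOptions{
        WaitIndex:  meta.LastIndex,
        WaitTime:   60 * time.Second,
        AllowStale: true,
    })
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("blocking query returned after %s at index %d (pair=%v)",
        time.Since(start), meta.LastIndex, pair)
}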

ghost commented 8 years ago

The tight loop appears to be in the watchTemplate function on line 285 of dedup.go. I can confirm that the START label is entered in a tight loop, having added debug log lines after the START label:

START:
    log.Printf("entered start goto")

which has the following output:

2016/08/01 11:26:38.696740 entered start goto
2016/08/01 11:26:38.721988 entered start goto
2016/08/01 11:26:38.722875 entered start goto
2016/08/01 11:26:38.724188 entered start goto
2016/08/01 11:26:38.724775 entered start goto
2016/08/01 11:26:38.738629 entered start goto
2016/08/01 11:26:38.739783 entered start goto
2016/08/01 11:26:38.740929 entered start goto
2016/08/01 11:26:38.774037 entered start goto
2016/08/01 11:26:38.775331 entered start goto
2016/08/01 11:26:38.778038 entered start goto
2016/08/01 11:26:38.778705 entered start goto
2016/08/01 11:26:38.779033 entered start goto
2016/08/01 11:26:38.806861 entered start goto
2016/08/01 11:26:38.807635 entered start goto
2016/08/01 11:26:38.808747 entered start goto
2016/08/01 11:26:38.809374 entered start goto
2016/08/01 11:26:38.810091 entered start goto

Then we pass the first select without blocking (because ok is false):

2016/08/01 11:30:42.752078 passed first select, ok is false

// If we are currently the leader, wait for leadership lost
d.leaderLock.RLock()
lockCh, ok := d.leader[t]
d.leaderLock.RUnlock()
if ok {
    select {
    case <-lockCh:
        goto START
    case <-d.stopCh:
        return
    }
}

log.Printf("passed first select, ok is %v", ok) // custom log line

and then we should block for an update:

log.Printf("should block for update %v", t.hexMD5) // custom log line
// Block for updates on the data key
log.Printf("[INFO] (dedup) listing data for template hash %s", t.hexMD5)
pair, meta, err := client.KV().Get(path, opts)
if err != nil {
    log.Printf("[ERR] (dedup) failed to get '%s': %v", path, err)
    select {
    case <-time.After(listRetry):
        goto START
    case <-d.stopCh:
        return
    }
}
log.Printf("received update %v", t.hexMD5) // custom log line

however, the GETs do not block (the following log lines are all for the template with the hash starting 1c06):

2016/08/01 11:36:23.092347 [INFO] (dedup) listing data for template hash 1c06af396f256f69de4acadfab04a437
2016/08/01 11:36:23.192416 received update 1c06af396f256f69de4acadfab04a437
2016/08/01 11:36:23.192692 [INFO] (dedup) loading 3 dependencies from 'consul-template/dedup/nomad/1c06af396f256f69de4acadfab04a437/data'
2016/08/01 11:36:23.193182 should block for update 1c06af396f256f69de4acadfab04a437
2016/08/01 11:36:23.193380 [INFO] (dedup) listing data for template hash 1c06af396f256f69de4acadfab04a437
2016/08/01 11:36:23.580589 received update 1c06af396f256f69de4acadfab04a437
2016/08/01 11:36:23.591691 [INFO] (dedup) loading 3 dependencies from 'consul-template/dedup/nomad/1c06af396f256f69de4acadfab04a437/data'
2016/08/01 11:36:23.592275 should block for update 1c06af396f256f69de4acadfab04a437
2016/08/01 11:36:23.592370 [INFO] (dedup) listing data for template hash 1c06af396f256f69de4acadfab04a437
2016/08/01 11:36:23.680161 received update 1c06af396f256f69de4acadfab04a437
2016/08/01 11:36:23.680641 [INFO] (dedup) loading 3 dependencies from 'consul-template/dedup/nomad/1c06af396f256f69de4acadfab04a437/data'
2016/08/01 11:36:23.681029 should block for update 1c06af396f256f69de4acadfab04a437
2016/08/01 11:36:23.681073 [INFO] (dedup) listing data for template hash 1c06af396f256f69de4acadfab04a437
2016/08/01 11:36:23.834332 received update 1c06af396f256f69de4acadfab04a437
2016/08/01 11:36:23.834640 [INFO] (dedup) loading 3 dependencies from 'consul-template/dedup/nomad/1c06af396f256f69de4acadfab04a437/data'
2016/08/01 11:36:23.835753 should block for update 1c06af396f256f69de4acadfab04a437
2016/08/01 11:36:23.835881 [INFO] (dedup) listing data for template hash 1c06af396f256f69de4acadfab04a437
2016/08/01 11:36:24.172608 received update 1c06af396f256f69de4acadfab04a437
2016/08/01 11:36:24.173608 [INFO] (dedup) loading 3 dependencies from 'consul-template/dedup/nomad/1c06af396f256f69de4acadfab04a437/data'
2016/08/01 11:36:24.174152 should block for update 1c06af396f256f69de4acadfab04a437
2016/08/01 11:36:24.174561 [INFO] (dedup) listing data for template hash 1c06af396f256f69de4acadfab04a437
2016/08/01 11:36:24.311755 received update 1c06af396f256f69de4acadfab04a437
2016/08/01 11:36:24.312133 [INFO] (dedup) loading 3 dependencies from 'consul-template/dedup/nomad/1c06af396f256f69de4acadfab04a437/data'
2016/08/01 11:36:24.354071 should block for update 1c06af396f256f69de4acadfab04a437
2016/08/01 11:36:24.354395 [INFO] (dedup) listing data for template hash 1c06af396f256f69de4acadfab04a437
2016/08/01 11:36:24.539163 received update 1c06af396f256f69de4acadfab04a437
2016/08/01 11:36:24.541914 [INFO] (dedup) loading 3 dependencies from 'consul-template/dedup/nomad/1c06af396f256f69de4acadfab04a437/data'
2016/08/01 11:36:24.542204 should block for update 1c06af396f256f69de4acadfab04a437
2016/08/01 11:36:24.542362 [INFO] (dedup) listing data for template hash 1c06af396f256f69de4acadfab04a437
2016/08/01 11:36:24.751070 received update 1c06af396f256f69de4acadfab04a437
2016/08/01 11:36:24.751479 [INFO] (dedup) loading 3 dependencies from 'consul-template/dedup/nomad/1c06af396f256f69de4acadfab04a437/data'
2016/08/01 11:36:24.752694 should block for update 1c06af396f256f69de4acadfab04a437
2016/08/01 11:36:24.752834 [INFO] (dedup) listing data for template hash 1c06af396f256f69de4acadfab04a437

and then we fall through to the end of the function (where goto START is called) and the loop repeats:

// Parse the data file
if pair != nil && pair.Flags == templateDataFlag {
    d.parseData(pair.Key, pair.Value)
}
log.Printf("goto start end")
goto START

}

which I've logged below:

2016/08/01 11:39:35.744227 goto start end
2016/08/01 11:39:35.794733 goto start end
2016/08/01 11:39:35.798603 goto start end
2016/08/01 11:39:35.800368 goto start end
2016/08/01 11:39:35.801109 goto start end
2016/08/01 11:39:35.806393 goto start end
2016/08/01 11:39:35.826400 goto start end
2016/08/01 11:39:35.865822 goto start end
2016/08/01 11:39:35.867331 goto start end
2016/08/01 11:39:35.871279 goto start end
2016/08/01 11:39:35.876366 goto start end
2016/08/01 11:39:35.879766 goto start end
2016/08/01 11:39:35.881514 goto start end
2016/08/01 11:39:35.883332 goto start end
2016/08/01 11:39:35.888859 goto start end
2016/08/01 11:39:35.890429 goto start end
2016/08/01 11:39:35.897074 goto start end
2016/08/01 11:39:35.899727 goto start end
2016/08/01 11:39:35.908003 goto start end
2016/08/01 11:39:35.909951 goto start end

jippi commented 8 years ago

Are you seeing excessive writes to your templates' output files?

Or do you maybe have a proxy in front of the Consul servers?

ghost commented 8 years ago

@jippi nope - templates are static and the output files haven't been written in 48+ hours. This is a follow-up of https://github.com/hashicorp/consul-template/issues/608, so I think it's a bug.

ghost commented 8 years ago

It appears that the bug is not in the reading of the dedup data - the wait index passed at https://github.com/hashicorp/consul-template/blob/master/dedup.go#L324 is correct.

Instead, I've just logged into another server that is the leader and I've found that it is writing dedup data in a tight loop - causing non-leaders to read this data in a tight loop too.

PUT /v1/kv/consul-template/dedup/nomad/73d12f2b3f0d9558664c4e8a52f106ec/data?flags=2502208259587323168 HTTP/1.1
PUT /v1/kv/consul-template/dedup/nomad/6571bcf24e99161c979591aa95f5022b/data?flags=2502208259587323168 HTTP/1.1
PUT /v1/kv/consul-template/dedup/nomad/b02c27da9e5f3b38d542559412020630/data?flags=2502208259587323168 HTTP/1.1
PUT /v1/kv/consul-template/dedup/nomad/9a18225a60accddaa034b419512a80e2/data?flags=2502208259587323168 HTTP/1.1
PUT /v1/kv/consul-template/dedup/nomad/db5fdb701bbb6668adc72c121eaab4c9/data?flags=2502208259587323168 HTTP/1.1
PUT /v1/kv/consul-template/dedup/nomad/a7d38398d4e420e1917db6a0aa374ef6/data?flags=2502208259587323168 HTTP/1.1
PUT /v1/kv/consul-template/dedup/nomad/1ddc6ad57e42a1ecf5f770e11b1065db/data?flags=2502208259587323168 HTTP/1.1
PUT /v1/kv/consul-template/dedup/nomad/73d12f2b3f0d9558664c4e8a52f106ec/data?flags=2502208259587323168 HTTP/1.1
PUT /v1/kv/consul-template/dedup/nomad/6571bcf24e99161c979591aa95f5022b/data?flags=2502208259587323168 HTTP/1.1
PUT /v1/kv/consul-template/dedup/nomad/b02c27da9e5f3b38d542559412020630/data?flags=2502208259587323168 HTTP/1.1
PUT /v1/kv/consul-template/dedup/nomad/9a18225a60accddaa034b419512a80e2/data?flags=2502208259587323168 HTTP/1.1
PUT /v1/kv/consul-template/dedup/nomad/db5fdb701bbb6668adc72c121eaab4c9/data?flags=2502208259587323168 HTTP/1.1
PUT /v1/kv/consul-template/dedup/nomad/a7d38398d4e420e1917db6a0aa374ef6/data?flags=2502208259587323168 HTTP/1.1
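
To spell out why a correct wait index on the read side still spins under writes like these: a blocking Get returns as soon as the key's ModifyIndex passes the WaitIndex that was sent, so a leader rewriting the data key in a tight loop drives every follower's watch loop at the same rate. A minimal sketch of such a watch loop, assuming the standard github.com/hashicorp/consul/api client plus log and time imports (illustrative only, not the actual dedup.go code):

// Watch one dedup data key. Even with the wait index handled correctly,
// every PUT by the leader bumps the key's index past waitIndex, so each
// Get returns immediately and the loop runs as fast as the leader writes.
func watchDataKey(client *api.Client, key string, stopCh <-chan struct{}) {
    var waitIndex uint64
    for {
        select {
        case <-stopCh:
            return
        default:
        }
        pair, meta, err := client.KV().Get(key, &api.QueryOptions{
            WaitIndex:  waitIndex,
            WaitTime:   60 * time.Second,
            AllowStale: true,
        })
        if err != nil {
            log.Printf("[ERR] (sketch) failed to get %q: %v", key, err)
            time.Sleep(time.Second)
            continue
        }
        // The next call blocks until the key changes beyond this index.
        waitIndex = meta.LastIndex
        if pair != nil {
            log.Printf("[INFO] (sketch) %q changed, %d bytes", key, len(pair.Value))
        }
    }
}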

ghost commented 8 years ago

I'm now able to replicate the issue about 50% of the time. I start consul-template on three instances in dedup mode (with the debug log level enabled to trace the output, though that's not required to reproduce).

I then run curl -X DELETE 127.0.0.1:8500/v1/kv/consul-template?recurse=true and in about 50% of cases this causes all three instances of consul-template (each on a different VM) to get stuck in the loop.

I see lots of log lines where consul-template has acquired a lock and then released it:

2016/08/01 14:13:48.586778 [INFO] (dedup) acquired lock 'consul-template/dedup/nomad/bdfc82c35f0e1232b24208ee068c865f/lock'
2016/08/01 14:13:48.586794 [INFO] (dedup) releasing lock 'consul-template/dedup/nomad/bdfc82c35f0e1232b24208ee068c865f/lock'

and I also see lots of log lines where consul-template is writing the same dedup data multiple times per second (see line #1 and line #5):

PUT /v1/kv/consul-template/dedup/nomad/a7d38398d4e420e1917db6a0aa374ef6/data?flags=2502208259587323168 HTTP/1.1
PUT /v1/kv/consul-template/dedup/nomad/c7d1437c64451c91afbcc7e3c83451c9/data?flags=2502208259587323168 HTTP/1.1
PUT /v1/kv/consul-template/dedup/nomad/9e735d522b6a964aaed9e414c826d856/data?flags=2502208259587323168 HTTP/1.1
PUT /v1/kv/consul-template/dedup/nomad/b02c27da9e5f3b38d542559412020630/data?flags=2502208259587323168 HTTP/1.1
PUT /v1/kv/consul-template/dedup/nomad/a7d38398d4e420e1917db6a0aa374ef6/data?flags=2502208259587323168 HTTP/1.1

meanwhile, all three servers are logging checking ctemplate for the same templates, which could suggest that each consul-template process thinks it's the leader (and could explain the instances reading and writing dedup data in an infinite loop) - or that there are frequent elections:

2016/08/01 15:09:46 [DEBUG] (runner) checking ctemplate &{Source:/etc/consul-template/nomad/tmpl-source/charts.nomad.hcl.ctmpl Destination:/etc/nomad/jobs/charts.hcl Command:nomad run /etc/nomad/jobs/charts.hcl || true CommandTimeout:30s Perms:-rw-r--r-- Backup:false LeftDelim: RightDelim: Wait:0xc8200f80e0}

and (on a different VM):

2016/08/01 15:10:07 [DEBUG] (runner) checking ctemplate &{Source:/etc/consul-template/nomad/tmpl-source/charts.nomad.hcl.ctmpl Destination:/etc/nomad/jobs/charts.hcl Command:nomad run /etc/nomad/jobs/charts.hcl || true CommandTimeout:30s Perms:-rw-r--r-- Backup:false LeftDelim: RightDelim: Wait:0xc8200f80e0}
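
For context on the leadership suspicion: dedup mode elects one writer per template by taking a Consul lock on the .../lock key under a session, and only the holder is supposed to write the data key, so two instances that both believe they hold the lock would both write it. A minimal sketch of that lock pattern, assuming the standard github.com/hashicorp/consul/api helper plus a log import (key copied from the lock log lines earlier; illustrative only, not consul-template's actual code):

// Acquire the per-template dedup lock and hold it until leadership is lost
// or stopCh fires. The channel returned by Lock is closed on leadership loss.
func electAndHold(client *api.Client, stopCh <-chan struct{}) error {
    key := "consul-template/dedup/nomad/bdfc82c35f0e1232b24208ee068c865f/lock"
    lock, err := client.LockOpts(&api.LockOptions{Key: key})
    if err != nil {
        return err
    }
    // Blocks until the lock is acquired or stopCh fires.
    leaderCh, err := lock.Lock(stopCh)
    if err != nil || leaderCh == nil {
        return err // aborted or failed; we are not the leader
    }
    log.Printf("[INFO] (sketch) acquired lock %q", key)
    select {
    case <-leaderCh: // closed when leadership is lost
        log.Printf("[INFO] (sketch) lost lock %q", key)
    case <-stopCh:
    }
    return lock.Unlock()
}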

sethvargo commented 8 years ago

Ping @armon

armon commented 8 years ago

@grobinson-blockchain Can you provide the template that you are rendering? I'm wondering if the de-dup key is creating a feedback loop into the template?

ghost commented 8 years ago

We've set up consul-template to use an include dir rather than a single configuration file.

This dir has a configuration file containing the Consul ACL token and the dedup settings:

{
  "token": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "deduplicate": {
    "prefix": "consul-template/dedup/nomad",
    "enabled": true
  }
}

and then each of the templates has a configuration file like so:

{
  "template": [
    {
      "source": "/etc/consul-template/nomad/tmpl-source/demo.nomad.hcl.ctmpl",
      "destination": "/etc/nomad/jobs/demo.hcl",
      "command": "nomad run /etc/nomad/jobs/demo.hcl || true"
    }
  ]
}

and the following template:

job "demo" {
    type = "service"
    region = "europe"
    datacenters = ["europe-west1"]

    update {
        stagger = "30s"
        max_parallel = 1
    }

    group "demo" {
        count = {{key_or_default "nomad/jobs/demo/count" "1"}}

        task "demo" {
            driver = "docker"

            config {
                image = "{{with $v := key "nomad/jobs/demo/releases" | parseJSON}}{{if $v}}{{$v.latest.registry}}/{{$v.latest.repository}}:{{$v.latest.tag}}{{end}}{{end}}"
                dns_servers = [
                    "172.17.0.1"
                ]
                port_map = {
                    http = 8080
                }
            }

            constraint {
                attribute = "${node.class}"
                value = "backend-generic"
            }

            service {
                name = "demo"
                port = "http"
                check {
                    type = "http"
                    path = "/healthz"
                    interval = "10s"
                    timeout = "2s"
                }
            }

            env {
                {{range $v := key "nomad/jobs/defaults/env" | split "\n"}}
                {{$v}}{{end}}
                {{range $v := key "nomad/jobs/demo/env" | split "\n"}}
                {{$v}}{{end}}
            }

            resources {
                cpu = 512
                memory = 1024
                network {
                    mbits = 1
                    port "http" {}
                }
            }
        }
    }
}

ghost commented 8 years ago

FWIW - the entire process renders Nomad job files and nothing else - we have 17 templates (because we have 17 Nomad jobs), which all have the same format.

ghost commented 8 years ago

I'm afraid to report that last night consul-template appears to have brought down our Consul cluster in production (5 servers, 30 clients, with each server pinned at around 96% CPU usage from consul-template) due to the high load on the leader (forcing serf EventMemberFailed events).

Aug 13 01:55:57  consul[27056]: 2016/08/13 01:55:57 [INFO] consul: New leader elected: xxx
Aug 13 01:56:07  consul[27056]: 2016/08/13 01:56:07 [INFO] serf: EventMemberFailed: xxx 10.134.0.4
Aug 13 01:56:07  consul[27056]: 2016/08/13 01:56:07 [INFO] consul: removing server xxx (Addr: 10.134.0.4:8300) (DC: europe-west1)
Aug 13 01:56:44  consul[27056]: 2016/08/13 01:56:44 [INFO] serf: attempting reconnect to xxx 10.134.0.4:8301
Aug 13 01:56:44  consul[27056]: 2016/08/13 01:56:44 [INFO] serf: EventMemberJoin: xxx 10.134.0.4
Aug 13 01:56:44  consul[27056]: 2016/08/13 01:56:44 [INFO] consul: adding server xxx (Addr: 10.134.0.4:8300) (DC: europe-west1)
Aug 13 02:21:44  consul[27056]: 2016/08/13 02:21:44 [INFO] serf: attempting reconnect to xxx 10.134.0.27:8301
Aug 13 02:46:54  consul[27056]: 2016/08/13 02:46:54 [INFO] serf: attempting reconnect to xxx 10.134.0.27:8301
Aug 13 03:03:34  consul[27056]: 2016/08/13 03:03:34 [INFO] serf: attempting reconnect to xxx 10.134.0.27:8301
Aug 13 03:36:14  consul[27056]: 2016/08/13 03:36:14 [INFO] serf: attempting reconnect to xxx 10.134.0.27:8301
Aug 13 03:47:24  consul[27056]: 2016/08/13 03:47:24 [INFO] serf: attempting reconnect to xxx 10.134.0.27:8301
Aug 13 04:14:55  consul[27056]: 2016/08/13 04:14:55 [INFO] consul: New leader elected: xxx
Aug 13 04:15:04  consul[27056]: 2016/08/13 04:15:04 [INFO] memberlist: Marking xxx as failed, suspect timeout reached
Aug 13 04:15:04  consul[27056]: 2016/08/13 04:15:04 [INFO] serf: EventMemberFailed: xxx 10.134.0.3
Aug 13 04:15:04  consul[27056]: 2016/08/13 04:15:04 [INFO] consul: removing server xxx (Addr: 10.134.0.3:8300) (DC: europe-west1)
Aug 13 04:15:04  consul[27056]: 2016/08/13 04:15:04 [INFO] serf: EventMemberJoin: xxx 10.134.0.3
Aug 13 04:15:04  consul[27056]: 2016/08/13 04:15:04 [INFO] consul: adding server xxx (Addr: 10.134.0.3:8300) (DC: europe-west1)

sethvargo commented 8 years ago

Hi @grobinson-blockchain

You don't have a ton of dependencies in your template, so I would recommend turning off de-duplicate mode for the time being, until we can resolve this issue. At 17 templates, you'll only have 68 edge queries by my math, which should be well under anything that would DDoS your Consul cluster.

ghost commented 8 years ago

Thanks! I've disabled dedup for the time being (CPU usage has dropped from 96% to 2%, which is great) - however, this is quite a serious bug for us, so I will also be spending some more time to find the cause and submit a PR if possible.

sethvargo commented 8 years ago

Okay great - I agree this is a sad bug, but three people on our team have tried to track it down without success, so I fear it's going to be some magical incantation of your Consul setup, Nomad setup, and Consul Template invocation. I'm not saying it's your fault, but we've been unable to reproduce this at all, so there is likely some external variable we are not controlling.

It would be very helpful if you could provide a bash script (or scripts) to reproduce this issue semi-consistently. This will help greatly in narrowing down the cause (and fixing it). 😄

ghost commented 8 years ago

Hi Seth,

First and foremost, thank you to you and your team for investing time in this issue! Your efforts have not gone unappreciated, even if we haven't yet found the cause. I'll do my very best to find a reliable way to reproduce the issue deterministically (if possible) to help us narrow it down. Nonetheless, I'll continue to spend time debugging this issue individually and I'll keep you and the team updated via this GitHub issue, if that's okay?

Cheers

slackpad commented 8 years ago

Hi @grobinson-blockchain, which version of Consul are you running? There was a strange KV blocking bug (https://github.com/hashicorp/consul/pull/1632), but that has been fixed if you are running 0.6.4. Probably not the problem here, but I wanted to double check.

ghost commented 8 years ago

Hi @slackpad - I'm afraid we're running 0.6.4.

:~$ consul -v
Consul v0.6.4
Consul Protocol: 3 (Understands back to: 1)

slackpad commented 8 years ago

Hi @grobinson-blockchain, I took a deep look through the code tonight and couldn't come up with anything conclusive. The only odd thing was this, from your info above:

2016/08/01 14:13:48.586778 [INFO] (dedup) acquired lock 'consul-template/dedup/nomad/bdfc82c35f0e1232b24208ee068c865f/lock'
2016/08/01 14:13:48.586794 [INFO] (dedup) releasing lock 'consul-template/dedup/nomad/bdfc82c35f0e1232b24208ee068c865f/lock'

Looking at the code, it seems like the only way we could get into that situation is if sessionCh closes while we attempt the lock, which will return nil, nil, so the log will say the lock was acquired, but the code will immediately fall through and write a nil back into the leader slot for that template.

Something like this may help:

diff --git a/dedup.go b/dedup.go
index 475ef88..0f7f822 100644
--- a/dedup.go
+++ b/dedup.go
@@ -163,6 +163,7 @@ START:
                log.Printf("[ERR] (dedup) failed to renew session: %v", err)
        }
        close(sessionCh)
+       d.wg.Wait()

 WAIT:
        select {

This'll wait for all the existing acquire attempts to close out before starting a new round under a different session. I can't prove it, but it seems like there could be some racy behavior here where this kind of thing happens:

  1. The lock is lost and we try to re-acquire.
  2. Something happens to the session and that expires, too.
  3. createSession() loops back around and makes a new session, and one of its new attemptLock() calls manages to get a lock.
  4. An old attemptLock() call from the previous session runs, gets the nil, nil back from the lock, thinks it acquires, then releases the lock, finally nil-ing out the leader lock, which stomps on the real lock.
  5. Some kind of loop ensues since it thinks it lost the lock and/or watchTemplate() is waiting on an old leaderCh that got abandoned or is in a weird state.

Do you have a safe test environment you could maybe try this patch in?
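
If it helps to visualize step 4 above, here is a minimal sketch of the guard it implies, assuming the github.com/hashicorp/consul/api Lock helper and the nil, nil behavior described above (hypothetical helper, not the actual dedup.go change):

// Treat a nil leader channel from Lock as "attempt aborted", not "acquired":
// when the session channel closes mid-acquire, Lock returns nil, nil, and
// that must not be recorded as leadership (or later "released").
func tryAcquire(lock *api.Lock, sessionCh <-chan struct{}) (<-chan struct{}, bool) {
    leaderCh, err := lock.Lock(sessionCh)
    if err != nil {
        log.Printf("[ERR] (sketch) lock attempt failed: %v", err)
        return nil, false
    }
    if leaderCh == nil {
        // The session went away while acquiring; we never really held the
        // lock, so don't write it into the leader slot for the template.
        return nil, false
    }
    return leaderCh, true
}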

slackpad commented 8 years ago

Sorry, I made some edits to that previous comment - you may want to read it online if you were reading it via email :-)

slackpad commented 8 years ago

If this doesn't help, and you can't share a repro case, I think the next thing would be to post a gist with full logs from consul-template when this happens, so we can try to piece together a timeline / path through the code. Debug level would be good if you can.