Overview of the Issue
https://github.com/hashicorp/consul/blob/b6219106188717a8813686030d6ddce256a9b3c6/agent/consul/connect_ca_endpoint_test.go#L253
Strangely enough, this only seems to have a chance of breaking when running all of the agent/consul tests at once. It could be shared state between the tests, or some kind of resource starvation? Need to run it again with make test-flake.
Reproduction Steps
The test flake first occurred when running the entire ./agent/consul suite:
gotestsum -f short-verbose -- ./agent/consul -timeout 0
Running the single test in isolation:
gotestsum -f short-verbose -- ./agent/consul -run TestConnectCAConfig_TriggerRotation -timeout 0 -count 1000
was not enough to get a failure. Need to try to reproduce with make test-flake.
Operating system and Environment details
macOS 10.14.6
Log Fragments
Logs from local run:
=== RUN TestConnectCAConfig_TriggerRotation
=== PAUSE TestConnectCAConfig_TriggerRotation
=== CONT TestConnectCAConfig_TriggerRotation
--- FAIL: TestConnectCAConfig_TriggerRotation (1.45s)
log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:20.179050 [DEBUG] tlsutil: Update with version 1
log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:20.179628 [DEBUG] tlsutil: OutgoingRPCWrapper with version 1
writer.go:21: 2020/01/24 14:44:20 [INFO] raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:7c87f8f5-0ada-2ccc-14e6-7b6158365dcf Address:127.0.0.1:12011}]"
writer.go:21: 2020/01/24 14:44:20 [INFO] raft: entering follower state: follower="Node at 127.0.0.1:12011 [Follower]" leader=
log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:20.472580 [INFO] serf: EventMemberJoin: TestConnectCAConfig_TriggerRotation-node-244.dc1 127.0.0.1
log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:20.479788 [INFO] serf: EventMemberJoin: TestConnectCAConfig_TriggerRotation-node-244 127.0.0.1
log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:20.479945 [INFO] consul: Adding LAN server TestConnectCAConfig_TriggerRotation-node-244 (Addr: tcp/127.0.0.1:12011) (DC: dc1)
log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:20.479964 [INFO] consul: Handled member-join event for server "TestConnectCAConfig_TriggerRotation-node-244.dc1" in area "wan"
writer.go:21: 2020/01/24 14:44:20 [WARN] raft: heartbeat timeout reached, starting election: last-leader=
writer.go:21: 2020/01/24 14:44:20 [INFO] raft: entering candidate state: node="Node at 127.0.0.1:12011 [Candidate]" term=2
writer.go:21: 2020/01/24 14:44:20 [INFO] raft: election won: tally=1
writer.go:21: 2020/01/24 14:44:20 [INFO] raft: entering leader state: leader="Node at 127.0.0.1:12011 [Leader]"
log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:20.889094 [INFO] consul: cluster leadership acquired
log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:20.889199 [INFO] consul: New leader elected: TestConnectCAConfig_TriggerRotation-node-244
log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:21.022334 [DEBUG] tlsutil: OutgoingRPCWrapper with version 1
log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:21.023422 [DEBUG] consul CA provider configured ID=07:80:c8:de:f6:41:86:29:8f:9c:b8:17:d6:48:c2:d5:c5:5c:7f:0c:03:f7:cf:97:5a:a7:c1:68:aa:23:ae:81 IsPrimary=true
log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:21.174448 [INFO] connect: initialized primary datacenter CA with provider "consul"
log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:21.174514 [INFO] leader: started CA root pruning routine
log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:21.174620 [DEBUG] consul: Skipping self join check for "TestConnectCAConfig_TriggerRotation-node-244" since the cluster is too small
log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:21.174652 [INFO] consul: member 'TestConnectCAConfig_TriggerRotation-node-244' joined, marking health alive
log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:21.233543 [DEBUG] consul: Skipping self join check for "TestConnectCAConfig_TriggerRotation-node-244" since the cluster is too small
log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:21.262598 [DEBUG] consul CA provider configured ID=48:45:95:7e:ac:ba:93:c0:ab:a1:ab:2f:63:48:1d:fa:55:ec:5e:92:79:5c:57:16:10:56:9b:06:0b:86:7c:63 IsPrimary=true
log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:21.375184 [INFO] connect: pruning old unused root CA (ID: 96:26:f0:4d:d9:42:d4:93:ff:d4:bd:4a:a5:91:69:03:11:df:f5:4c)
log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:21.446295 [INFO] connect: CA rotated to new root under provider "consul"
connect_ca_endpoint_test.go:306:
Error Trace: connect_ca_endpoint_test.go:306
Error: "[%!s(*structs.CARoot=&{9e:84:70:af:1b:91:da:44:f5:b3:82:41:e9:05:be:a4:d7:f3:57:ba Consul CA Root Cert 8 81:3b:4a:7b:b4:2b:75:75:59:72:cc:bd:4a:bd:90:a5:2b:b0:e1:50:6f:ba:ad:73:8e:cf:16:f5:34:e9:5a:05 11111111-2222-3333-4444-555555555555 {0 63715502661 <nil>} {0 64031121861 <nil>} -----BEGIN CERTIFICATE-----
MIICDjCCAbOgAwIBAgIBCDAKBggqhkjOPQQDAjAwMS4wLAYDVQQDEyVwcmktbGIw
enh2aC5jb25zdWwuY2EuMTExMTExMTEuY29uc3VsMB4XDTIwMDEyNDIyNDQyMVoX
DTMwMDEyNDIyNDQyMVowMDEuMCwGA1UEAxMlcHJpLWxiMHp4dmguY29uc3VsLmNh
LjExMTExMTExLmNvbnN1bDBZMBMGByqGSM49AgEGCCqGSM49AwEHA0IABBJ2ONXb
Bl227Dn39DRTO3Ls4Edy1ckZ+hnJxUsxdRQEjs6TBIVD4sU+cXn/MxtwYsyFktGQ
VQm3Oqz4vFoJVYqjgb0wgbowDgYDVR0PAQH/BAQDAgGGMA8GA1UdEwEB/wQFMAMB
Af8wKQYDVR0OBCIEIIE7Snu0K3V1WXLMvUq9kKUrsOFQb7qtc47PFvU06VoFMCsG
A1UdIwQkMCKAIIE7Snu0K3V1WXLMvUq9kKUrsOFQb7qtc47PFvU06VoFMD8GA1Ud
EQQ4MDaGNHNwaWZmZTovLzExMTExMTExLTIyMjItMzMzMy00NDQ0LTU1NTU1NTU1
NTU1NS5jb25zdWwwCgYIKoZIzj0EAwIDSQAwRgIhAL+DxYjgyXRypuPr/e40/N3o
SsfL4nRMauawk3jQ522SAiEA7g9Hm08JMhRzWXIq1sitqBFj38QvGlETk6wGhJ1F
ZPo=
-----END CERTIFICATE-----
[-----BEGIN CERTIFICATE-----
MIICDDCCAbOgAwIBAgIBCTAKBggqhkjOPQQDAjAwMS4wLAYDVQQDEyVwcmktbGNs
OHc4Zi5jb25zdWwuY2EuMTExMTExMTEuY29uc3VsMB4XDTIwMDEyNDIyNDMyMVoX
DTIwMDEzMTIyNDMyMVowMDEuMCwGA1UEAxMlcHJpLWxiMHp4dmguY29uc3VsLmNh
LjExMTExMTExLmNvbnN1bDBZMBMGByqGSM49AgEGCCqGSM49AwEHA0IABBJ2ONXb
Bl227Dn39DRTO3Ls4Edy1ckZ+hnJxUsxdRQEjs6TBIVD4sU+cXn/MxtwYsyFktGQ
VQm3Oqz4vFoJVYqjgb0wgbowDgYDVR0PAQH/BAQDAgGGMA8GA1UdEwEB/wQFMAMB
Af8wKQYDVR0OBCIEIIE7Snu0K3V1WXLMvUq9kKUrsOFQb7qtc47PFvU06VoFMCsG
A1UdIwQkMCKAIKzlTQpV/BVh64+siPPjHDe6Q+7jZtlPkVWE/aWDp5PoMD8GA1Ud
EQQ4MDaGNHNwaWZmZTovLzExMTExMTExLTIyMjItMzMzMy00NDQ0LTU1NTU1NTU1
NTU1NS5jb25zdWwwCgYIKoZIzj0EAwIDRwAwRAIgXvFTEEKMRg396PtOP1OQXH30
RjkofJ1WVfjDaJZ5qOwCIG4RdYZrw67/GdLjBDXrLdhvvK42JI8U2+QnfPnsK847
-----END CERTIFICATE-----
] true {0 0 <nil>} ec 256 {16 18}})]" should have 2 item(s), but has 1
Test: TestConnectCAConfig_TriggerRotation
log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:21.516744 [INFO] consul: shutting down server
log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:21.516772 [DEBUG] leader: stopping CA root pruning routine
log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:21.516781 [WARN] serf: Shutdown without a Leave
log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:21.516812 [DEBUG] leader: stopped CA root pruning routine
log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:21.534273 [WARN] serf: Shutdown without a Leave
log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:21.541301 [DEBUG] consul: Skipping self join check for "TestConnectCAConfig_TriggerRotation-node-244" since the cluster is too small
log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:21.554783 [INFO] manager: shutting down
FAIL agent/consul.TestConnectCAConfig_TriggerRotation (1.45s)