hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io
Other
28.1k stars 4.4k forks source link

unflake agent/consul.TestConnectCAConfig_TriggerRotation #7131

Open mkcp opened 4 years ago

mkcp commented 4 years ago

Overview of the Issue

https://github.com/hashicorp/consul/blob/b6219106188717a8813686030d6ddce256a9b3c6/agent/consul/connect_ca_endpoint_test.go#L253

Strangely enough this only seems to have a chance of breaking when running all of the agent/consul tests at once... could be shared state between them or some kind of resource starvation? Need to run it again w/ make test-flake.

Reproduction Steps

Test flake first occurred running the entire ./agent/consul suite gotestsum -f short-verbose -- ./agent/consul -timeout 0

Running: gotestsum -f short-verbose -- ./agent/consul -run TestConnectCAConfig_TriggerRotation -timeout 0 -count 1000 was not enough to get a fail. Need to try to repro with make test-flake

Operating system and Environment details

macOS 10.14.6

Log Fragments

Logs from local run:

=== RUN   TestConnectCAConfig_TriggerRotation
=== PAUSE TestConnectCAConfig_TriggerRotation
=== CONT  TestConnectCAConfig_TriggerRotation
--- FAIL: TestConnectCAConfig_TriggerRotation (1.45s)
    log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:20.179050 [DEBUG] tlsutil: Update with version 1
    log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:20.179628 [DEBUG] tlsutil: OutgoingRPCWrapper with version 1
    writer.go:21: 2020/01/24 14:44:20 [INFO]  raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:7c87f8f5-0ada-2ccc-14e6-7b6158365dcf Address:127.0.0.1:12011}]"
    writer.go:21: 2020/01/24 14:44:20 [INFO]  raft: entering follower state: follower="Node at 127.0.0.1:12011 [Follower]" leader=
    log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:20.472580 [INFO] serf: EventMemberJoin: TestConnectCAConfig_TriggerRotation-node-244.dc1 127.0.0.1
    log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:20.479788 [INFO] serf: EventMemberJoin: TestConnectCAConfig_TriggerRotation-node-244 127.0.0.1
    log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:20.479945 [INFO] consul: Adding LAN server TestConnectCAConfig_TriggerRotation-node-244 (Addr: tcp/127.0.0.1:12011) (DC: dc1)
    log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:20.479964 [INFO] consul: Handled member-join event for server "TestConnectCAConfig_TriggerRotation-node-244.dc1" in area "wan"
    writer.go:21: 2020/01/24 14:44:20 [WARN]  raft: heartbeat timeout reached, starting election: last-leader=
    writer.go:21: 2020/01/24 14:44:20 [INFO]  raft: entering candidate state: node="Node at 127.0.0.1:12011 [Candidate]" term=2
    writer.go:21: 2020/01/24 14:44:20 [INFO]  raft: election won: tally=1
    writer.go:21: 2020/01/24 14:44:20 [INFO]  raft: entering leader state: leader="Node at 127.0.0.1:12011 [Leader]"
    log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:20.889094 [INFO] consul: cluster leadership acquired
    log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:20.889199 [INFO] consul: New leader elected: TestConnectCAConfig_TriggerRotation-node-244
    log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:21.022334 [DEBUG] tlsutil: OutgoingRPCWrapper with version 1
    log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:21.023422 [DEBUG] consul CA provider configured ID=07:80:c8:de:f6:41:86:29:8f:9c:b8:17:d6:48:c2:d5:c5:5c:7f:0c:03:f7:cf:97:5a:a7:c1:68:aa:23:ae:81 IsPrimary=true
    log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:21.174448 [INFO] connect: initialized primary datacenter CA with provider "consul"
    log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:21.174514 [INFO] leader: started CA root pruning routine
    log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:21.174620 [DEBUG] consul: Skipping self join check for "TestConnectCAConfig_TriggerRotation-node-244" since the cluster is too small
    log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:21.174652 [INFO] consul: member 'TestConnectCAConfig_TriggerRotation-node-244' joined, marking health alive
    log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:21.233543 [DEBUG] consul: Skipping self join check for "TestConnectCAConfig_TriggerRotation-node-244" since the cluster is too small
    log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:21.262598 [DEBUG] consul CA provider configured ID=48:45:95:7e:ac:ba:93:c0:ab:a1:ab:2f:63:48:1d:fa:55:ec:5e:92:79:5c:57:16:10:56:9b:06:0b:86:7c:63 IsPrimary=true
    log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:21.375184 [INFO] connect: pruning old unused root CA (ID: 96:26:f0:4d:d9:42:d4:93:ff:d4:bd:4a:a5:91:69:03:11:df:f5:4c)
    log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:21.446295 [INFO] connect: CA rotated to new root under provider "consul"
    connect_ca_endpoint_test.go:306: 
            Error Trace:    connect_ca_endpoint_test.go:306
            Error:          "[%!s(*structs.CARoot=&{9e:84:70:af:1b:91:da:44:f5:b3:82:41:e9:05:be:a4:d7:f3:57:ba Consul CA Root Cert 8 81:3b:4a:7b:b4:2b:75:75:59:72:cc:bd:4a:bd:90:a5:2b:b0:e1:50:6f:ba:ad:73:8e:cf:16:f5:34:e9:5a:05 11111111-2222-3333-4444-555555555555 {0 63715502661 <nil>} {0 64031121861 <nil>} -----BEGIN CERTIFICATE-----
                            MIICDjCCAbOgAwIBAgIBCDAKBggqhkjOPQQDAjAwMS4wLAYDVQQDEyVwcmktbGIw
                            enh2aC5jb25zdWwuY2EuMTExMTExMTEuY29uc3VsMB4XDTIwMDEyNDIyNDQyMVoX
                            DTMwMDEyNDIyNDQyMVowMDEuMCwGA1UEAxMlcHJpLWxiMHp4dmguY29uc3VsLmNh
                            LjExMTExMTExLmNvbnN1bDBZMBMGByqGSM49AgEGCCqGSM49AwEHA0IABBJ2ONXb
                            Bl227Dn39DRTO3Ls4Edy1ckZ+hnJxUsxdRQEjs6TBIVD4sU+cXn/MxtwYsyFktGQ
                            VQm3Oqz4vFoJVYqjgb0wgbowDgYDVR0PAQH/BAQDAgGGMA8GA1UdEwEB/wQFMAMB
                            Af8wKQYDVR0OBCIEIIE7Snu0K3V1WXLMvUq9kKUrsOFQb7qtc47PFvU06VoFMCsG
                            A1UdIwQkMCKAIIE7Snu0K3V1WXLMvUq9kKUrsOFQb7qtc47PFvU06VoFMD8GA1Ud
                            EQQ4MDaGNHNwaWZmZTovLzExMTExMTExLTIyMjItMzMzMy00NDQ0LTU1NTU1NTU1
                            NTU1NS5jb25zdWwwCgYIKoZIzj0EAwIDSQAwRgIhAL+DxYjgyXRypuPr/e40/N3o
                            SsfL4nRMauawk3jQ522SAiEA7g9Hm08JMhRzWXIq1sitqBFj38QvGlETk6wGhJ1F
                            ZPo=
                            -----END CERTIFICATE-----
                             [-----BEGIN CERTIFICATE-----
                            MIICDDCCAbOgAwIBAgIBCTAKBggqhkjOPQQDAjAwMS4wLAYDVQQDEyVwcmktbGNs
                            OHc4Zi5jb25zdWwuY2EuMTExMTExMTEuY29uc3VsMB4XDTIwMDEyNDIyNDMyMVoX
                            DTIwMDEzMTIyNDMyMVowMDEuMCwGA1UEAxMlcHJpLWxiMHp4dmguY29uc3VsLmNh
                            LjExMTExMTExLmNvbnN1bDBZMBMGByqGSM49AgEGCCqGSM49AwEHA0IABBJ2ONXb
                            Bl227Dn39DRTO3Ls4Edy1ckZ+hnJxUsxdRQEjs6TBIVD4sU+cXn/MxtwYsyFktGQ
                            VQm3Oqz4vFoJVYqjgb0wgbowDgYDVR0PAQH/BAQDAgGGMA8GA1UdEwEB/wQFMAMB
                            Af8wKQYDVR0OBCIEIIE7Snu0K3V1WXLMvUq9kKUrsOFQb7qtc47PFvU06VoFMCsG
                            A1UdIwQkMCKAIKzlTQpV/BVh64+siPPjHDe6Q+7jZtlPkVWE/aWDp5PoMD8GA1Ud
                            EQQ4MDaGNHNwaWZmZTovLzExMTExMTExLTIyMjItMzMzMy00NDQ0LTU1NTU1NTU1
                            NTU1NS5jb25zdWwwCgYIKoZIzj0EAwIDRwAwRAIgXvFTEEKMRg396PtOP1OQXH30
                            RjkofJ1WVfjDaJZ5qOwCIG4RdYZrw67/GdLjBDXrLdhvvK42JI8U2+QnfPnsK847
                            -----END CERTIFICATE-----
                            ]   true {0 0 <nil>} ec 256 {16 18}})]" should have 2 item(s), but has 1
            Test:           TestConnectCAConfig_TriggerRotation
    log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:21.516744 [INFO] consul: shutting down server
    log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:21.516772 [DEBUG] leader: stopping CA root pruning routine
    log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:21.516781 [WARN] serf: Shutdown without a Leave
    log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:21.516812 [DEBUG] leader: stopped CA root pruning routine
    log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:21.534273 [WARN] serf: Shutdown without a Leave
    log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:21.541301 [DEBUG] consul: Skipping self join check for "TestConnectCAConfig_TriggerRotation-node-244" since the cluster is too small
    log.go:172: TestConnectCAConfig_TriggerRotation-node-244 - 2020/01/24 14:44:21.554783 [INFO] manager: shutting down
FAIL agent/consul.TestConnectCAConfig_TriggerRotation (1.45s)
mkcp commented 4 years ago

Related #7042