moby / swarmkit

A toolkit for orchestrating distributed systems at any scale. It includes primitives for node discovery, raft-based consensus, task scheduling and more.
Apache License 2.0
3.34k stars 611 forks source link

Possibly flakey test(s) github.com/docker/swarmkit/manager/state/raft #2226

Open ijc opened 7 years ago

ijc commented 7 years ago

While testing #2221/#2224 I saw this exactly once, out of 20 iterations, when running with #2221 and an updated grpc. I did not see it when testing master (20 times) or in #2222+#2223+#2224+updated grpc (20 times).

The first failure was TestRaftRestartClusterStaggered followed by (perhaps a cascading failure) TestGCWAL, TestRaftRestartClusterSimultaneously then TestRaftEncryptionKeyRotationStress.

time="2017-06-08T10:42:50+01:00" level=warning msg="no certificate expiration specified, using default" cluster.id=wi78qmd94o16fm0gub1ll1eo5 method="(*Server).UpdateRootCA" 
2017-06-08 10:42:51.400360 E | snap: failed to remove broken snapshot file /tmp/test-raft106872513/snap-v3-encrypted/0000000000000003-0000000000000008.snap
2017-06-08 10:42:51.671716 E | snap: failed to remove broken snapshot file /tmp/test-raft106872513/snap-v3-encrypted/0000000000000003-0000000000000009.snap
2017-06-08 10:42:55.160822 W | wal: sync duration of 1.146225908s, expected less than 1s
2017-06-08 10:42:55.161333 W | wal: sync duration of 1.089977619s, expected less than 1s
2017-06-08 10:42:55.164095 W | wal: sync duration of 1.146200021s, expected less than 1s
2017-06-08 10:42:56.323358 I | wal: segmented wal file /tmp/test-raft028139820/wal-v3-encrypted/0000000000000001-000000000000000d.wal is created
2017-06-08 10:42:56.449621 I | wal: segmented wal file /tmp/test-raft116248201/wal-v3-encrypted/0000000000000001-000000000000000d.wal is created
2017-06-08 10:42:56.455358 I | wal: segmented wal file /tmp/test-raft839343441/wal-v3-encrypted/0000000000000001-000000000000000d.wal is created
2017-06-08 10:43:00.140303 W | wal: sync duration of 1.45415386s, expected less than 1s
2017-06-08 10:43:10.586552 W | wal: sync duration of 11.78176386s, expected less than 1s
2017-06-08 10:43:10.587502 W | wal: sync duration of 11.682331417s, expected less than 1s
2017-06-08 10:43:10.588319 W | wal: sync duration of 11.488239764s, expected less than 1s
2017-06-08 10:43:10.589391 W | wal: sync duration of 11.051145397s, expected less than 1s
2017-06-08 10:43:10.705648 W | wal: sync duration of 9.380844153s, expected less than 1s
2017-06-08 10:43:10.706155 W | wal: sync duration of 9.82846409s, expected less than 1s
2017-06-08 10:43:10.707149 W | wal: sync duration of 9.387910459s, expected less than 1s
2017-06-08 10:43:10.707807 W | wal: sync duration of 8.475760419s, expected less than 1s
2017-06-08 10:43:10.709476 W | wal: sync duration of 2.673798517s, expected less than 1s
2017-06-08 10:43:10.710804 W | wal: sync duration of 5.032587115s, expected less than 1s
2017-06-08 10:43:10.712900 W | wal: sync duration of 9.827482872s, expected less than 1s
2017-06-08 10:43:10.713135 W | wal: sync duration of 10.11063637s, expected less than 1s
2017-06-08 10:43:10.713655 W | wal: sync duration of 9.832549676s, expected less than 1s
2017-06-08 10:43:10.715498 W | wal: sync duration of 7.338432017s, expected less than 1s
2017-06-08 10:43:10.717089 W | wal: sync duration of 9.395159342s, expected less than 1s
2017-06-08 10:43:10.720401 W | wal: sync duration of 5.050440243s, expected less than 1s
--- FAIL: TestRaftRestartClusterStaggered (13.82s)
        <autogenerated>:12: ^M                        ^M        Error Trace:    testutils.go:89
                ^M                      raft_test.go:600
                ^M                      raft_test.go:646
                ^M      Error:          Received unexpected error did not find a ready leader in member list
                        github.com/docker/swarmkit/manager/state/raft/testutils.WaitForCluster.func1
                                /home/ijc/development/Docker/Swarm/gocode/src/github.com/docker/swarmkit/manager/state/raft/testutils/testutils.go:73
                        github.com/docker/swarmkit/testutils.PollFuncWithTimeout
                                /home/ijc/development/Docker/Swarm/gocode/src/github.com/docker/swarmkit/testutils/poll.go:22
                        github.com/docker/swarmkit/testutils.PollFunc
                                /home/ijc/development/Docker/Swarm/gocode/src/github.com/docker/swarmkit/testutils/poll.go:36
                        github.com/docker/swarmkit/manager/state/raft/testutils.WaitForCluster
                                /home/ijc/development/Docker/Swarm/gocode/src/github.com/docker/swarmkit/manager/state/raft/testutils/testutils.go:88
                        github.com/docker/swarmkit/manager/state/raft_test.testRaftRestartCluster
                                /home/ijc/development/Docker/Swarm/gocode/src/github.com/docker/swarmkit/manager/state/raft/raft_test.go:600
                        github.com/docker/swarmkit/manager/state/raft_test.TestRaftRestartClusterStaggered
                                /home/ijc/development/Docker/Swarm/gocode/src/github.com/docker/swarmkit/manager/state/raft/raft_test.go:646
                        testing.tRunner
                                /usr/lib/go-1.8/src/testing/testing.go:657
                        runtime.goexit
                                /usr/lib/go-1.8/src/runtime/asm_amd64.s:2197
                        github.com/docker/swarmkit/testutils.PollFuncWithTimeout
                                /home/ijc/development/Docker/Swarm/gocode/src/github.com/docker/swarmkit/testutils/poll.go:28: polling failed

2017-06-08 10:43:11.324673 I | wal: segmented wal file /tmp/test-raft284649420/wal-v3-encrypted/0000000000000001-000000000000000d.wal is created
2017-06-08 10:43:11.511113 I | wal: segmented wal file /tmp/test-raft223302878/wal-v3-encrypted/0000000000000001-000000000000000d.wal is created
2017-06-08 10:43:11.522924 I | wal: segmented wal file /tmp/test-raft046398403/wal-v3-encrypted/0000000000000001-000000000000000d.wal is created
--- FAIL: TestGCWAL (21.89s)
        <autogenerated>:12: ^M                        ^M        Error Trace:    storage_test.go:387
                ^M      Error:          Received unexpected error raft: failed to process the request: node lost leader status
                        github.com/docker/swarmkit/manager/state/raft.init
                                /home/ijc/development/Docker/Swarm/gocode/src/github.com/docker/swarmkit/manager/state/raft/raft.go:51
                        main.init
                                github.com/docker/swarmkit/manager/state/raft/_test/_testmain.go:100
                        runtime.main
                                /usr/lib/go-1.8/src/runtime/proc.go:173
                        runtime.goexit
                                /usr/lib/go-1.8/src/runtime/asm_amd64.s:2197
                ^M      Messages:       failed to propose value

        <autogenerated>:12: ^M                        ^M        Error Trace:    storage_test.go:387
                ^M      Error:          Received unexpected error raft: failed to process the request: node lost leader status
                        github.com/docker/swarmkit/manager/state/raft.init
                                /home/ijc/development/Docker/Swarm/gocode/src/github.com/docker/swarmkit/manager/state/raft/raft.go:51
                        main.init
                                github.com/docker/swarmkit/manager/state/raft/_test/_testmain.go:100
                        runtime.main
                                /usr/lib/go-1.8/src/runtime/proc.go:173
                        runtime.goexit
                                /usr/lib/go-1.8/src/runtime/asm_amd64.s:2197
                ^M      Messages:       failed to propose value

        <autogenerated>:12: ^M                        ^M        Error Trace:    storage_test.go:387
                ^M      Error:          Received unexpected error raft: failed to process the request: node lost leader status
                        github.com/docker/swarmkit/manager/state/raft.init
                                /home/ijc/development/Docker/Swarm/gocode/src/github.com/docker/swarmkit/manager/state/raft/raft.go:51
                        main.init
                                github.com/docker/swarmkit/manager/state/raft/_test/_testmain.go:100
                        runtime.main
                                /usr/lib/go-1.8/src/runtime/proc.go:173
                        runtime.goexit
                                /usr/lib/go-1.8/src/runtime/asm_amd64.s:2197
                ^M      Messages:       failed to propose value

        <autogenerated>:12: ^M                        ^M        Error Trace:    storage_test.go:387
                ^M      Error:          Received unexpected error raft: failed to process the request: node lost leader status
                        github.com/docker/swarmkit/manager/state/raft.init
                                /home/ijc/development/Docker/Swarm/gocode/src/github.com/docker/swarmkit/manager/state/raft/raft.go:51
                        main.init
                                github.com/docker/swarmkit/manager/state/raft/_test/_testmain.go:100
                        runtime.main
                                /usr/lib/go-1.8/src/runtime/proc.go:173
                        runtime.goexit
                                /usr/lib/go-1.8/src/runtime/asm_amd64.s:2197
                ^M      Messages:       failed to propose value

        <autogenerated>:12: ^M                        ^M        Error Trace:    storage_test.go:387
                ^M      Error:          Received unexpected error raft: failed to process the request: node lost leader status
                        github.com/docker/swarmkit/manager/state/raft.init
                                /home/ijc/development/Docker/Swarm/gocode/src/github.com/docker/swarmkit/manager/state/raft/raft.go:51
                        main.init
                                github.com/docker/swarmkit/manager/state/raft/_test/_testmain.go:100
                        runtime.main
                                /usr/lib/go-1.8/src/runtime/proc.go:173
                        runtime.goexit
                                /usr/lib/go-1.8/src/runtime/asm_amd64.s:2197
                ^M      Messages:       failed to propose value

        <autogenerated>:12: ^M                        ^M        Error Trace:    storage_test.go:387
                ^M      Error:          Received unexpected error raft: failed to process the request: node lost leader status
                        github.com/docker/swarmkit/manager/state/raft.init
                                /home/ijc/development/Docker/Swarm/gocode/src/github.com/docker/swarmkit/manager/state/raft/raft.go:51
                        main.init
                                github.com/docker/swarmkit/manager/state/raft/_test/_testmain.go:100
                        runtime.main
                                /usr/lib/go-1.8/src/runtime/proc.go:173
                        runtime.goexit
                                /usr/lib/go-1.8/src/runtime/asm_amd64.s:2197
                ^M      Messages:       failed to propose value

        <autogenerated>:12: ^M                        ^M        Error Trace:    storage_test.go:387
                ^M      Error:          Received unexpected error raft: failed to process the request: node lost leader status
                        github.com/docker/swarmkit/manager/state/raft.init
                                /home/ijc/development/Docker/Swarm/gocode/src/github.com/docker/swarmkit/manager/state/raft/raft.go:51
                        main.init
                                github.com/docker/swarmkit/manager/state/raft/_test/_testmain.go:100
                        runtime.main
                                /usr/lib/go-1.8/src/runtime/proc.go:173
                        runtime.goexit
                                /usr/lib/go-1.8/src/runtime/asm_amd64.s:2197
                ^M      Messages:       failed to propose value

        <autogenerated>:12: ^M                        ^M        Error Trace:    storage_test.go:387
                ^M      Error:          Received unexpected error raft: failed to process the request: node lost leader status
                        github.com/docker/swarmkit/manager/state/raft.init
                                /home/ijc/development/Docker/Swarm/gocode/src/github.com/docker/swarmkit/manager/state/raft/raft.go:51
                        main.init
                                github.com/docker/swarmkit/manager/state/raft/_test/_testmain.go:100
                        runtime.main
                                /usr/lib/go-1.8/src/runtime/proc.go:173
                        runtime.goexit
                                /usr/lib/go-1.8/src/runtime/asm_amd64.s:2197
                ^M      Messages:       failed to propose value

--- FAIL: TestRaftRestartClusterSimultaneously (16.23s)
        <autogenerated>:12: ^M                        ^M        Error Trace:    testutils.go:89
                ^M                      raft_test.go:600
                ^M                      raft_test.go:638
                ^M      Error:          Received unexpected error did not find a ready leader in member list
                        github.com/docker/swarmkit/manager/state/raft/testutils.WaitForCluster.func1
                                /home/ijc/development/Docker/Swarm/gocode/src/github.com/docker/swarmkit/manager/state/raft/testutils/testutils.go:73
                        github.com/docker/swarmkit/testutils.PollFuncWithTimeout
                                /home/ijc/development/Docker/Swarm/gocode/src/github.com/docker/swarmkit/testutils/poll.go:22
                        github.com/docker/swarmkit/testutils.PollFunc
                                /home/ijc/development/Docker/Swarm/gocode/src/github.com/docker/swarmkit/testutils/poll.go:36
                        github.com/docker/swarmkit/manager/state/raft/testutils.WaitForCluster
                                /home/ijc/development/Docker/Swarm/gocode/src/github.com/docker/swarmkit/manager/state/raft/testutils/testutils.go:88
                        github.com/docker/swarmkit/manager/state/raft_test.testRaftRestartCluster
                                /home/ijc/development/Docker/Swarm/gocode/src/github.com/docker/swarmkit/manager/state/raft/raft_test.go:600
                        github.com/docker/swarmkit/manager/state/raft_test.TestRaftRestartClusterSimultaneously
                                /home/ijc/development/Docker/Swarm/gocode/src/github.com/docker/swarmkit/manager/state/raft/raft_test.go:638
                        testing.tRunner
                                /usr/lib/go-1.8/src/testing/testing.go:657
                        runtime.goexit
                                /usr/lib/go-1.8/src/runtime/asm_amd64.s:2197
                        github.com/docker/swarmkit/testutils.PollFuncWithTimeout
                                /home/ijc/development/Docker/Swarm/gocode/src/github.com/docker/swarmkit/testutils/poll.go:28: polling failed

--- FAIL: TestRaftEncryptionKeyRotationStress (25.77s)
        <autogenerated>:12: ^M                        ^M        Error Trace:    storage_test.go:782
                ^M      Error:          Received unexpected error not done rotating yet
                        github.com/docker/swarmkit/testutils.PollFuncWithTimeout
                                /home/ijc/development/Docker/Swarm/gocode/src/github.com/docker/swarmkit/testutils/poll.go:28: polling failed

FAIL
FAIL    github.com/docker/swarmkit/manager/state/raft   31.481s

Full test-fix.18.log

Note that there is also an instance of #2225 in this log.

aaronlehmann commented 7 years ago

I'm wondering if this is reliably connected to #2221/#2224. The wal: sync duration log messages suggest that the system is taking a very long time to fsync, which can cause failures. We see this in CI sometimes (I believe it uses networked storage). Any idea if the system might have been under I/O load at that time?

ijc commented 7 years ago

I didn't mean to suggest this was related to #2221/#2224, sorry, was just trying to give context for what code I was running when I saw it vs when I didn't. I expect this is just a 1/60 failure rate.

It's my development laptop (a 4th gen Carbon X1 with an SSD) and I did leave those tests running while I did other stuff (because 20x iterations takes quite a long time). I was quite likely starting quite a few VMs at various points (sucking up RAM and likely causing swapping) and my Firefox was really crawling yesterday because its RSS was pretty huge, so yes, there was likely to be some I/O load (for all 3x20 cases I referred to).

aaronlehmann commented 7 years ago

CI now uses tmpfs for these files, so this shouldn't happen in actual CI runs.