Open ijc opened 7 years ago
I'm wondering if this is reliably connected to #2221/#2224. The "wal: sync duration" log messages suggest that the system is taking a very long time to fsync, which can cause failures. We see this in CI sometimes (I believe it uses networked storage). Any idea if the system might have been under I/O load at that time?
I didn't mean to suggest this was related to #2221/#2224, sorry, was just trying to give context for what code I was running when I saw it vs when I didn't. I expect this is just a 1/60 failure rate.
It's my development laptop (a 4th gen Carbon X1 with an SSD) and I did leave those tests running while I did other stuff (because 20x iterations takes quite a long time). I was quite likely starting quite a few VMs at various points (sucking up RAM and likely causing swapping) and my Firefox was really crawling yesterday because its RSS was pretty huge, so yes, there was likely to be some I/O load (for all 3x20 cases I referred to).
CI now uses tmpfs for these files, so this shouldn't happen in actual CI runs.
While testing #2221/#2224 I saw this exactly once, out of 20 iterations, when running with #2221 and an updated grpc. I did not see it when testing master (20 times) or in #2222+#2223+#2224+updated grpc (20 times).
The first failure was TestRaftRestartClusterStaggered, followed by (perhaps as a cascading failure) TestGCWAL, TestRaftRestartClusterSimultaneously, then TestRaftEncryptionKeyRotationStress.
Full log: test-fix.18.log
Note that there is also an instance of #2225 in this log.