Open reidpr opened 6 years ago
Note this hasn't caused CI to fail in a long time, IIRC.
I looked into this a bit. Currently we have two sleeps on the machine running the tests, waiting for the master to start up and shut down; for these, it's straightforward to make a `grep` loop. We also have a sleep that can run on multiple nodes with `pdsh`; for these, it's not clear to me how to do the `grep` loop since I didn't figure out whether the remote `grep`'s exit code will make its way back to the caller.
The `sleepcat` function I'm adding in `ch-run_join.bats` might also help.
The Spark tests contain a number of race conditions where we start up processes in the background, wait a bit, and check once for expected output in the logs. If the output isn't present, we fail the test even if the expected output appeared later.
This often causes spurious Travis failures that can be resolved by simply re-running the job (and re-running the race).
Proposed solution: Put the output checks in a loop with a timeout.
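A minimal sketch of what such a loop could look like for the local case (checking a log file on the machine running the tests). The function name `wait_for_pattern`, its argument order, and the default timeout are assumptions for illustration, not anything that exists in the test suite today:

```bash
# Hypothetical helper: poll FILE once per second until a line matching
# PATTERN (an extended regex) appears, or until roughly TIMEOUT seconds
# have passed. Returns 0 if the pattern was found, 1 on timeout.
# Usage: wait_for_pattern PATTERN FILE [TIMEOUT]
wait_for_pattern () {
    local pattern="$1"
    local file="$2"
    local timeout="${3:-10}"   # assumed default; tune per test
    local elapsed=0
    # grep -q exits 0 on a match; a missing file counts as "not yet".
    while ! grep -Eq -- "$pattern" "$file" 2>/dev/null; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        # Counts sleeps only, so the real timeout is slightly longer.
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

In a test this would replace the fixed sleep plus one-shot check, e.g. `wait_for_pattern 'expected output' "$logfile" 30`, where the pattern and log path are placeholders. The test then passes as soon as the output shows up and fails only if it never appears within the timeout. This does not address the `pdsh` case, where the exit code question above still needs an answer.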