Spark tests have race conditions

hpc / charliecloud

Lightweight user-defined software stacks for high-performance computing.

https://hpc.github.io/charliecloud

Apache License 2.0

Spark tests have race conditions #54

Open reidpr opened 6 years ago

reidpr commented 6 years ago

The Spark tests contain a number of race conditions where we start up processes in the background, wait a bit, and check once for expected output in the logs. If the output isn't present, we fail the test even if the expected output appeared later.

This often causes spurious Travis failures that can be resolved by simply re-running the job (and re-running the race).

Proposed solution: Put the output checks in a loop with a timeout.
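A minimal sketch of what such a loop could look like, assuming the expected output ends up in a log file that can be grepped. The function name `wait_for_pattern`, its arguments, and the default timeout are hypothetical and not taken from the existing test suite:

```shell
# wait_for_pattern PATTERN FILE [TIMEOUT_SECONDS]
#
# Hypothetical helper (not part of the current tests): poll FILE about once
# per second until the extended regex PATTERN appears or roughly
# TIMEOUT_SECONDS have passed. Returns 0 if the pattern was found, non-zero
# otherwise.
wait_for_pattern () {
    local pattern="$1"
    local file="$2"
    local timeout="${3:-30}"   # default of 30 seconds is an arbitrary choice
    local elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # -q: no output; -s: stay quiet if FILE does not exist yet;
        # --: the pattern may start with a dash.
        if grep -Eqs -- "$pattern" "$file"; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    # One last check, so output that arrived during the final sleep counts.
    grep -Eqs -- "$pattern" "$file"
}
```

A test would then call something like `wait_for_pattern "$expected" "$log" 60` in place of the fixed sleep followed by a single grep. The test passes as soon as the output shows up and fails only if it never appears within the timeout.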

reidpr commented 3 years ago

Note this hasn't caused CI to fail in a long time, IIRC.

reidpr commented 2 years ago

I looked into this a bit. Currently we have two sleeps on the machine running the tests, one waiting for the master to start up and one waiting for it to shut down; for these, it's straightforward to make a grep loop. We also have a sleep that can run on multiple nodes with pdsh; for that one, it's not clear to me how to do the grep loop, since I didn't figure out whether the remote grep's exit code will make its way back to the caller.

reidpr commented 2 years ago

The sleepcat function I'm adding in ch-run_join.bats might also help.