stephenslab / dsc

Repo for Dynamic Statistical Comparisons project
https://stephenslab.github.io/dsc-wiki
MIT License

Rethink of the "fail-fast" notion #190

Closed gaow closed 5 years ago

gaow commented 5 years ago

Currently in DSC, a failure in any single module instance causes the entire benchmark to fail. This follows the "fail-fast" notion. Some users have argued that this is not desirable behavior: DSC should run as much as it can and report failures afterwards.

pcarbo commented 5 years ago

You should still observe the "fail fast" principle, but "fail fast" should be specific to each pipeline, not to the entire benchmark.

gaow commented 5 years ago

GNU Make does "fail-fast" in a way similar to what we do, but it has an option, --keep-going, to try to finish as much as possible. So I guess both approaches have merits. I'm adding an interface to allow for both.

gaow commented 5 years ago

A --keep-going option has been added to override the default fail-fast behavior and complete as much of the benchmark as possible. More tests are needed before making a release.
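To illustrate the two policies being discussed, here is a minimal sketch (not DSC's actual implementation; the runner, pipeline names, and steps are all hypothetical) of how fail-fast and keep-going differ when running independent pipelines:

```python
# Hypothetical sketch of the two failure policies. With keep_going=False
# (fail-fast), the first failing pipeline aborts the whole run; with
# keep_going=True, remaining independent pipelines still run and all
# failures are reported at the end.

def run_pipelines(pipelines, keep_going=False):
    completed, failed = [], []
    for name, steps in pipelines:
        try:
            for step in steps:
                step()  # a step raises on module-instance failure
            completed.append(name)
        except Exception:
            failed.append(name)
            if not keep_going:
                break  # fail-fast: stop the entire benchmark
    return completed, failed

def ok():
    pass

def bad():
    raise RuntimeError("module instance failed")

pipelines = [("p1", [ok]), ("p2", [bad]), ("p3", [ok])]

# Fail-fast: p3 never runs after p2 fails.
print(run_pipelines(pipelines, keep_going=False))  # (['p1'], ['p2'])

# Keep-going: p3 still completes; p2's failure is reported at the end.
print(run_pipelines(pipelines, keep_going=True))   # (['p1', 'p3'], ['p2'])
```

The key design point is that "fail-fast" is then scoped to a single pipeline (a failing step aborts that pipeline), while --keep-going controls whether the failure also aborts the other, independent pipelines.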

jdblischak commented 5 years ago

As a reference, Snakemake also uses this convention.

--keep-going, -k  Go on with independent jobs if a job fails. Default: False

https://snakemake.readthedocs.io/en/stable/executable.html#all-options

I prefer this behavior because if I submit a long-running Snakemake job at the end of the day, I want to know that even if some errors occur, it will still run as much as possible (this was an even bigger issue back on PPS, where 10% of my jobs would randomly fail for no reason).

pcarbo commented 5 years ago

It is good to have extra motivation for having this option.