stephenslab / dsc

Repo for Dynamic Statistical Comparisons project
https://stephenslab.github.io/dsc-wiki
MIT License

Possible connection issues with `slurm / PBS` #192

Closed (gaow closed 5 years ago)

gaow commented 5 years ago

Sometimes tasks fail for no apparent reason, even though the actual computation completed without issues. This is because the task manager doesn't hear back from PBS, possibly due to a lost connection, and thus assumes the job was not done or the output is corrupted. It would be best to figure out why this happens before solving it. The current blocker is that we cannot reliably reproduce it (I have not reproduced it myself yet; this is from a user's report).

pcarbo commented 5 years ago

As I said on Slack, I would focus less on trying to anticipate the ways Slurm could go wrong and recover from them (because that is near impossible), and instead focus on making as much progress as possible in spite of these failures.

gaow commented 5 years ago

I agree -- it is very hard to even reproduce it. In the future we might also try more robust ways to communicate with Slurm. That will involve changes to the SoS task manager. Currently it does something in between Snakemake (direct Slurm interaction based on job templates) and Nextflow (has its own job queue to interact with more generic computing systems).
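
To make "more robust" a bit more concrete, here is a rough sketch of one possible direction, not what SoS or DSC currently does. It retries the status query a few times with a growing delay, and if the scheduler never answers, it checks whether the expected output file exists before declaring a failure. All names here (`resolve_job_state`, `query_status`, `output_path`) are hypothetical, and `query_status` stands in for whatever call actually talks to Slurm or PBS.

```python
import os
import time


def resolve_job_state(query_status, output_path, retries=3, delay=1.0,
                      sleep=time.sleep):
    """Return the scheduler's state if reachable, else infer from the output.

    query_status: hypothetical callable returning a state string such as
        "completed" or "failed"; it returns None or raises OSError when
        the scheduler cannot be reached.
    output_path: file the job is expected to write.
    """
    for attempt in range(retries):
        try:
            status = query_status()
        except OSError:
            # Treat a connection problem as "no answer", not as a failure.
            status = None
        if status is not None:
            return status
        if attempt < retries - 1:
            # Exponential backoff between attempts: delay, 2*delay, ...
            sleep(delay * (2 ** attempt))

    # The scheduler never answered. A non-empty output file suggests the
    # computation finished; it does not prove the file is uncorrupted.
    if os.path.isfile(output_path) and os.path.getsize(output_path) > 0:
        return "completed"
    return "unknown"
```

Returning "unknown" rather than "failed" when nothing can be confirmed would let the caller decide whether to resubmit or keep going with the remaining tasks.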

I'll use the other ticket to track behavior on failure.