radical-collaboration / hpc-workflows

NSF16514 EarthCube Project - Award Number:1639694

Some checks to avoid job hanging? #123

Closed wjlei1990 closed 4 years ago

wjlei1990 commented 4 years ago

Yesterday I had an issue with a job.

The situation is that I forgot to load one module in the EnTK script. However, when I launched the EnTK job script, the job ran normally without interruption. Because of the missing module, all the tasks just kept burning hours without being killed.

I am a bit worried that in the future, if the system settings change and I submit jobs without noticing the change, I will waste resources on Summit. I have already encountered such cases a few times.

Do you know if there is any way to prevent such cases? I think that if the job were submitted using an LSF script and a module were missing, LSF would kill the job immediately. Correct me if I am wrong.

One thing I am thinking about is to launch one small task and, if it runs successfully, then launch the large job. I think this would work as a small workaround for the job hanging issue.
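The warm-up idea can be sketched in plain Python, outside EnTK. This is only a minimal sketch, assuming the environment check can be expressed as a short command. `run_canary` and the example command are hypothetical names for illustration, not part of EnTK or LSF.

```python
import subprocess
import sys


def run_canary(cmd, timeout_s=60):
    """Run a short command; return True only if it exits with code 0 in time."""
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # A hang or a missing executable both count as a failed canary.
        return False
    return result.returncode == 0


# Hypothetical canary: check that something the tasks need can be imported.
canary_cmd = [sys.executable, "-c", "import json"]

if run_canary(canary_cmd):
    print("canary passed: submit the large job")
else:
    print("canary failed: do not submit the large job")
```

The timeout matters here: a canary that hangs is treated the same as one that fails, so the check itself cannot burn hours.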

If you want to replicate this issue, I can provide you with the script and executable.

I think Lucas also had similar issues on Traverse. I also had another issue, ticket #109. It was because I was assigning the resources incorrectly, and the job hung without doing anything but burning hours.

wjlei1990 commented 4 years ago

One idea that came to my mind is to have some checks after each task is run. For example, say I have 100 tasks submitted. If EnTK has run 10 of them and all of them have failed, then the whole job needs to be killed.

Not sure if EnTK is capable of doing that?

andre-merzky commented 4 years ago

You should get notifications for each task state change, and can count the failing tasks, and take action on a certain threshold? I am not too familiar with RE's API, maybe @mturilli or @lee212 can comment, but I think you have to register a function as post_exec on a task or pipeline instance to get the callback mechanism.

Would that address your use case?
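The counting logic behind such a callback could look roughly like the sketch below. It does not use the EnTK API: `FailureMonitor`, `on_state_change`, and the state strings `"DONE"` and `"FAILED"` are assumptions made for illustration, and wiring the callback into a real workflow is left out.

```python
class FailureMonitor:
    """Count finished tasks and flag when failures reach a threshold."""

    def __init__(self, min_finished=10, max_failure_ratio=1.0):
        self.min_finished = min_finished          # wait for this many results
        self.max_failure_ratio = max_failure_ratio
        self.finished = 0
        self.failed = 0

    def on_state_change(self, task_name, state):
        """Hypothetical callback, invoked on every task state change."""
        if state not in ("DONE", "FAILED"):
            return  # ignore intermediate states
        self.finished += 1
        if state == "FAILED":
            self.failed += 1

    def should_abort(self):
        """True once enough tasks have finished and too many of them failed."""
        if self.finished < self.min_finished:
            return False
        return self.failed / self.finished >= self.max_failure_ratio


# Simulate the first 10 of 100 tasks all failing.
monitor = FailureMonitor(min_finished=10, max_failure_ratio=1.0)
for i in range(10):
    monitor.on_state_change(f"task.{i:04d}", "FAILED")

if monitor.should_abort():
    print("all early tasks failed: cancel the remaining tasks")
```

With `max_failure_ratio=1.0` the monitor only triggers when every finished task has failed, which matches the "10 out of 10 failed" case described above; a lower ratio would make it stricter.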

lee212 commented 4 years ago

I'd like to echo the post_exec feature with the example here: https://radicalentk.readthedocs.io/en/latest/adv_examples/adapt_tc.html

It has an evaluation function that runs after a stage finishes to decide whether to continue. Based on your description, you could run 10 tasks in a stage with post_exec to determine whether to kill the job or continue with the remaining 90 tasks.
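The control flow of that suggestion (a probe stage of 10 tasks, an evaluation step, then the remaining 90) is sketched below in plain Python. It only mimics the shape of the approach and makes no EnTK calls: `run_stage`, `run_with_probe_stage`, and `always_fail` are hypothetical helpers, and the linked example remains the reference for the real `post_exec` usage.

```python
def run_stage(tasks, run_task):
    """Run every task in a stage; return a list of booleans (True = success)."""
    return [run_task(task) for task in tasks]


def run_with_probe_stage(tasks, run_task, n_probe=10):
    """Run n_probe tasks first; run the rest only if any probe task succeeded."""
    probe_results = run_stage(tasks[:n_probe], run_task)

    # Evaluation step, playing the role of the function run after the stage.
    if not any(probe_results):
        return {"completed": len(probe_results), "aborted": True}

    rest_results = run_stage(tasks[n_probe:], run_task)
    return {"completed": len(probe_results) + len(rest_results), "aborted": False}


def always_fail(task):
    """Hypothetical task runner that fails every time, as with a missing module."""
    return False


outcome = run_with_probe_stage(list(range(100)), always_fail)
print(outcome)
```

In this simulated failure case only the 10 probe tasks are run, so the other 90 never consume allocation time.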

wjlei1990 commented 4 years ago

I will try to use some warm-up jobs to check the job status and will update you.