Problem with jobs that have been queuing for a long time

radical-cybertools / radical.repex.at

This is the github location for RepEx developed by the RADICAL team in conjunction with the York Lab.

Problem with jobs that have been queuing for a long time #73

Open haoyuanchen opened 8 years ago

haoyuanchen commented 8 years ago

When running GPU jobs on Stampede, a job that has been queued for a long time (~1 day, which is quite common when using more than 4 GPU nodes, because the gpudev queue is not available then) does not function properly. Typically, some necessary files are not generated or transferred, which prevents the job from running.

andre-merzky commented 8 years ago

We indeed have timeout problems in RP: https://github.com/radical-cybertools/radical.pilot/issues/645

andre-merzky commented 8 years ago

also https://github.com/radical-cybertools/radical.pilot/issues/442

haoyuanchen commented 8 years ago

Thanks! Moreover, I did not see this happen when I was running CPU jobs several months ago, some of which were also queued for a long time.

andre-merzky commented 8 years ago

Running on GPUs vs. CPUs should not make a difference from the RP perspective (I can't speak for the RepEx layer, though). But the timeout issues are not very deterministic and depend on, among other things, system settings and network quality. I am not sure when we will get around to fixing that thoroughly :(

shantenujha commented 8 years ago

Will this get resolved with client refactoring?

marksantcroos commented 8 years ago

Not automagically, but it will streamline the database interaction and therefore make it easier to address this.

ibethune commented 7 years ago

Backburner, along with the related RP ticket https://github.com/radical-cybertools/radical.pilot/issues/1129