google-code-export / yabi

Automatically exported from code.google.com/p/yabi

Failing sshd and yabi error retry #161

Closed by GoogleCodeExporter 9 years ago

GoogleCodeExporter commented 9 years ago
Certain submission hosts may fall over. Their sshd may close the connection 
before a banner is sent, during windows that may last 10 minutes or longer. In 
these cases, the ssh command carrying a job's qstat request fails every time 
and is retried repeatedly. After about 5 minutes yabi gives up and reports an 
error on the job. But the job is actually still running; it's just the head 
node that fell over. If we retry for longer to cope with weaker machines, then 
we won't see errors in dev for potentially a long time. If we retry only within 
short time frames, long outages propagate upwards as job errors. We need a 
setting for this, so that dev systems can give up quickly and production 
systems can retry for as long as needed.

Original issue reported on code.google.com by retrogra...@gmail.com on 1 Mar 2012 at 2:39

GoogleCodeExporter commented 9 years ago
Fixed in yabibe-release-5.14.1 on branch yabibe-5.14-1

Implemented a retry window setting in yabi.conf. In the [taskmanager] 
section, set how long to keep retrying with the 'retrywindow' setting. The 
value is in seconds. Use large values for production servers.
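
For illustration only, here is a rough Python sketch of how a retry window like this could behave. It is not yabi's actual code. Only the [taskmanager] section name and the 'retrywindow' key come from the fix above; the example value of 3600, the 300 second fallback (loosely based on the "about 5 minutes" mentioned in the report), the 5 second delay between attempts, and the helper names load_retry_window and retry_within_window are all assumptions made up for this example.

```python
import configparser
import time

# Hypothetical yabi.conf fragment. The section and key names come from the
# fix description; the value 3600 is only an example for a production host.
EXAMPLE_CONF = """
[taskmanager]
retrywindow = 3600
"""


def load_retry_window(conf_text, default=300.0):
    """Return the retry window in seconds (the default is an assumed value)."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return parser.getfloat("taskmanager", "retrywindow", fallback=default)


def retry_within_window(action, window, delay=5.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Call action() until it succeeds or `window` seconds have passed.

    Once the window is exhausted, the last exception is re-raised so the
    caller can report the error on the job.
    """
    deadline = clock() + window
    while True:
        try:
            return action()
        except Exception:
            if clock() >= deadline:
                raise
            sleep(delay)
```

With something like this, a dev system could set a small window so failures surface quickly, while a production system could set a window longer than the expected head node outage, so that a temporary ssh failure is not reported as a job error.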

Original comment by retrogra...@gmail.com on 1 Mar 2012 at 6:38