oar-team / oar

OAR is a versatile resource and task manager (also called a batch scheduler) for clusters and other computing infrastructures.
http://oar.imag.fr/
GNU General Public License v2.0

Do not suspect a node when an error is due to the connection to the deploy/cosystem frontend #48

Open npf opened 9 years ago

npf commented 9 years ago

If a job is of type cosystem or deploy and the connection to the cosystem or deploy frontend fails, the following messages are logged and the first node of the job is suspected.

  server |       oar.log : [debug] [2015-07-05 00:19:11.897] [bipbip 27] execute oarexec on node 127.0.0.1
  server |       oar.log : ssh: connect to host 127.0.0.1 port 6667: Connection refused
  server |       oar.log : [debug] [2015-07-05 00:19:11.903] [bipbip 27] Job 27 is ended
  server |       oar.log : [debug] [2015-07-05 00:19:11.917] [bipbip 27] error of oarexec, exit value = 255; the job 27 is in Error and the node node3 is Suspected; If this job is of type cosystem or deploy, check if the oar server is able to connect to the corresponding nodes, oar-node started

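There is actually no reason to suspect that node: the failure is the refused ssh connection to the frontend (127.0.0.1 port 6667 in this log), not a problem on node3.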
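A minimal sketch of the behaviour I would expect, in Python rather than in OAR's own code. Every name here (`node_to_suspect`, `FRONTEND_JOB_TYPES`, `SSH_CONNECTION_ERROR`) is hypothetical and does not exist in OAR; the only facts taken from the log above are the exit value 255 and the cosystem/deploy job types. It assumes an exit value of 255 means ssh itself failed to reach the frontend.

```python
from typing import Iterable, Optional

# Hypothetical names for illustration only; these are not OAR identifiers.
FRONTEND_JOB_TYPES = {"cosystem", "deploy"}
SSH_CONNECTION_ERROR = 255  # exit status ssh returns when ssh itself fails


def node_to_suspect(job_types: Iterable[str],
                    exit_value: int,
                    first_node: str) -> Optional[str]:
    """Return the node to mark Suspected, or None if no node is to blame."""
    if exit_value == 0:
        # oarexec succeeded, nothing to suspect.
        return None
    is_frontend_job = bool(FRONTEND_JOB_TYPES & set(job_types))
    if is_frontend_job and exit_value == SSH_CONNECTION_ERROR:
        # oarexec was launched on the cosystem/deploy frontend, so a failed
        # ssh connection says nothing about the job's own nodes.
        return None
    # Any other error keeps the behaviour seen in the log: suspect the first node.
    return first_node
```

With the values from the log, `node_to_suspect(["deploy"], 255, "node3")` returns `None`, so the job would still go to Error but node3 would stay untouched.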

npf commented 9 years ago

No obvious fix; to be revisited later (2.6).