ow2-proactive / scheduling

Multi-platform Scheduling and Workflows Engine
http://www.activeeon.com/workflows-scheduling
GNU Affero General Public License v3.0

After reconnection, RMNode is killed by RM #1986

Closed: activeeon-bot closed this issue 11 months ago

activeeon-bot commented 9 years ago

Original issue created by Mauricio Jost on 19 Jan 2015 at 14:41 - SCHEDULING-2222


To reproduce:

1. Take release 6.1.0.
2. Start the server (using PNP) on host A.
3. Start a node on host B.
4. Disconnect the network between host A and host B.
5. Wait until the RMNode notices the failure and tries to reconnect.
6. Reconnect the network.

In some cases the node will simply die.

activeeon-bot commented 9 years ago

Attachment:

activeeon-bot commented 9 years ago

Original comment posted by Mauricio Jost on 23 Jan 2015 at 14:53


Additional information: this bug seems to have been introduced in 6.1.0. I ran a couple of tests on 6.0.1 and node reconnection seems to work just fine.

activeeon-bot commented 9 years ago

Original comment posted by Mauricio Jost on 23 Jan 2015 at 16:28


Using scheduling e7e08538c4de0c8147737d79333370870fa1c268.

I made 4 attempts (disconnecting and reconnecting the RM and the RMNode) and saw the bug only once. The logs on the server side look like this:

[2015-01-23 16:22:27,768 WARN          o.o.p.r.u.ClientPinger] Client "rm" is down.
[2015-01-23 16:22:27,776 INFO                o.o.p.r.c.RMCore] "rm" disconnected from HalfBody_pa.stub.org.ow2.proactive.resourcemanager.nodesource.dataspace._StubDataSpaceNodeConfigurationAgent#configureNode_92574
[2015-01-23 16:22:28,121 INFO            o.o.p.r.n.NodeSource] [LocalNodes] Pinging alive nodes
[2015-01-23 16:22:28,126 DEBUG           o.o.p.r.n.NodeSource] Node pnp://172.16.50.1:37003/local-LocalNodes-2 is alive
[2015-01-23 16:22:28,126 DEBUG           o.o.p.r.n.NodeSource] Node pnp://172.16.50.1:55557/local-LocalNodes-1 is alive
[2015-01-23 16:22:28,127 DEBUG           o.o.p.r.n.NodeSource] Node pnp://172.16.50.1:48381/local-LocalNodes-3 is alive
[2015-01-23 16:22:28,128 DEBUG           o.o.p.r.n.NodeSource] Node pnp://172.16.50.1:50386/local-LocalNodes-0 is alive
[2015-01-23 16:22:48,200 INFO            o.o.p.r.n.NodeSource] [Default] Pinging alive nodes
[2015-01-23 16:22:57,262 WARN                  p.remoteobject] Node Source threadpool # 0 #7-thread-5 : unable to contact remote object [pnp://ubuntu-virtual-machine.local:33688/ubuntu-virtual-machine_37656] when calling method getActiveObjects
org.objectweb.proactive.core.exceptions.IOException6: Failed to send PNP message to pnp://ubuntu-virtual-machine.local:33688/ubuntu-virtual-machine_37656
        at org.objectweb.proactive.extensions.pnp.PNPROMessage.send(PNPROMessage.java:117)
        at org.objectweb.proactive.extensions.pnp.PNPRemoteObject.receiveMessage(PNPRemoteObject.java:82)
        at org.objectweb.proactive.core.remoteobject.RemoteObjectSet.receiveMessage(RemoteObjectSet.java:205)
        at org.objectweb.proactive.core.remoteobject.RemoteObjectAdapter.receiveMessage(RemoteObjectAdapter.java:151)
        at org.objectweb.proactive.core.remoteobject.SynchronousProxy.reify(SynchronousProxy.java:78)
        at pa.stub.org.objectweb.proactive.core.runtime._StubProActiveRuntime.getActiveObjects(_StubProActiveRuntime.java)
        at org.objectweb.proactive.core.runtime.ProActiveRuntimeRemoteObjectAdapter.getActiveObjects(ProActiveRuntimeRemoteObjectAdapter.java:120)
        at org.objectweb.proactive.core.node.NodeImpl.getNumberOfActiveObjects(NodeImpl.java:176)
        at org.ow2.proactive.resourcemanager.nodesource.NodeSource$1.run(NodeSource.java:706)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
Caused by: org.objectweb.proactive.extensions.pnp.exception.PNPHeartbeatTimeoutException: Hearthbeat not received in time (9000 ms)
        at org.objectweb.proactive.extensions.pnp.PNPAgent$Parking.unlockDueToDisconnection(PNPAgent.java:627)
        at org.objectweb.proactive.extensions.pnp.PNPAgent$Parking.run(PNPAgent.java:617)
        at org.jboss.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:546)
        at org.jboss.netty.util.HashedWheelTimer$Worker.notifyExpiredTimeouts(HashedWheelTimer.java:446)
        at org.jboss.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:395)
        at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
        ... 1 more
[2015-01-23 16:22:57,274 INFO            o.o.p.r.n.NodeSource] [Default] Detected down node pnp://ubuntu-virtual-machine.local:33688/ubuntu-virtual-machine_37656
[2015-01-23 16:22:57,274 INFO  i.DefaultInfrastructureManager] Terminating the node ubuntu-virtual-machine_37656
[2015-01-23 16:22:57,276 INFO  i.DefaultInfrastructureManager] Terminating the runtime pnp://ubuntu-virtual-machine.local:33688/PA_JVM1510426333
[2015-01-23 16:22:57,278 INFO                o.o.p.r.c.RMCore] The node pnp://ubuntu-virtual-machine.local:33688/ubuntu-virtual-machine_37656 provided by "rm" is down
[2015-01-23 16:23:13,122 INFO            o.o.p.r.n.NodeSource] [LocalNodes] Pinging alive nodes
activeeon-bot commented 9 years ago

Original comment posted by Mauricio Jost on 23 Jan 2015 at 18:51


The NodeSource regularly executes pingNode. If the node is detected as down, the NodeSource executes detectedPingedDownNode and calls infrastructureManager.internalRemoveNode(...) once, which in turn calls removeNode on the chosen InfrastructureManager. If that InfrastructureManager is DefaultInfrastructureManager, it sends a killRT to the node. If at this point the node has just reconnected, or the network has buffered the request, the node is killed and its process is forced to exit right after reconnection. My tests were done on VMs, so the network disconnection was simulated by disconnecting the VMs' network; maybe that favours packet buffering, but I am not sure. In any case, when this happens in virtual environments the node has a high chance of being killed.
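
For illustration, here is a minimal, self-contained Java sketch of the race described above. The class and method names are illustrative stand-ins, not the actual ProActive code: the point is only that the liveness verdict is taken once and the kill is carried out later, so a reconnection inside that window does not cancel it.

```java
// Illustrative sketch only: stand-ins for the pingNode() probe and the
// killRT request described in this comment, not the real ProActive classes.
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class DownNodeRaceSketch {

    // Stands in for the node side: flips back to true when the node reconnects.
    static final AtomicBoolean connected = new AtomicBoolean(true);
    static final AtomicBoolean killed = new AtomicBoolean(false);

    // Stands in for the single liveness probe (pingNode).
    static boolean pingNode() {
        return connected.get();
    }

    // Stands in for the kill request (killRT): it does not re-check liveness.
    static void killRuntime() {
        killed.set(true);
    }

    public static void main(String[] args) throws InterruptedException {
        // 1. Network partition: the ping sees the node as down.
        connected.set(false);
        boolean down = !pingNode();

        // 2. The node reconnects in the window between detection and the kill.
        connected.set(true);

        // 3. The removal decided in step 1 is still executed.
        if (down) {
            TimeUnit.MILLISECONDS.sleep(10); // the window can be arbitrarily long
            killRuntime();
        }

        System.out.println("node reconnected but killed anyway: " + killed.get());
    }
}
```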

fviale commented 8 years ago

I don't really understand this issue. For me this is only a configuration problem. In config/scheduler/settings.ini, the properties pa.scheduler.core.nodepingfrequency and pa.scheduler.core.node.ping.attempts are supposed to control the delay after which the scheduler sees a node as down (and reschedules the task).
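
As a rough illustration of the knobs mentioned here (the property names are the ones cited above; the values below are placeholders, not the shipped defaults):

```ini
# config/scheduler/settings.ini (excerpt; illustrative values, check your release's defaults and units)
# Frequency at which the scheduler pings the nodes running its tasks.
pa.scheduler.core.nodepingfrequency=20
# Number of failed pings tolerated before the scheduler considers a node down.
pa.scheduler.core.node.ping.attempts=5
```

Raising either value makes the scheduler more tolerant of transient disconnections, at the cost of slower failure detection and task rescheduling.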

When the reconnection mechanism was implemented, it was stated that the scheduler's node ping frequency absolutely has to be increased to allow reconnection. In other words, how can we guarantee both node failure detection / task rescheduling and node reconnection with result preservation at the same time? The two scenarios are completely disjoint and cannot work together: 1) node reconnection / result preservation is useful for a very specific kind of task (lasting hours or so); 2) node failure detection is useful for more dynamic environments.

What could be done to improve this all-or-nothing configuration is to allow a task to specify its own ping frequency (this ping is actually performed by the scheduler, in order to get the task's progress). That way we could allow more fine-grained control.
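
A hypothetical sketch of that idea (none of these names exist in the scheduler; the generic-information key and the values are invented for illustration): a per-task override that falls back to the global property.

```java
// Hypothetical per-task ping frequency: invented names, for illustration only.
import java.util.Map;
import java.util.Optional;

public class PerTaskPingFrequencySketch {

    // Global fallback, analogous to pa.scheduler.core.nodepingfrequency.
    static final long DEFAULT_PING_FREQUENCY_MS = 20_000;

    // A task could carry an optional override, e.g. in its generic information.
    static long pingFrequencyFor(Map<String, String> taskGenericInformation) {
        return Optional.ofNullable(taskGenericInformation.get("PING_FREQUENCY_MS"))
                .map(Long::parseLong)
                .orElse(DEFAULT_PING_FREQUENCY_MS);
    }

    public static void main(String[] args) {
        // A long-lasting task asks for a tolerant ping period; short tasks keep the default.
        System.out.println(pingFrequencyFor(Map.of("PING_FREQUENCY_MS", "3600000"))); // 3600000
        System.out.println(pingFrequencyFor(Map.of()));                               // 20000
    }
}
```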

tobwiens commented 8 years ago

This seems to be one reason why the tests fail when they are run as a whole. I think @paraita is investigating this. If we can solve this by configuration, :+1:.

Why do we remove nodes after such a short timeout? Most software I have worked with had a 24-hour timeout on failing nodes in a distributed system. Why is ours so short?

fviale commented 8 years ago

Well, it all depends on the type of application running. If we want more dynamic behaviour, we need a lower timeout (and it is good for demos). I think Jenkins has a low timeout, for example.

But I agree that for long-lasting tasks, a short timeout is not well suited.

tobwiens commented 8 years ago

I agree that task timeouts are debatable, and it is good that they are adjustable. But why does it take action and remove and kill a node after the same timeout?

The example I have in mind: if you have a 3-hour network outage, your tasks will be rescheduled onto infrastructure that is not "timed out", but after three hours the nodes just start communicating again. So why even bother killing, or not killing, a node on the server side?

fviale commented 8 years ago

You have a good point. To kill, or not to kill, that is the question...

I don't know for sure at this point.