gridkit / nanocloud

NanoCloud - distributed computing toolkit
Killed tunneler process on remote node is not recoverable? #15

Closed: MatsGA closed this issue 6 years ago

MatsGA commented 8 years ago

Hello,

My team is enjoying nanocloud a lot, and at the moment we are trying to figure out how to recover from a dead tunneler process on remote nodes (if the remote machine has restarted or similar).

At the moment this produces the stack trace at the bottom of this issue. Is there a built-in mechanism for handling this that we have missed, or do we need to manage it ourselves (for instance, by shutting down the Cloud instance and re-initializing it)?

ERROR org.gridkit.vicluster.ViNodeSet - ViNode[nodename] initialization has failed
java.lang.RuntimeException: java.io.IOException: Broken tunnel
    at org.gridkit.nanocloud.telecontrol.TunnellerControlConsole.openSocket(TunnellerControlConsole.java:149) ~[vicluster-core-0.8.11.jar:na]
    at org.gridkit.nanocloud.telecontrol.SimpleTunnelInitiator$CosnoleWrapper.openSocket(SimpleTunnelInitiator.java:188) ~[vicluster-core-0.8.11.jar:na]
    at org.gridkit.nanocloud.telecontrol.ProcessSporeLauncher.createProcess(ProcessSporeLauncher.java:142) ~[vicluster-core-0.8.11.jar:na]
    at org.gridkit.vicluster.telecontrol.GenericNodeTypeHandler$ProcessLauncherRule.apply(GenericNodeTypeHandler.java:417) ~[vicluster-core-0.8.11.jar:na]
    at org.gridkit.vicluster.ViEngine$InductiveRuleHook.rerun(ViEngine.java:862) ~[vicluster-core-0.8.11.jar:na]
    at org.gridkit.vicluster.ViEngineGame.play(ViEngineGame.java:83) ~[vicluster-core-0.8.11.jar:na]
    at org.gridkit.vicluster.ViEngine$Core.processPhase(ViEngine.java:301) ~[vicluster-core-0.8.11.jar:na]
    at org.gridkit.vicluster.ViEngine$Core.ignite(ViEngine.java:151) ~[vicluster-core-0.8.11.jar:na]
    at org.gridkit.vicluster.telecontrol.jvm.ViEngineNodeProvider.createNode(ViEngineNodeProvider.java:34) ~[vicluster-core-0.8.11.jar:na]
    at org.gridkit.vicluster.ViManager$ManagedNode.createNode(ViManager.java:559) ~[vicluster-core-0.8.11.jar:na]
    at org.gridkit.vicluster.ViManager$ManagedNode.access$900(ViManager.java:298) ~[vicluster-core-0.8.11.jar:na]
    at org.gridkit.vicluster.ViManager$ManagedNode$InitTask.run(ViManager.java:570) ~[vicluster-core-0.8.11.jar:na]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_25]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_25]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_25]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_25]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_25]
Caused by: java.io.IOException: Broken tunnel
    at org.gridkit.vicluster.telecontrol.bootstraper.TunnellerConnection.newSocket(TunnellerConnection.java:112) ~[vicluster-core-0.8.11.jar:na]
    at org.gridkit.nanocloud.telecontrol.TunnellerControlConsole.openSocket(TunnellerControlConsole.java:146) ~[vicluster-core-0.8.11.jar:na]
    ... 16 common frames omitted

aragozin commented 8 years ago

Nanocloud mostly follows a fail-fast philosophy. Restarting a slave server was never a case I had in mind. At the moment, all you can do is start a new cloud instance, which will initialize a fresh tunneler. I will see if I can add auto-recovery for the tunneler, though all your slaves on the restarted box will be lost anyway.
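
A minimal sketch of the "start a new cloud instance" workaround, written against plain Java rather than the Nanocloud API. Everything here is hypothetical: `Recreating`, `FakeCloud`, and the `factory`/`disposer` callbacks are illustrative stand-ins, where `factory` would be whatever code builds and configures a new cloud and `disposer` whatever shuts the old one down. The sketch assumes the failure surfaces as a `RuntimeException`, as in the stack trace above.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

public class RecreateOnFailure {

    /** Holds one live resource and rebuilds it from scratch when a task fails. */
    static final class Recreating<T> {
        private final Supplier<T> factory;  // hypothetical: creates a fresh instance
        private final Consumer<T> disposer; // hypothetical: shuts a broken instance down
        private final int maxAttempts;
        private T current;
        private int generation;

        Recreating(Supplier<T> factory, Consumer<T> disposer, int maxAttempts) {
            if (maxAttempts < 1) {
                throw new IllegalArgumentException("maxAttempts must be >= 1");
            }
            this.factory = factory;
            this.disposer = disposer;
            this.maxAttempts = maxAttempts;
        }

        /** Runs the task; on failure discards the instance and retries on a new one. */
        synchronized <R> R call(Function<T, R> task) {
            RuntimeException last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    if (current == null) {
                        current = factory.get();
                        generation++;
                    }
                    return task.apply(current);
                } catch (RuntimeException e) {
                    last = e;
                    discard();
                }
            }
            throw last;
        }

        /** Number of instances created so far. */
        synchronized int generation() {
            return generation;
        }

        private void discard() {
            T broken = current;
            current = null;
            if (broken != null) {
                try {
                    disposer.accept(broken);
                } catch (RuntimeException ignored) {
                    // The instance is already considered dead; cleanup is best effort.
                }
            }
        }
    }

    /** Stand-in for a cloud whose tunnel may be broken. Not a Nanocloud class. */
    static final class FakeCloud {
        final boolean broken;
        boolean shutDown;

        FakeCloud(boolean broken) {
            this.broken = broken;
        }

        String run(String command) {
            if (broken) {
                throw new RuntimeException(new IOException("Broken tunnel"));
            }
            return "ok: " + command;
        }
    }

    public static void main(String[] args) {
        AtomicInteger created = new AtomicInteger();
        // The first instance simulates a dead tunneler; later ones are healthy.
        Recreating<FakeCloud> cloud = new Recreating<>(
                () -> new FakeCloud(created.incrementAndGet() == 1),
                c -> c.shutDown = true,
                3);
        System.out.println(cloud.call(c -> c.run("hostname")));
        System.out.println("instances created: " + created.get());
    }
}
```

Note that the wrapper retries on any `RuntimeException`, so a bug in the task itself would also trigger a rebuild; a real version would probably inspect the cause for the tunnel `IOException` first. As mentioned above, anything that was running on the restarted box is lost either way, so the task has to be safe to run again from scratch.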

MatsGA commented 8 years ago

Thanks for the reply. We'll do as you suggest.

aragozin commented 6 years ago

Tunneler recovery was added in 0.8.12.