gridkit / nanocloud

NanoCloud - distributed computing toolkit

Tunneller processes remain on remote host #17

Closed. terjebol closed this issue 6 years ago

terjebol commented 7 years ago

Hello.

We're using nanocloud in our project, and mostly we have had success with it.

However, we keep encountering issues where the connections are not completely shut down. On the remote side, the tunneller process (tunneller.jar) remains even after the nodes and the cloud have been shut down from the local side. This causes a resource leak where the remote server eventually runs out of available threads and SSH connections.

Is this a known issue, and if so, does anyone have a suggestion on how to fix it?

We've tried to debug this, but if there is no known fix, it would be helpful if you could point us toward the class that is actually responsible for shutting down processes on the remote host.

Thanks. -Terje

aragozin commented 7 years ago

Hi,

I have received a few reports describing a problem like yours, but I never got a chance to reproduce it myself. Could you provide more details about the OS, Java, and Nanocloud versions you are using?

Do you have a specific scenario for reproducing this problem?

Regards, Alexey

terjebol commented 7 years ago

Hi.

We're running on Red Hat 7.4, Java 8 (1.8.0_121), and the latest version of nanocloud.

I don't have a specific scenario to reproduce the problem; it just happens from time to time. We are running some long-running algorithms, though, several hours at least.

The usual process tree on the remote host is openssh/sftp -> tunneller -> booter -> our code. However, sometimes the process tree is empty below the tunneller, and the tunneller no longer has any contact with the local server. In these instances our code is also still running, but as a top-level process rather than within the normal process tree.

We also sometimes have trouble when submitting new tasks. Calling node.touch() sometimes hangs (somewhere around ProcessSporeLauncher, if I remember correctly). I believe the tunneller has died in these cases, so we work around it by adding a timeout to touch(), then shutting down the cloud and opening a new one for new tasks (roughly as sketched below).
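
In case it helps, this is a minimal sketch of that workaround, not our actual code; it only assumes the usual nanocloud ViNode, and the helper name and timeout value are illustrative:

import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.gridkit.vicluster.ViNode;

public class TouchWithTimeout {

    // Returns true if touch() completed in time, false if it hung or failed.
    static boolean touchWithTimeout(ViNode node, long timeout, TimeUnit unit) throws InterruptedException {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        try {
            // touch() normally returns quickly once the node is up; if the
            // tunneller has died it can hang, so we bound the wait.
            Future<?> f = exec.submit(node::touch);
            f.get(timeout, unit);
            return true;
        } catch (TimeoutException | ExecutionException e) {
            return false; // caller shuts down the cloud and opens a new one
        } finally {
            exec.shutdownNow();
        }
    }
}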

When killing an "empty" tunneller process on the remote host, we get this exception:

Exception in thread "InboundDemux:tunneller" java.lang.NullPointerException
at org.gridkit.vicluster.telecontrol.bootstraper.TunnellerConnection.shutdown(TunnellerConnection.java:180)
at org.gridkit.vicluster.telecontrol.bootstraper.TunnellerIO$InboundDemux.run(TunnellerIO.java:592)
at org.gridkit.vicluster.telecontrol.bootstraper.TunnellerConnection$1.run(TunnellerConnection.java:82) 

Not sure if this helps, but let me know if you can think of something, or if you have some tricks we can try :)

Thanks, Terje

aragozin commented 7 years ago

Could you capture thread dumps of the tunneller and its owner process?

terjebol commented 7 years ago

I'll try to get into the failing state to capture dumps.

Meanwhile, I think much of the root cause is in ProcessSporeLauncher, line 146:

InetSocketAddress sockAddr = (InetSocketAddress)fget(session.bindAddress);

which sometimes fails. Sometimes it never returns, and sometimes it throws an InterruptedException: null. We work around the case where it never returns by adding an outer timeout, but whenever it times out or is interrupted we need to instantiate a new cloud, and the tasks running on the "old" cloud sometimes finish and sometimes remain hanging. (The same goes for the tunneller.jar process on the remote host.)
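
For completeness, this is roughly the recovery path we take when that timeout fires. It's only an illustrative sketch; the class name is made up and our real node configuration is omitted:

import org.gridkit.nanocloud.Cloud;
import org.gridkit.nanocloud.CloudFactory;

public class CloudRecovery {

    private Cloud cloud = CloudFactory.createCloud();

    // Called when touch() / the bindAddress wait times out or is interrupted.
    void recreateCloud() {
        try {
            cloud.shutdown(); // best effort; the remote tunneller may already be gone
        } catch (RuntimeException e) {
            // ignore: shutting down a broken cloud can itself fail
        }
        cloud = CloudFactory.createCloud();
        // re-apply node configuration (hosts, remote java, etc.) before submitting new tasks
    }
}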

I have noticed that the pattern for TunnellerConnection when starting nodes is BoundCmd -> StartedCmd -> AcceptedCmd, and it ends with an ExitCodeCmd when the node finishes.

However, when our problem occurs, there is no BoundCmd and no StartedCmd; the first command it sends is AcceptedCmd. This seems to be quite consistent whenever the problem above occurs.

Regards, Terje

aragozin commented 7 years ago

I have committed two patches so far (not yet in a released version).

Would you be able to test this version before it is released to Maven Central?

terjebol commented 7 years ago

Thanks Alexey.

We're testing the patches now. It's hard to verify that they work, since the problems we encounter seem nondeterministic, or at least very hard to reproduce. (It's a lot easier to verify when it fails :) ) We'll let it run for a few days, and I'll let you know what our experience is.

Regards, Terje