hierynomus / sshj

ssh, scp and sftp for java
Apache License 2.0

Fatal error detected by Java runtime environment when invoking SocketClient.connect() #616

Open yonatankarimish opened 4 years ago

yonatankarimish commented 4 years ago

Hi

While following your comments on issue #611, I was opening a large number of SSH clients using the following code:

// Static members
private static final Object clientLock = new Object();
private static final DefaultConfig sshConfig = new DefaultConfig();
//This section runs concurrently
try {
    this.sshClient = new SSHClient(sshConfig);
    sshClient.addHostKeyVerifier(new PromiscuousVerifier());
    synchronized (clientLock) {
        sshClient.connect(localhostConfig.getHost(), localhostConfig.getPort()); //this line crashes the jvm
        sshClient.authPassword(localhostConfig.getUsername(), localhostConfig.getPassword());
    }
} catch (UserAuthException e) {
    throw new IOException("Failed to authenticate local SSH connection while creating a new SSH client. Caused by: ", e);
} catch (TransportException e) {
    throw new IOException("Transport error experienced on SSH connection while creating a new SSH client. Caused by: ", e);
} catch (IOException e) {
    throw new IOException("Failed to open local SSH connection while creating a new SSH client. Caused by: ", e);
}

I ran this code in a multi-threaded environment, resulting in a large number of clients open at the same time. Once roughly 2500 clients were open, my JVM crashed unexpectedly without writing anything to my application logs. My system logs pointed to a crash report from the Java process itself. The report is quite long, so I'm attaching the relevant part below.

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f8a99ae643f, pid=5095, tid=6800
#
# JRE version: OpenJDK Runtime Environment (11.0.1+13) (build 11.0.1+13-Ubuntu-3ubuntu116.04ppa1)
# Java VM: OpenJDK 64-Bit Server VM (11.0.1+13-Ubuntu-3ubuntu116.04ppa1, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# C  [libnet.so+0xf43f]
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P" (or dumping to /sixsense/core.5095)
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

---------------  S U M M A R Y ------------

Command Line: -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=0.0.0.0:5005 /sixsense/OperationEngine.jar

Host: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz, 4 cores, 7G, Ubuntu 16.04.5 LTS
Time: Sun Jul  5 15:59:10 2020 IDT elapsed time: 149 seconds (0d 0h 2m 29s)

---------------  T H R E A D  ---------------

Current thread (0x00007f8a4ccbc000):  JavaThread "Monitor(engine-worker-1631)" [_thread_in_native, id=6800, stack(0x00007f89bde7c000,0x00007f89bdf7d000)]

Stack: [0x00007f89bde7c000,0x00007f89bdf7d000],  sp=0x00007f89bdf791d0,  free space=1012k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libnet.so+0xf43f]
C  [libnet.so+0xc6ca]  Java_java_net_PlainSocketImpl_socketConnect+0x1fa
J 8376  java.net.PlainSocketImpl.socketConnect(Ljava/net/InetAddress;II)V java.base@11.0.1 (0 bytes) @ 0x00007f8ab0549b95 [0x00007f8ab0549ac0+0x00000000000000d5]
J 8264 c1 java.net.AbstractPlainSocketImpl.doConnect(Ljava/net/InetAddress;II)V java.base@11.0.1 (168 bytes) @ 0x00007f8aaa17bbac [0x00007f8aaa17b340+0x000000000000086c]
J 8261 c1 java.net.AbstractPlainSocketImpl.connect(Ljava/net/SocketAddress;I)V java.base@11.0.1 (118 bytes) @ 0x00007f8aaa1796e4 [0x00007f8aaa178be0+0x0000000000000b04]
J 8141 c1 java.net.SocksSocketImpl.connect(Ljava/net/SocketAddress;I)V java.base@11.0.1 (1542 bytes) @ 0x00007f8aaa12a1cc [0x00007f8aaa11e8e0+0x000000000000b8ec]
J 8000 c1 java.net.Socket.connect(Ljava/net/SocketAddress;I)V java.base@11.0.1 (248 bytes) @ 0x00007f8aaa0a4d5c [0x00007f8aaa0a36a0+0x00000000000016bc]
J 7743 c1 net.schmizz.sshj.SocketClient.connect(Ljava/lang/String;I)V (57 bytes) @ 0x00007f8aa9f60cac [0x00007f8aa9f608a0+0x000000000000040c]
j  com.sixsense.io.ShellChannel.<init>(Ljava/lang/String;Lcom/sixsense/config/HostConfig$Host;Lcom/sixsense/io/Session;)V+71

Any idea what I'm doing wrong, or what could be causing this?

yonatankarimish commented 4 years ago

When I tested the Socket.connect() method directly (without invoking it through your library), the JVM managed to maintain 65k+ open sockets, which is the maximum number of open files I allocated to the JVM:

// Static members
private static final SocketFactory socketFactory = SocketFactory.getDefault();
//This section runs concurrently
try {
    Socket socket = socketFactory.createSocket();
    InetSocketAddress address = new InetSocketAddress(sessionEngine.localhostConfig.getHost(), sessionEngine.localhostConfig.getPort());
    socket.connect(address, connectTimeout);
    Thread.sleep(preCloseWait);
    socket.close();
} catch (Exception e) {
    //whatever...
}
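
For readers who want to reproduce this raw-socket test without the surrounding application, here is a minimal, self-contained sketch. It is an assumption-laden stand-in, not the code that was actually run: the class name RawSocketStressSketch, the loopback ServerSocket target, the thread-pool size, and the parameter names are all hypothetical, and the connection counts are far smaller than the 65k mentioned above.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import javax.net.SocketFactory;

// Hypothetical harness: connects plain sockets to a throwaway loopback server.
final class RawSocketStressSketch {

    /** Opens `connections` sockets from `threads` workers; returns how many connected. */
    static int run(int connections, int threads, int connectTimeoutMs, long preCloseWaitMs)
            throws Exception {
        SocketFactory socketFactory = SocketFactory.getDefault();
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Port 0 lets the OS pick a free port; the backlog value is only a hint.
        try (ServerSocket server = new ServerSocket(0, connections, loopback)) {
            InetSocketAddress address = new InetSocketAddress(loopback, server.getLocalPort());

            // Accept and immediately close server-side sockets so the backlog drains.
            Thread acceptor = new Thread(() -> {
                while (!server.isClosed()) {
                    try (Socket accepted = server.accept()) {
                        // Nothing to do; try-with-resources closes the accepted socket.
                    } catch (IOException e) {
                        // Server closed or accept failed; the loop condition decides.
                    }
                }
            });
            acceptor.setDaemon(true);
            acceptor.start();

            ExecutorService pool = Executors.newFixedThreadPool(threads);
            List<Future<Boolean>> results = new ArrayList<>();
            for (int i = 0; i < connections; i++) {
                results.add(pool.submit(() -> {
                    try (Socket socket = socketFactory.createSocket()) {
                        socket.connect(address, connectTimeoutMs);
                        Thread.sleep(preCloseWaitMs); // hold the socket open briefly
                        return true;
                    } catch (IOException e) {
                        return false;
                    }
                }));
            }

            int connected = 0;
            for (Future<Boolean> result : results) {
                if (result.get()) {
                    connected++;
                }
            }
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.SECONDS);
            return connected;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("connected: " + run(200, 16, 1000, 10));
    }
}
```

Raising the arguments to run() (and the process's open-file limit) would bring the sketch closer to the scale described above. Note that it connects to a local listener rather than an SSH server, so it exercises only the socket layer.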

hierynomus commented 4 years ago

What is causing the core dump? Keep in mind that the lib does a lot more than "just opening a socket".

yonatankarimish commented 4 years ago

I didn't have core dumps enabled. But according to the logs I originally attached, invoking sshClient.connect() ultimately invoked socket.connect(), which caused a segmentation fault that crashed the JVM.

The culprit class is SocketClient, in which you first declare a socket factory [line 38]:

private SocketFactory socketFactory = SocketFactory.getDefault();

and the code that triggers the crash is the connect() method [line 126]:

public void connect(String hostname, int port) throws IOException {
    if (hostname == null) {
        connect(InetAddress.getByName(null), port);
    } else {
        this.hostname = hostname;
        socket = socketFactory.createSocket();
        socket.connect(new InetSocketAddress(hostname, port), connectTimeout);
        onConnect();
    }
}

Since I don't pass a null hostname, and since your connectTimeout = 0 (never time out the connection attempt), this can be rewritten as:

public void connect(String hostname, int port) throws IOException {
    socket = socketFactory.createSocket();
    socket.connect(new InetSocketAddress(hostname, port), 0);
    onConnect(); 
}

making the only difference between my socket opening test and your method the onConnect() callback after the socket.connect() call.

I don't know what could be the cause of the segmentation fault. The onConnect() callback from a previous invocation? The sshClient? Thread-safety issues? To be honest, communication protocols are not my area of expertise, so I can only guess what could be causing the problem.
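
One way to separate the socket path from the onConnect() callback is a small stand-in class that mirrors the simplified connect() above with a no-op hook. The sketch below is hypothetical: MiniSocketClient and MiniSocketClientDemo are illustrative names, the class is not sshj's SocketClient, and the loopback ServerSocket is only a convenient target. It may also be worth noting that in the stack trace posted earlier, the native frames sit underneath Socket.connect(), so for that particular invocation onConnect() would not yet have been reached.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import javax.net.SocketFactory;

/** Hypothetical stand-in for the simplified connect() above; NOT sshj's SocketClient. */
class MiniSocketClient implements AutoCloseable {
    private final SocketFactory socketFactory = SocketFactory.getDefault();
    // Socket.connect interprets a timeout of 0 as an infinite timeout.
    private final int connectTimeout = 0;
    private Socket socket;
    int onConnectCalls = 0;

    public void connect(String hostname, int port) throws IOException {
        socket = socketFactory.createSocket();
        socket.connect(new InetSocketAddress(hostname, port), connectTimeout);
        onConnect(); // runs only after the TCP connection is established
    }

    /** Hook invoked after a successful connect; here it only counts invocations. */
    protected void onConnect() throws IOException {
        onConnectCalls++;
    }

    public boolean isConnected() {
        return socket != null && socket.isConnected() && !socket.isClosed();
    }

    @Override
    public void close() throws IOException {
        if (socket != null) {
            socket.close();
        }
    }
}

final class MiniSocketClientDemo {
    /** Connects once to a throwaway loopback listener and checks that the hook ran. */
    static boolean connectOnce() throws IOException {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             MiniSocketClient client = new MiniSocketClient()) {
            client.connect(server.getInetAddress().getHostAddress(), server.getLocalPort());
            return client.isConnected() && client.onConnectCalls == 1;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("connected and hook ran: " + connectOnce());
    }
}
```

Running many MiniSocketClient instances concurrently, first with the no-op hook and then with progressively more work inside onConnect(), would be one way to check whether the callback contributes to the problem at all.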

yonatankarimish commented 4 years ago

@hierynomus Checking again after 4 months... Any chance you can help me out with this issue?

hierynomus commented 4 years ago

Core dumps or segfaults of this kind do not originate in the library; see the dump you posted. The crash is deep in C/native code. I cannot help here.