hierynomus / sshj

ssh, scp and sftp for java
Apache License 2.0
2.46k stars 594 forks source link

False-alarm timeout exception when waiting for key exchange to complete #909

Open rasantel opened 7 months ago

rasantel commented 7 months ago

Occasionally, my sshj client hits a timeout exception almost as soon as the main thread starts waiting for the key exchange to complete, even though the timeout has not been reached.

Based on the logs and code, this can happen during the following sequence of events between the main thread and the reader thread:

[2023-11-06T21:59:52,561][DEBUG][net.schmizz.sshj.transport.KeyExchanger][sshj-Reader-localhost/127.0.0.1:22-1699307992560][KeyExchanger.handle] Received SSH_MSG_KEXINIT
[2023-11-06T21:59:52,561][DEBUG][net.schmizz.sshj.transport.KeyExchanger][sshj-Reader-localhost/127.0.0.1:22-1699307992560][KeyExchanger.startKex] Initiating key exchange
[2023-11-06T21:59:52,562][DEBUG][net.schmizz.concurrent.Promise][Test worker][Promise.tryRetrieve] Awaiting <<kex done>>
[2023-11-06T21:59:52,562][DEBUG][net.schmizz.concurrent.Promise][sshj-Reader-localhost/127.0.0.1:22-1699307992560][Promise.deliver] Setting <<kex done>> to `null`

This is a race condition. What happens under the covers is:

  1. The key exchange is initiated by the reader thread when it receives SSH_MSG_KEXINIT from the server before the main thread has a chance to initiate that exchange.
  2. Before the reader thread calls done.clear() in KeyExchanger.startKex, the main thread also calls KeyExchanger.startKex, skips the key exchange initiation (because it's already ongoing) and calls done.await with a timeout to wait for the exchange to complete.
  3. done.await calls Promise.retrieve which calls Promise.tryRetrieve, which waits for the associated condition via cond.await.
  4. The reader thread calls done.clear(), which delivers a value of null to the done promise.
  5. This causes cond.await to wake up the main thread, and Promise.tryRetrieve returns this null value.
  6. Promise.retrieve throws a timeout exception because the retrieved value is null, even though there wasn't a timeout.
rasantel commented 7 months ago

Please disregard the mention above from PR https://github.com/hierynomus/sshj/pull/911. I referenced the wrong issue there (I meant issue #910, not #909).

By the way, I'm looking into the code to see if I can propose a fix and raise a PR for #909.