Open stevana opened 4 years ago
Jsch has been nothing but flaky, and I've been working around these bugs for seven-odd years. There is already a retry built in to jepsen.control--you should see log lines about retrying SSH failures prior to this point. Unfortunately, sometimes we get enough errors in a row that the retry gives up--and it's not clear to me that we want to retry indefinitely--it could cause Jepsen to lock up forever.
The best way to fix this, IMO, is to either find and patch the underlying bug, or to replace Jsch with another SSH library.
So I guess the reason it's not retried is because it's not in the list of .getMessages
in the or
below?
(defn ssh*
"Evaluates an SSH action against the current host. Retries packet corrupt
errors."
[action]
(with-retry [tries *retries*]
(when (nil? *session*)
(throw+ (merge {:type ::no-session-available
:message "Unable to perform a control action because no session for this host is available."}
(debug-data))))
(rc/with-conn [s *session*]
(assoc (execute! s action)
:host *host*
:action action))
(catch com.jcraft.jsch.JSchException e
(if (and (pos? tries)
(or (= "session is down" (.getMessage e))
(= "Packet corrupt" (.getMessage e))))
(do (Thread/sleep (+ 1000 (rand-int 1000)))
(retry (dec tries)))
(throw+ (merge {:type ::ssh-failed}
(debug-data)))))))
Could I just add (= "java.lang.NullPointerException" (.getMessage e))
to the or
? I'm guessing that's the message part given that the exception is com.jcraft.jsch.JSchException: java.lang.NullPointerException
?
That's a possibility, but that raises the question... is this an error that should be retried? Usually we get packet corrupt or session is down; an NPE could indicate any number of things. We might hit that, for instance, if a user provides a nil value in an SSH map, so... maybe early throws are worthwhile? I dunno, worth investigating.
Agreed, this will probably mask an underlying bug. This exception is only thrown sporadically though, maybe 1/50 tests.
if a user provides a nil value in an SSH map
I'm not sure what that means, or why that would happen 1 in 50 times with the same test.
Ah! Like, imagine a user runs a test like {:ssh {:username nil}}
, or maybe called a jepsen.control function without an SSH session bound--those could also cause NPEs. If this is sporadic though, it sounds like it's probably a bug in JSch.
I see!
Given that they've not even responded to your ticket from 2016 about "packet corrupt", do you think it makes sense to mask this bug as well? :-)
Yes, because I know that one's a transient error. NullPointerException, though, could mean lots of things... in this case, it looks like a transient problem, but in other cases, it could be a misconfigured test. Adding a retry isn't... the end of the world, but it might make debugging a bit more difficult. I don't have a strong feeling here, but I do want to encourage exploring alternate SSH options or patching JSch. :)
I'm trying out the solution of catching of the NullPointerException locally now, will run for a couple of days and see how it goes.
I do want to encourage exploring alternate SSH options or patching JSch
Agreed, these are better solutions, but also a lot more time consuming...
Yeah, there's absolutely nothing stopping you from catching this in your DB setup/teardown too--I do this pretty frequently.
I'm happy catching this error locally in my teardown, and I've come around and agree with you that Jepsen shouldn't mask underlying problem. Feel free to close this issue.
I've started sporadically seeing
com.jcraft.jsch.JSchException: java.lang.NullPointerException
getting thrown during teardown:Is that something we should catch and retry on? If so, where exactly would it be best to do so?