jepsen-io / jepsen

A framework for distributed systems verification, with fault injection
6.69k stars 710 forks source link

`com.jcraft.jsch.JSchException: java.lang.NullPointerException` during teardown #457

Open stevana opened 4 years ago

stevana commented 4 years ago

I've started sporadically seeing com.jcraft.jsch.JSchException: java.lang.NullPointerException getting thrown during teardown:

...
INFO [2020-04-29 07:35:20,487] jepsen test runner - jepsen.core Snarfing log files
INFO [2020-04-29 07:35:20,608] jepsen node n1 - jepsen.core downloading /opt/db/stdout.log to stdout.log
INFO [2020-04-29 07:35:20,631] jepsen node n2 - jepsen.core downloading /opt/db/stdout.log to stdout.log
INFO [2020-04-29 07:35:20,631] jepsen node n3 - jepsen.core downloading /opt/db/stdout.log to stdout.log
INFO [2020-04-29 07:35:20,640] jepsen node n4 - jepsen.core downloading /opt/db/stdout.log to stdout.log
INFO [2020-04-29 07:35:20,651] jepsen node n5 - jepsen.core downloading /opt/db/stdout.log to stdout.log
INFO [2020-04-29 07:35:22,812] jepsen node n2 - jepsen.control.util Stopping /tmp/db.pid
INFO [2020-04-29 07:35:22,855] jepsen node n5 - jepsen.control.util Stopping /tmp/db.pid
INFO [2020-04-29 07:35:22,860] jepsen node n1 - jepsen.control.util Stopping /tmp/db.pid
INFO [2020-04-29 07:35:22,874] jepsen node n3 - jepsen.control.util Stopping /tmp/db.pid
INFO [2020-04-29 07:35:22,914] jepsen node n4 - jepsen.control.util Stopping /tmp/db.pid
WARN [2020-04-29 07:35:22,944] jepsen node n5 - jepsen.control Encountered error with conn [:control "n5"]; reopening
com.jcraft.jsch.JSchException: java.lang.NullPointerException
    at com.jcraft.jsch.Channel.connect(Channel.java:159) ~[jsch-0.1.53.jar:na]
    at com.jcraft.jsch.Channel.connect(Channel.java:145) ~[jsch-0.1.53.jar:na]
    at clj_ssh.ssh$connect_channel.invokeStatic(ssh.clj:512) ~[jepsen-0.1.18.jar:na]
    at clj_ssh.ssh$connect_channel.invoke(ssh.clj:509) ~[jepsen-0.1.18.jar:na]
    at clj_ssh.ssh$ssh_exec_proc.invokeStatic(ssh.clj:667) ~[jepsen-0.1.18.jar:na]
    at clj_ssh.ssh$ssh_exec_proc.invoke(ssh.clj:646) ~[jepsen-0.1.18.jar:na]
    at clj_ssh.ssh$ssh_exec.invokeStatic(ssh.clj:677) ~[jepsen-0.1.18.jar:na]
    at clj_ssh.ssh$ssh_exec.invoke(ssh.clj:670) ~[jepsen-0.1.18.jar:na]
    at clj_ssh.ssh$ssh.invokeStatic(ssh.clj:723) ~[jepsen-0.1.18.jar:na]
    at clj_ssh.ssh$ssh.invoke(ssh.clj:699) ~[jepsen-0.1.18.jar:na]
    at jepsen.control.SSHRemote.execute_BANG_(control.clj:351) ~[jepsen-0.1.18.jar:na]
    at jepsen.control$ssh_STAR_$fn__2926.invoke(control.clj:172) [jepsen-0.1.18.jar:na]
    at jepsen.control$ssh_STAR_.invokeStatic(control.clj:172) [jepsen-0.1.18.jar:na]
    at jepsen.control$ssh_STAR_.invoke(control.clj:168) [jepsen-0.1.18.jar:na]
    at jepsen.control$exec_STAR_.invokeStatic(control.clj:194) [jepsen-0.1.18.jar:na]
    at jepsen.control$exec_STAR_.doInvoke(control.clj:191) [jepsen-0.1.18.jar:na]
    at clojure.lang.RestFn.applyTo(RestFn.java:137) [clojure-1.10.1.jar:na]
    at clojure.core$apply.invokeStatic(core.clj:665) [clojure-1.10.1.jar:na]
    at clojure.core$apply.invoke(core.clj:660) [clojure-1.10.1.jar:na]
    at jepsen.control$exec.invokeStatic(control.clj:210) [jepsen-0.1.18.jar:na]
    at jepsen.control$exec.doInvoke(control.clj:204) [jepsen-0.1.18.jar:na]
    at clojure.lang.RestFn.invoke(RestFn.java:421) [clojure-1.10.1.jar:na]
    at jepsen.control.util$stop_daemon_BANG_.invokeStatic(util.clj:298) [jepsen-0.1.18.jar:na]
    at jepsen.control.util$stop_daemon_BANG_.invoke(util.clj:292) [jepsen-0.1.18.jar:na]
    at db.db$teardown.invokeStatic(db.clj:148) [na:na]
    at db.db$teardown.invoke(db.clj:145) [na:na]
    at db.db$db$reify__14114.teardown_BANG_(db.clj:205) [na:na]
    at jepsen.db$fn__3142$G__3138__3146.invoke(db.clj:10) [jepsen-0.1.18.jar:na]
    at jepsen.db$fn__3142$G__3137__3151.invoke(db.clj:10) [jepsen-0.1.18.jar:na]
    at clojure.core$partial$fn__5839.invoke(core.clj:2625) [clojure-1.10.1.jar:na]
    at jepsen.control$on_nodes$fn__3036.invoke(control.clj:450) [jepsen-0.1.18.jar:na]
    at clojure.lang.AFn.applyToHelper(AFn.java:154) [clojure-1.10.1.jar:na]
    at clojure.lang.AFn.applyTo(AFn.java:144) [clojure-1.10.1.jar:na]
    at clojure.core$apply.invokeStatic(core.clj:665) [clojure-1.10.1.jar:na]
    at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1973) [clojure-1.10.1.jar:na]
    at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1973) [clojure-1.10.1.jar:na]
    at clojure.lang.RestFn.applyTo(RestFn.java:142) [clojure-1.10.1.jar:na]
    at clojure.core$apply.invokeStatic(core.clj:669) [clojure-1.10.1.jar:na]
    at clojure.core$bound_fn_STAR_$fn__5749.doInvoke(core.clj:2003) [clojure-1.10.1.jar:na]
    at clojure.lang.RestFn.invoke(RestFn.java:408) [clojure-1.10.1.jar:na]
    at dom_top.core$real_pmap_helper$build_thread__214$fn__215.invoke(core.clj:146) [jepsen-0.1.18.jar:na]
    at clojure.lang.AFn.applyToHelper(AFn.java:152) [clojure-1.10.1.jar:na]
    at clojure.lang.AFn.applyTo(AFn.java:144) [clojure-1.10.1.jar:na]
    at clojure.core$apply.invokeStatic(core.clj:665) [clojure-1.10.1.jar:na]
    at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1973) [clojure-1.10.1.jar:na]
    at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1973) [clojure-1.10.1.jar:na]
    at clojure.lang.RestFn.invoke(RestFn.java:425) [clojure-1.10.1.jar:na]
    at clojure.lang.AFn.applyToHelper(AFn.java:156) [clojure-1.10.1.jar:na]
    at clojure.lang.RestFn.applyTo(RestFn.java:132) [clojure-1.10.1.jar:na]
    at clojure.core$apply.invokeStatic(core.clj:669) [clojure-1.10.1.jar:na]
    at clojure.core$bound_fn_STAR_$fn__5749.doInvoke(core.clj:2003) [clojure-1.10.1.jar:na]
    at clojure.lang.RestFn.invoke(RestFn.java:397) [clojure-1.10.1.jar:na]
    at clojure.lang.AFn.run(AFn.java:22) [clojure-1.10.1.jar:na]
    at java.lang.Thread.run(Thread.java:748) [na:1.8.0_212]
Caused by: java.lang.NullPointerException: null
    at com.jcraft.jsch.ChannelExec.start(ChannelExec.java:58) ~[jsch-0.1.53.jar:na]
    at com.jcraft.jsch.Channel.connect(Channel.java:152) ~[jsch-0.1.53.jar:na]
    ... 53 common frames omitted
WARN [2020-04-29 07:35:23,453] main - jepsen.core Test crashed!
clojure.lang.ExceptionInfo: throw+: {:dir "/", :private-key-path nil, :password "root", :username "root", :type :jepsen.control/ssh-failed, :port 22, :strict-host-key-checking false, :host "n5", :sudo "root", :dummy nil, :session {:open #object[jepsen.control$session$fn__3022 0x26eb4d1c "jepsen.control$session$fn__3022@26eb4d1c"], :close #object[jepsen.control$fn__2881$G__2810__2886 0x4f6d9924 "jepsen.control$fn__2881$G__2810__2886@4f6d9924"], :log? true, :name [:control "n5"], :lock #object[java.util.concurrent.locks.ReentrantReadWriteLock 0x3feff1d3 "java.util.concurrent.locks.ReentrantReadWriteLock@3feff1d3[Write locks = 0, Read locks = 0]"], :conn #object[clojure.lang.Atom 0x19ee0e1b {:status :ready, :val #jepsen.control.SSHRemote{:session #object[com.jcraft.jsch.Session 0x59cf059c "com.jcraft.jsch.Session@59cf059c"]}}]}}
    at slingshot.support$stack_trace.invoke(support.clj:201) ~[knossos-0.3.6.jar:na]
    at jepsen.control$ssh_STAR_$fn__2926.invoke(control.clj:172) ~[jepsen-0.1.18.jar:na]
    at jepsen.control$ssh_STAR_.invokeStatic(control.clj:172) ~[jepsen-0.1.18.jar:na]
    at jepsen.control$ssh_STAR_.invoke(control.clj:168) ~[jepsen-0.1.18.jar:na]
    at jepsen.control$exec_STAR_.invokeStatic(control.clj:194) ~[jepsen-0.1.18.jar:na]
    at jepsen.control$exec_STAR_.doInvoke(control.clj:191) ~[jepsen-0.1.18.jar:na]
    at clojure.lang.RestFn.applyTo(RestFn.java:137) ~[clojure-1.10.1.jar:na]
    at clojure.core$apply.invokeStatic(core.clj:665) ~[clojure-1.10.1.jar:na]
    at clojure.core$apply.invoke(core.clj:660) ~[clojure-1.10.1.jar:na]
    at jepsen.control$exec.invokeStatic(control.clj:210) ~[jepsen-0.1.18.jar:na]
    at jepsen.control$exec.doInvoke(control.clj:204) ~[jepsen-0.1.18.jar:na]
    at clojure.lang.RestFn.invoke(RestFn.java:421) ~[clojure-1.10.1.jar:na]
    at jepsen.control.util$stop_daemon_BANG_.invokeStatic(util.clj:298) ~[jepsen-0.1.18.jar:na]
    at jepsen.control.util$stop_daemon_BANG_.invoke(util.clj:292) ~[jepsen-0.1.18.jar:na]
    at db.db$teardown.invokeStatic(db.clj:148) ~[na:na]
    at db.db$teardown.invoke(db.clj:145) ~[na:na]
    at db.db$db$reify__14114.teardown_BANG_(db.clj:205) ~[na:na]
    at jepsen.db$fn__3142$G__3138__3146.invoke(db.clj:10) ~[jepsen-0.1.18.jar:na]
    at jepsen.db$fn__3142$G__3137__3151.invoke(db.clj:10) ~[jepsen-0.1.18.jar:na]
    at clojure.core$partial$fn__5839.invoke(core.clj:2625) ~[clojure-1.10.1.jar:na]
    at jepsen.control$on_nodes$fn__3036.invoke(control.clj:450) ~[jepsen-0.1.18.jar:na]
    at clojure.lang.AFn.applyToHelper(AFn.java:154) ~[clojure-1.10.1.jar:na]
    at clojure.lang.AFn.applyTo(AFn.java:144) ~[clojure-1.10.1.jar:na]
    at clojure.core$apply.invokeStatic(core.clj:665) ~[clojure-1.10.1.jar:na]
    at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1973) ~[clojure-1.10.1.jar:na]
    at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1973) ~[clojure-1.10.1.jar:na]
    at clojure.lang.RestFn.applyTo(RestFn.java:142) ~[clojure-1.10.1.jar:na]
    at clojure.core$apply.invokeStatic(core.clj:669) ~[clojure-1.10.1.jar:na]
    at clojure.core$bound_fn_STAR_$fn__5749.doInvoke(core.clj:2003) ~[clojure-1.10.1.jar:na]
    at clojure.lang.RestFn.invoke(RestFn.java:408) ~[clojure-1.10.1.jar:na]
    at dom_top.core$real_pmap_helper$build_thread__214$fn__215.invoke(core.clj:146) ~[jepsen-0.1.18.jar:na]
    at clojure.lang.AFn.applyToHelper(AFn.java:152) ~[clojure-1.10.1.jar:na]
    at clojure.lang.AFn.applyTo(AFn.java:144) ~[clojure-1.10.1.jar:na]
    at clojure.core$apply.invokeStatic(core.clj:665) ~[clojure-1.10.1.jar:na]
    at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1973) ~[clojure-1.10.1.jar:na]
    at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1973) ~[clojure-1.10.1.jar:na]
    at clojure.lang.RestFn.invoke(RestFn.java:425) ~[clojure-1.10.1.jar:na]
    at clojure.lang.AFn.applyToHelper(AFn.java:156) ~[clojure-1.10.1.jar:na]
    at clojure.lang.RestFn.applyTo(RestFn.java:132) ~[clojure-1.10.1.jar:na]
    at clojure.core$apply.invokeStatic(core.clj:669) ~[clojure-1.10.1.jar:na]
    at clojure.core$bound_fn_STAR_$fn__5749.doInvoke(core.clj:2003) ~[clojure-1.10.1.jar:na]
    at clojure.lang.RestFn.invoke(RestFn.java:397) ~[clojure-1.10.1.jar:na]
    at clojure.lang.AFn.run(AFn.java:22) ~[clojure-1.10.1.jar:na]
    at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_212]
ERROR [2020-04-29 07:35:23,461] main - jepsen.cli Oh jeez, I'm sorry, Jepsen broke. Here's why:
clojure.lang.ExceptionInfo: throw+: {:dir "/", :private-key-path nil, :password "root", :username "root", :type :jepsen.control/ssh-failed, :port 22, :strict-host-key-checking false, :host "n5", :sudo "root", :dummy nil, :session {:open #object[jepsen.control$session$fn__3022 0x26eb4d1c "jepsen.control$session$fn__3022@26eb4d1c"], :close #object[jepsen.control$fn__2881$G__2810__2886 0x4f6d9924 "jepsen.control$fn__2881$G__2810__2886@4f6d9924"], :log? true, :name [:control "n5"], :lock #object[java.util.concurrent.locks.ReentrantReadWriteLock 0x3feff1d3 "java.util.concurrent.locks.ReentrantReadWriteLock@3feff1d3[Write locks = 0, Read locks = 0]"], :conn #object[clojure.lang.Atom 0x19ee0e1b {:status :ready, :val #jepsen.control.SSHRemote{:session #object[com.jcraft.jsch.Session 0x59cf059c "com.jcraft.jsch.Session@59cf059c"]}}]}}
    at slingshot.support$stack_trace.invoke(support.clj:201) ~[knossos-0.3.6.jar:na]
    at jepsen.control$ssh_STAR_$fn__2926.invoke(control.clj:172) ~[jepsen-0.1.18.jar:na]
    at jepsen.control$ssh_STAR_.invokeStatic(control.clj:172) ~[jepsen-0.1.18.jar:na]
    at jepsen.control$ssh_STAR_.invoke(control.clj:168) ~[jepsen-0.1.18.jar:na]
    at jepsen.control$exec_STAR_.invokeStatic(control.clj:194) ~[jepsen-0.1.18.jar:na]
    at jepsen.control$exec_STAR_.doInvoke(control.clj:191) ~[jepsen-0.1.18.jar:na]
    at clojure.lang.RestFn.applyTo(RestFn.java:137) ~[clojure-1.10.1.jar:na]
    at clojure.core$apply.invokeStatic(core.clj:665) ~[clojure-1.10.1.jar:na]
    at clojure.core$apply.invoke(core.clj:660) ~[clojure-1.10.1.jar:na]
    at jepsen.control$exec.invokeStatic(control.clj:210) ~[jepsen-0.1.18.jar:na]
    at jepsen.control$exec.doInvoke(control.clj:204) ~[jepsen-0.1.18.jar:na]
    at clojure.lang.RestFn.invoke(RestFn.java:421) ~[clojure-1.10.1.jar:na]
    at jepsen.control.util$stop_daemon_BANG_.invokeStatic(util.clj:298) ~[jepsen-0.1.18.jar:na]
    at jepsen.control.util$stop_daemon_BANG_.invoke(util.clj:292) ~[jepsen-0.1.18.jar:na]
    at db.db$teardown.invokeStatic(db.clj:148) ~[na:na]
    at db.db$teardown.invoke(db.clj:145) ~[na:na]
    at db.db$db$reify__14114.teardown_BANG_(db.clj:205) ~[na:na]
    at jepsen.db$fn__3142$G__3138__3146.invoke(db.clj:10) ~[jepsen-0.1.18.jar:na]
    at jepsen.db$fn__3142$G__3137__3151.invoke(db.clj:10) ~[jepsen-0.1.18.jar:na]
    at clojure.core$partial$fn__5839.invoke(core.clj:2625) ~[clojure-1.10.1.jar:na]
    at jepsen.control$on_nodes$fn__3036.invoke(control.clj:450) ~[jepsen-0.1.18.jar:na]
    at clojure.lang.AFn.applyToHelper(AFn.java:154) ~[clojure-1.10.1.jar:na]
    at clojure.lang.AFn.applyTo(AFn.java:144) ~[clojure-1.10.1.jar:na]
    at clojure.core$apply.invokeStatic(core.clj:665) ~[clojure-1.10.1.jar:na]
    at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1973) ~[clojure-1.10.1.jar:na]
    at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1973) ~[clojure-1.10.1.jar:na]
    at clojure.lang.RestFn.applyTo(RestFn.java:142) ~[clojure-1.10.1.jar:na]
    at clojure.core$apply.invokeStatic(core.clj:669) ~[clojure-1.10.1.jar:na]
    at clojure.core$bound_fn_STAR_$fn__5749.doInvoke(core.clj:2003) ~[clojure-1.10.1.jar:na]
    at clojure.lang.RestFn.invoke(RestFn.java:408) ~[clojure-1.10.1.jar:na]
    at dom_top.core$real_pmap_helper$build_thread__214$fn__215.invoke(core.clj:146) ~[jepsen-0.1.18.jar:na]
    at clojure.lang.AFn.applyToHelper(AFn.java:152) ~[clojure-1.10.1.jar:na]
    at clojure.lang.AFn.applyTo(AFn.java:144) ~[clojure-1.10.1.jar:na]
    at clojure.core$apply.invokeStatic(core.clj:665) ~[clojure-1.10.1.jar:na]
    at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1973) ~[clojure-1.10.1.jar:na]
    at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1973) ~[clojure-1.10.1.jar:na]
    at clojure.lang.RestFn.invoke(RestFn.java:425) ~[clojure-1.10.1.jar:na]
    at clojure.lang.AFn.applyToHelper(AFn.java:156) ~[clojure-1.10.1.jar:na]
    at clojure.lang.RestFn.applyTo(RestFn.java:132) ~[clojure-1.10.1.jar:na]
    at clojure.core$apply.invokeStatic(core.clj:669) ~[clojure-1.10.1.jar:na]
    at clojure.core$bound_fn_STAR_$fn__5749.doInvoke(core.clj:2003) ~[clojure-1.10.1.jar:na]
    at clojure.lang.RestFn.invoke(RestFn.java:397) ~[clojure-1.10.1.jar:na]
    at clojure.lang.AFn.run(AFn.java:22) ~[clojure-1.10.1.jar:na]
    at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_212]

Is that something we should catch and retry on? If so, where exactly would it be best to do so?

aphyr commented 4 years ago

Jsch has been nothing but flaky, and I've been working around these bugs for seven-odd years. There is already a retry built in to jepsen.control--you should see log lines about retrying SSH failures prior to this point. Unfortunately, sometimes we get enough errors in a row that the retry gives up--and it's not clear to me that we want to retry indefinitely--it could cause Jepsen to lock up forever.

The best way to fix this, IMO, is to either find and patch the underlying bug, or to replace Jsch with another SSH library.

stevana commented 4 years ago

So I guess the reason it's not retried is because it's not in the list of .getMessages in the or below?

(defn ssh*
  "Evaluates an SSH action against the current host. Retries packet corrupt
  errors."
  [action]
  (with-retry [tries *retries*]
    (when (nil? *session*)
      (throw+ (merge {:type ::no-session-available
                      :message "Unable to perform a control action because no session for this host is available."}
                     (debug-data))))

    (rc/with-conn [s *session*]
      (assoc (execute! s action)
             :host   *host*
             :action action))
    (catch com.jcraft.jsch.JSchException e
      (if (and (pos? tries)
               (or (= "session is down" (.getMessage e))
                   (= "Packet corrupt" (.getMessage e))))
        (do (Thread/sleep (+ 1000 (rand-int 1000)))
            (retry (dec tries)))
        (throw+ (merge {:type ::ssh-failed}
                       (debug-data)))))))

Could I just add (= "java.lang.NullPointerException" (.getMessage e)) to the or? I'm guessing that's the message part given that the exception is com.jcraft.jsch.JSchException: java.lang.NullPointerException?

aphyr commented 4 years ago

That's a possibility, but that raises the question... is this an error that should be retried? Usually we get packet corrupt or session is down; an NPE could indicate any number of things. We might hit that, for instance, if a user provides a nil value in an SSH map, so... maybe early throws are worthwhile? I dunno, worth investigating.

stevana commented 4 years ago

Agreed, this will probably mask an underlying bug. This exception is only thrown sporadically though, maybe 1/50 tests.

if a user provides a nil value in an SSH map

I'm not sure what that means, or why that would happen 1 in 50 times with the same test.

aphyr commented 4 years ago

Ah! Like, imagine a user runs a test like {:ssh {:username nil}}, or maybe called a jepsen.control function without an SSH session bound--those could also cause NPEs. If this is sporadic though, it sounds like it's probably a bug in JSch.

stevana commented 4 years ago

I see!

Given that they've not even responded to your ticket from 2016 about "packet corrupt", do you think it makes sense to mask this bug as well? :-)

aphyr commented 4 years ago

Yes, because I know that one's a transient error. NullPointerException, though, could mean lots of things... in this case, it looks like a transient problem, but in other cases, it could be a misconfigured test. Adding a retry isn't... the end of the world, but it might make debugging a bit more difficult. I don't have a strong feeling here, but I do want to encourage exploring alternate SSH options or patching JSch. :)

stevana commented 4 years ago

I'm trying out the solution of catching of the NullPointerException locally now, will run for a couple of days and see how it goes.

I do want to encourage exploring alternate SSH options or patching JSch

Agreed, these are better solutions, but also a lot more time consuming...

aphyr commented 4 years ago

Yeah, there's absolutely nothing stopping you from catching this in your DB setup/teardown too--I do this pretty frequently.

stevana commented 4 years ago

I'm happy catching this error locally in my teardown, and I've come around and agree with you that Jepsen shouldn't mask underlying problem. Feel free to close this issue.