jepsen-io / jepsen

A framework for distributed systems verification, with fault injection
6.69k stars 710 forks source link

Incomplete tests in Dgraph test suite. #451

Open martinmr opened 4 years ago

martinmr commented 4 years ago

I was trying to run the dgraph tests (using our fork of Jepsen) but there are a lot of incomplete tests. Looking at the errors I see two types:

  1. "sleep interrupted" exceptions.
  2. Zero tries to start but fails because of an "address already in use" error.

I haven't been able to debug any of these two errors since they are in the core jepsen code and not in the Dgraph test suite. There was another error due to alters failing but I fixed it and the fix is included in #450

I tried merging the recent changes from this repo to the fork but that made it even worse. Now I get an error about the X11 display variable causing a library to crash. I tried setting the DISPLAY variable but it doesn't work. The test just gets stuck there. I made sure I am running the tests from a terminal emulator (not screen or tmux) and I still see the same issue.

The last time I successfully ran Jepsen we were using jepsen 0.1.15 but that's not possible anymore as the changes in the Dgraph test suite depend on changes made in later jepsen versions (the fork is currently using jepsen 0.1.17 and this repo is using 0.1.18-SNAPSHOT).

I was using the jepsen tool we have in the dgraph-io/dgraph repo (in the contrib/jepsen folder) and I added a wait in between tests (hoping this was just a timing issue) but that didn't fix it.

I'd appreciate any help as our new release is blocked on passing Jepsen tests.

Steps to reproduce.

  1. Clone dgraph's Jepsen fork (to avoid getting stuck on the X11 display issue). Or use the main repo if you don't see that error. But you'll need the changes in #450 to avoid issues with Alter requests.
  2. Run the Dgraph test suite. You don't need to run each test for long. I ran each test for 15 seconds and I am seeing the issues (also seeing them if I run the tests for longer).
  3. Most tests should pass but some are marked as incomplete.
  4. Look into the logs. You either see the "sleep interrupted" errors or the "address already in use" in the Dgraph logs.
martinmr commented 4 years ago

This is the weird error I see when I tried to merge the recent jepsen changes to our fork.

Exception in thread "main" Syntax error compiling at (rhizome/viz.clj:27:1).
        at clojure.lang.Compiler.load(Compiler.java:7647)
        at clojure.lang.RT.loadResourceScript(RT.java:381)
        at clojure.lang.RT.loadResourceScript(RT.java:372)
        at clojure.lang.RT.load(RT.java:463)
        at clojure.lang.RT.load(RT.java:428)
        at clojure.core$load$fn__6824.invoke(core.clj:6126)
        at clojure.core$load.invokeStatic(core.clj:6125)
        at clojure.core$load.doInvoke(core.clj:6109)
        at clojure.lang.RestFn.invoke(RestFn.java:408)
        at clojure.core$load_one.invokeStatic(core.clj:5908)
        at clojure.core$load_one.invoke(core.clj:5903)
        at clojure.core$load_lib$fn__6765.invoke(core.clj:5948)
        at clojure.core$load_lib.invokeStatic(core.clj:5947)
        at clojure.core$load_lib.doInvoke(core.clj:5928)
        at clojure.lang.RestFn.applyTo(RestFn.java:142)
        at clojure.core$apply.invokeStatic(core.clj:667)
        at clojure.core$load_libs.invokeStatic(core.clj:5989)
        at clojure.core$load_libs.doInvoke(core.clj:5969)
        at clojure.lang.RestFn.applyTo(RestFn.java:137)
        at clojure.core$apply.invokeStatic(core.clj:667)
        at clojure.core$require.invokeStatic(core.clj:6007)
        at clojure.core$require.doInvoke(core.clj:6007)
        at clojure.lang.RestFn.invoke(RestFn.java:457)
        at elle.viz$eval1882$loading__6706__auto____1883.invoke(viz.clj:1)
        at elle.viz$eval1882.invokeStatic(viz.clj:1)
        at elle.viz$eval1882.invoke(viz.clj:1)
        at clojure.lang.Compiler.eval(Compiler.java:7176)
        at clojure.lang.Compiler.eval(Compiler.java:7165)
        at clojure.lang.Compiler.load(Compiler.java:7635)
        at clojure.lang.RT.loadResourceScript(RT.java:381)
        at clojure.lang.RT.loadResourceScript(RT.java:372)
        at clojure.lang.RT.load(RT.java:463)
        at clojure.lang.RT.load(RT.java:428)
        at clojure.core$load$fn__6824.invoke(core.clj:6126)
        at clojure.core$load.invokeStatic(core.clj:6125)
        at clojure.core$load.doInvoke(core.clj:6109)
        at clojure.lang.RestFn.invoke(RestFn.java:408)
        at clojure.core$load_one.invokeStatic(core.clj:5908)
        at clojure.core$load_one.invoke(core.clj:5903)
        at clojure.core$load_lib$fn__6765.invoke(core.clj:5948)
        at clojure.core$load_lib.invokeStatic(core.clj:5947)
        at clojure.core$load_lib.doInvoke(core.clj:5928)
        at clojure.lang.RestFn.applyTo(RestFn.java:142)
        at clojure.core$apply.invokeStatic(core.clj:667)
        at clojure.core$load_libs.invokeStatic(core.clj:5989)
        at clojure.core$load_libs.doInvoke(core.clj:5969)
        at clojure.lang.RestFn.applyTo(RestFn.java:137)
        at clojure.core$apply.invokeStatic(core.clj:667)
        at clojure.core$require.invokeStatic(core.clj:6007)
        at clojure.core$require.doInvoke(core.clj:6007)
        at clojure.lang.RestFn.invoke(RestFn.java:512)
        at elle.txn$eval1845$loading__6706__auto____1846.invoke(txn.clj:1)
        at elle.txn$eval1845.invokeStatic(txn.clj:1)
        at elle.txn$eval1845.invoke(txn.clj:1)
        at clojure.lang.Compiler.eval(Compiler.java:7176)
        at clojure.lang.Compiler.eval(Compiler.java:7165)
        at clojure.lang.Compiler.load(Compiler.java:7635)
        at clojure.lang.RT.loadResourceScript(RT.java:381)
        at clojure.lang.RT.loadResourceScript(RT.java:372)
        at clojure.lang.RT.load(RT.java:463)
        at clojure.lang.RT.load(RT.java:428)
        at clojure.core$load$fn__6824.invoke(core.clj:6126)
        at clojure.core$load.invokeStatic(core.clj:6125)
        at clojure.core$load.doInvoke(core.clj:6109)
        at clojure.lang.RestFn.invoke(RestFn.java:408)
        at clojure.core$load_one.invokeStatic(core.clj:5908)
        at clojure.core$load_one.invoke(core.clj:5903)
        at clojure.core$load_lib$fn__6765.invoke(core.clj:5948)
        at clojure.core$load_lib.invokeStatic(core.clj:5947)
        at clojure.core$load_lib.doInvoke(core.clj:5928)
        at clojure.lang.RestFn.applyTo(RestFn.java:142)
        at clojure.core$apply.invokeStatic(core.clj:667)
        at clojure.core$load_libs.invokeStatic(core.clj:5989)
        at clojure.core$load_libs.doInvoke(core.clj:5969)
        at clojure.lang.RestFn.applyTo(RestFn.java:137)
        at clojure.core$apply.invokeStatic(core.clj:667)
        at clojure.core$require.invokeStatic(core.clj:6007)
        at clojure.core$require.doInvoke(core.clj:6007)
        at clojure.lang.RestFn.invoke(RestFn.java:512)
        at elle.rw_register$eval1837$loading__6706__auto____1838.invoke(rw_register.clj:1)
        at elle.rw_register$eval1837.invokeStatic(rw_register.clj:1)
        at elle.rw_register$eval1837.invoke(rw_register.clj:1)
        at clojure.lang.Compiler.eval(Compiler.java:7176)
        at clojure.lang.Compiler.eval(Compiler.java:7165)
        at clojure.lang.Compiler.load(Compiler.java:7635)
        at clojure.lang.RT.loadResourceScript(RT.java:381)
        at clojure.lang.RT.loadResourceScript(RT.java:372)
        at clojure.lang.RT.load(RT.java:463)
        at clojure.lang.RT.load(RT.java:428)
        at clojure.core$load$fn__6824.invoke(core.clj:6126)
        at clojure.core$load.invokeStatic(core.clj:6125)
        at clojure.core$load.doInvoke(core.clj:6109)
        at clojure.lang.RestFn.invoke(RestFn.java:408)
        at clojure.core$load_one.invokeStatic(core.clj:5908)
        at clojure.core$load_one.invoke(core.clj:5903)
        at clojure.core$load_lib$fn__6765.invoke(core.clj:5948)
        at clojure.core$load_lib.invokeStatic(core.clj:5947)
        at clojure.core$load_lib.doInvoke(core.clj:5928)
        at clojure.lang.RestFn.applyTo(RestFn.java:142)
        at clojure.core$apply.invokeStatic(core.clj:667)
        at clojure.core$load_libs.invokeStatic(core.clj:5989)
        at clojure.core$load_libs.doInvoke(core.clj:5969)
        at clojure.lang.RestFn.applyTo(RestFn.java:137)
        at clojure.core$apply.invokeStatic(core.clj:667)
        at clojure.core$require.invokeStatic(core.clj:6007)
        at clojure.core$require.doInvoke(core.clj:6007)
        at clojure.lang.RestFn.invoke(RestFn.java:421)
        at jepsen.tests.cycle.wr$eval1829$loading__6706__auto____1830.invoke(wr.clj:1)
        at jepsen.tests.cycle.wr$eval1829.invokeStatic(wr.clj:1)
        at jepsen.tests.cycle.wr$eval1829.invoke(wr.clj:1)
        at clojure.lang.Compiler.eval(Compiler.java:7176)
        at clojure.lang.Compiler.eval(Compiler.java:7165)
        at clojure.lang.Compiler.load(Compiler.java:7635)
        at clojure.lang.RT.loadResourceScript(RT.java:381)
        at clojure.lang.RT.loadResourceScript(RT.java:372)
        at clojure.lang.RT.load(RT.java:463)
        at clojure.lang.RT.load(RT.java:428)
        at clojure.core$load$fn__6824.invoke(core.clj:6126)
        at clojure.core$load.invokeStatic(core.clj:6125)
        at clojure.core$load.doInvoke(core.clj:6109)
        at clojure.lang.RestFn.invoke(RestFn.java:408)
        at clojure.core$load_one.invokeStatic(core.clj:5908)
        at clojure.core$load_one.invoke(core.clj:5903)
        at clojure.core$load_lib$fn__6765.invoke(core.clj:5948)
        at clojure.core$load_lib.invokeStatic(core.clj:5947)
        at clojure.core$load_lib.doInvoke(core.clj:5928)
        at clojure.lang.RestFn.applyTo(RestFn.java:142)
        at clojure.core$apply.invokeStatic(core.clj:667)
        at clojure.core$load_libs.invokeStatic(core.clj:5985)
        at clojure.core$load_libs.doInvoke(core.clj:5969)
        at clojure.lang.RestFn.applyTo(RestFn.java:137)
        at clojure.core$apply.invokeStatic(core.clj:667)
        at clojure.core$require.invokeStatic(core.clj:6007)
        at clojure.core$require.doInvoke(core.clj:6007)
        at clojure.lang.RestFn.invoke(RestFn.java:619)
        at jepsen.dgraph.wr$eval1220$loading__6706__auto____1221.invoke(wr.clj:1)
        at jepsen.dgraph.wr$eval1220.invokeStatic(wr.clj:1)
        at jepsen.dgraph.wr$eval1220.invoke(wr.clj:1)
        at clojure.lang.Compiler.eval(Compiler.java:7176)
        at clojure.lang.Compiler.eval(Compiler.java:7165)
        at clojure.lang.Compiler.load(Compiler.java:7635)
        at clojure.lang.RT.loadResourceScript(RT.java:381)
        at clojure.lang.RT.loadResourceScript(RT.java:372)
        at clojure.lang.RT.load(RT.java:463)
        at clojure.lang.RT.load(RT.java:428)
        at clojure.core$load$fn__6824.invoke(core.clj:6126)
        at clojure.core$load.invokeStatic(core.clj:6125)
        at clojure.core$load.doInvoke(core.clj:6109)
        at clojure.lang.RestFn.invoke(RestFn.java:408)
        at clojure.core$load_one.invokeStatic(core.clj:5908)
        at clojure.core$load_one.invoke(core.clj:5903)
        at clojure.core$load_lib$fn__6765.invoke(core.clj:5948)
        at clojure.core$load_lib.invokeStatic(core.clj:5947)
        at clojure.core$load_lib.doInvoke(core.clj:5928)
        at clojure.lang.RestFn.applyTo(RestFn.java:142)
        at clojure.core$apply.invokeStatic(core.clj:667)
        at clojure.core$load_libs.invokeStatic(core.clj:5989)
        at clojure.core$load_libs.doInvoke(core.clj:5969)
        at clojure.lang.RestFn.applyTo(RestFn.java:137)
        at clojure.core$apply.invokeStatic(core.clj:667)
        at clojure.core$require.invokeStatic(core.clj:6007)
        at clojure.core$require.doInvoke(core.clj:6007)
        at clojure.lang.RestFn.invoke(RestFn.java:703)
        at jepsen.dgraph.core$eval157$loading__6706__auto____158.invoke(core.clj:1)
        at jepsen.dgraph.core$eval157.invokeStatic(core.clj:1)
        at jepsen.dgraph.core$eval157.invoke(core.clj:1)
        at clojure.lang.Compiler.eval(Compiler.java:7176)
        at clojure.lang.Compiler.eval(Compiler.java:7165)
        at clojure.lang.Compiler.load(Compiler.java:7635)
        at clojure.lang.RT.loadResourceScript(RT.java:381)
        at clojure.lang.RT.loadResourceScript(RT.java:372)
        at clojure.lang.RT.load(RT.java:463)
        at clojure.lang.RT.load(RT.java:428)
        at clojure.core$load$fn__6824.invoke(core.clj:6126)
        at clojure.core$load.invokeStatic(core.clj:6125)
        at clojure.core$load.doInvoke(core.clj:6109)
        at clojure.lang.RestFn.invoke(RestFn.java:408)
        at clojure.core$load_one.invokeStatic(core.clj:5908)
        at clojure.core$load_one.invoke(core.clj:5903)
        at clojure.core$load_lib$fn__6765.invoke(core.clj:5948)
        at clojure.core$load_lib.invokeStatic(core.clj:5947)
        at clojure.core$load_lib.doInvoke(core.clj:5928)
        at clojure.lang.RestFn.applyTo(RestFn.java:142)
        at clojure.core$apply.invokeStatic(core.clj:667)
        at clojure.core$load_libs.invokeStatic(core.clj:5985)
        at clojure.core$load_libs.doInvoke(core.clj:5969)
        at clojure.lang.RestFn.applyTo(RestFn.java:137)
        at clojure.core$apply.invokeStatic(core.clj:667)
        at clojure.core$require.invokeStatic(core.clj:6007)
        at clojure.core$require.doInvoke(core.clj:6007)
        at clojure.lang.RestFn.invoke(RestFn.java:408)
        at user$eval140$fn__144.invoke(form-init6275510242627870398.clj:1)
        at user$eval140.invokeStatic(form-init6275510242627870398.clj:1)
        at user$eval140.invoke(form-init6275510242627870398.clj:1)
        at clojure.lang.Compiler.eval(Compiler.java:7176)
        at clojure.lang.Compiler.eval(Compiler.java:7166)
        at clojure.lang.Compiler.load(Compiler.java:7635)
        at clojure.lang.Compiler.loadFile(Compiler.java:7573)
        at clojure.main$load_script.invokeStatic(main.clj:452)
        at clojure.main$init_opt.invokeStatic(main.clj:454)
        at clojure.main$init_opt.invoke(main.clj:454)
        at clojure.main$initialize.invokeStatic(main.clj:485)
        at clojure.main$null_opt.invokeStatic(main.clj:519)
        at clojure.main$null_opt.invoke(main.clj:516)
        at clojure.main$main.invokeStatic(main.clj:598)
        at clojure.main$main.doInvoke(main.clj:561)
        at clojure.lang.RestFn.applyTo(RestFn.java:137)
        at clojure.lang.Var.applyTo(Var.java:705)
        at clojure.main.main(main.java:37)
Caused by: java.awt.HeadlessException:
No X11 DISPLAY variable was set, but this program performed an operation which requires it.
        at sun.awt.HeadlessToolkit.getMenuShortcutKeyMask(HeadlessToolkit.java:236)
        at rhizome.viz$eval2023.invokeStatic(viz.clj:30)
        at rhizome.viz$eval2023.invoke(viz.clj:27)
        at clojure.lang.Compiler.eval(Compiler.java:7176)
        at clojure.lang.Compiler.load(Compiler.java:7635)
        ... 208 more
martinmr commented 4 years ago

The failures we are seeing are not deterministic so we'll try to get the incomplete tests to pass in order to unblock us. But it would still be great to be able to run the full tests suite without running into these issues.

aphyr commented 4 years ago
  1. "sleep interrupted" exceptions.

This might be a bug in Jepsen's generator system, which I've been working on rewriting. Or it could be that something else happened in the test (e.g. a timeout), and it interrupted other threads. Reading through the stacktraces in jepsen.log (there might be multiple) could help.

  1. Zero tries to start but fails because of an "address already in use" error.

This sounds like an automation issue? Like maybe Dgraph isn't being killed properly?

I haven't been able to debug any of these two errors since they are in the core jepsen code and not in the Dgraph test suite.

The first could be either; the second is most likely in the Dgraph test suite.

Now I get an error about the X11 display variable causing a library to crash. I tried setting the DISPLAY variable but it doesn't work

I'm guessing you're running headless? Try adding -Djava.awt.headless=true to the jvm-opts in project.clj: https://stackoverflow.com/questions/21343529/all-my-java-applications-now-throw-a-java-awt-headlessexception

martinmr commented 4 years ago

Thanks for the quick response.

I see the "sleep interrupted" near the beginning of the test. I forgot to copy the stack trace because I saw it on my remote workstation but the stack trace didn't contain any reference to the code in the dgraph test suite so it's probably the same bug that you mentioned. The test did not even have a chance to start.

The other failure is probably in the Dgraph test suite but I haven't been able to debug it. There have been some recent changes to the test suite itself and I think the error was introduce by them because we have not seen this error before. It works most of the time but there are times that the error happens so something about how jepsen creates and destroys the cluster changed.

I'll try changing the project.clj file to see if I can get the test suite running with the latest changes.

martinmr commented 4 years ago

@aphyr, hi. I haven't been able to get past the headless error. I tried setting the option in the Dgraph project.clj file as well as in the jepsen project.clj file but it still doesn't work. I don't know what jepsen is trying to display but there should be a way to disable it. Thanks.

aphyr commented 4 years ago

I don't think I can patch it at the Jepsen level, cuz it's happening at compile time, but the project.clj fix is a one-liner, at least on every env I've used. Really, I need to go into ztellman/rhizome and patch this so it's not a compile-time thing.

:jvm-opts ["-Djava.awt.headless=true"]

martinmr commented 4 years ago

That is the exact line I added to the project.clj file but it didn't had any effect. Adding it to the jepsen project.clj file didn't have any effect either. I think you are right; since this happens at compile time setting the flag probably does nothing.

aphyr commented 4 years ago

No, that's not it; compilation happens when you call lein run, so this JVM setting should take effect then. It's possible you're having some kind of different issue--one that I haven't encountered before...