jepsen-io / jepsen

A framework for distributed systems verification, with fault injection

not seeing streaming histories ~~during test run, or~~ on ^C #559

Closed. nurturenature closed this issue 1 year ago.

nurturenature commented 1 year ago

When using 0.2.8-SNAPSHOT from clojars:

rm -rf ~/.m2/repository/jepsen
lein deps
...
jepsen.core: Test version 0fa9119f18b99b721010f1fa3b2ab83a4903f329 (plus uncommitted changes)

Or installing locally from main:

git status
On branch main
Your branch is up to date with 'origin/main'.

lein install

jepsen.core: Test version 0fa9119f18b99b721010f1fa3b2ab83a4903f329 (plus uncommitted changes)

Histories are not being streamed during the test run:

ls -al store/current/
jepsen.log
test.jepsen

^C does snarf log files, but no histories:

^C
Jepsen shutdown hook - jepsen.core Downloading DB logs before JVM shutdown...
Jepsen shutdown hook - jepsen.core Snarfing log files

ls -al store/current/
jepsen.log
n1
n2
n3
test.jepsen

test.jepsen is readable and shows :history nil:

=> (jepsen.store/test -1)
{:history nil, ...}

test.jepsen has the magic bytes JEPSEN0001.

Any guidance?

aphyr commented 1 year ago

When you say "during the test run", do you mean you're looking at the history while the test is still running? It only checkpoints every 16384 operations in the history, so if you haven't gotten that far the history will claim to be empty.
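
As a rough illustration of the checkpointing behavior described above, here is a minimal sketch. It assumes a simplified model in which only complete 16384-operation blocks are readable while the test is still running; `checkpointed-ops` is a hypothetical helper for the arithmetic, not part of Jepsen.

```clojure
;; Assumption: only complete blocks are visible on disk mid-run.
;; The block size is the figure quoted in the comment above.
(def block-size 16384)

(defn checkpointed-ops
  "Number of operations sitting in complete blocks after n operations
  have been recorded. Hypothetical helper, for illustration only."
  [n]
  (* block-size (quot n block-size)))

(checkpointed-ops 5000)   ; short run: 0 ops readable, history looks empty
(checkpointed-ops 40000)  ; two complete blocks: 32768 ops readable
```

Under this model, a test that stops before its 16384th operation shows an empty history on disk, which matches the :history nil reported above.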

aphyr commented 1 year ago

Oh, but for ^C handling, yes, it would be nice if it checkpointed everything! We could modify the signal handler in jepsen.core to... somehow reach into the streaming block writer (which I think is held in jepsen.interpreter) and close it. Maybe by interrupting the generator interpreter and catching the interrupt? Or by directly calling .close on the block writer? That should be thread-safe; it indirects through a concurrent queue.
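
One way the plumbing could look, sketched in plain Clojure without any Jepsen code: whoever opens the writer registers it in an atom, and the shutdown hook closes whatever is registered. The names `open-writer`, `close-open-writer!`, and `install-shutdown-hook!` are hypothetical, and the sketch assumes the real block writer's close is safe to call from another thread, as suggested above.

```clojure
(import '(java.io Closeable))

;; Hypothetical registry: the code that opens the writer would store it
;; here, and clear it again once the writer is closed normally.
(defonce open-writer (atom nil))

(defn close-open-writer!
  "Closes the registered writer, if any, at most once.
  Returns true if something was closed, false otherwise."
  []
  ;; reset-vals! returns [old new]; swapping in nil atomically ensures
  ;; that only one caller ever sees (and closes) the old writer.
  (if-let [w (first (reset-vals! open-writer nil))]
    (do (.close ^Closeable w) true)
    false))

(defn install-shutdown-hook!
  "Registers a JVM shutdown hook that closes the registered writer."
  []
  (.addShutdownHook (Runtime/getRuntime)
                    (Thread. close-open-writer!)))
```

For example, registering `(reify Closeable (close [_] (println "sealed")))` with `(reset! open-writer ...)` and then calling `(close-open-writer!)` prints once; a second call does nothing. Whether closing the writer this way leaves a readable history in test.jepsen is not something this sketch establishes.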

nurturenature commented 1 year ago

> When you say "during the test run", do you mean you're looking at the history while the test is still running? It only checkpoints every 16384 operations in the history, so if you haven't gotten that far the history will claim to be empty.

My bad, it even says so right in the commit: "Histories are chunked on disk into 16384-operation blocks"!

The test I am currently working on doesn't make it to 16k ops. Before then it triggers a panic in one or more db replicas, quorum is lost, and that impairs the db clients to the point of no return.

I am working on making the Jepsen client more tolerant of a completely unresponsive system, and then updating the generators to nil out.

> but for ^C handling, yes

I'll take a look.

aphyr commented 1 year ago

Hmm, yeah generally Jepsen shouldn't crash when DBs do. You're killing it explicitly, I'm guessing? I think the general answer there is to put a timeout on client/invoke!.
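
A minimal sketch of the timeout idea, in plain Clojure so it runs without Jepsen: run the call on another thread and give up after a deadline, reporting the operation as indeterminate (:type :info). `invoke-with-timeout` and its `invoke-fn` argument are hypothetical names standing in for however a client actually performs the operation.

```clojure
(defn invoke-with-timeout
  "Runs (invoke-fn op) on another thread. If it has not finished within
  timeout-ms, cancels it and returns op marked as indeterminate.
  Hypothetical wrapper, for illustration only."
  [invoke-fn op timeout-ms]
  (let [fut    (future (invoke-fn op))
        ;; deref with a timeout returns ::timeout if the deadline passes
        result (deref fut timeout-ms ::timeout)]
    (if (= ::timeout result)
      (do (future-cancel fut) ; interrupts the worker thread if it is blocked
          (assoc op :type :info, :error :timeout))
      result)))
```

With this wrapper, a call that hangs against an unresponsive system comes back as an :info operation after `timeout-ms` instead of blocking the worker forever. Exceptions thrown by `invoke-fn` still propagate (wrapped in an ExecutionException), so a real client would want to handle those as well.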

But saving more of the history on ^C is good too! I'd love to see this feature--if you don't build it I'm sure I will later. Heads-down on a gnarly research project at the moment for faster history reduction, or I'd go build it right now!

nurturenature commented 1 year ago

> But saving more of the history on ^C is good too! I'd love to see this feature--if you don't build it I'm sure I will later.

Experimented with:

(with-thread-name "Jepsen shutdown hook"
  (info "Downloading DB logs before JVM shutdown...")
  (snarf-logs! ~test)
  (store/update-symlinks! ~test)

  ; 👇 try a naive way to save the history and updated test
  (store.format/write-test-with-history! (->> ~test :store :handle) ~test))

to no avail, so it will have to remain aspirational for me for now.

> Hmm, yeah generally Jepsen shouldn't crash when DBs do. You're killing it explicitly, I'm guessing? I think the general answer there is to put a timeout on client/invoke!.

The client, written in Zig and tightly tied to the db, can panic at times, which brings down the JVM. And Jepsen is so good at finding reasons to panic. 🙂

> Heads-down on a gnarly research project at the moment for faster history reduction, or I'd go build it right now!

Going to close the issue as the history does stream, ^C was just a wish, and to help focus on your research.

Just noticed jepsen.history too!

aphyr commented 1 year ago

Ahhh that does make sense! And yeah, the bit of code you want for sealing the history is to somehow invoke BigVectorBlockWriter.close. That block writer is held in jepsen.generator.interpreter. I'm not totally sure how to connect the plumbing from the shutdown hook to there, but that's what has to happen!

aphyr commented 1 year ago

And yeah, jepsen.history is... ah, I'm really excited. Been wanting this for the better part of seven years! It's close! I'm tackling what I think is the hardest problem right now, then I can go back in and start speeding up checkers.