Closed nurturenature closed 1 year ago
OOoooh, yes please, this is long overdue! I've got a few comments here--mostly around wanting to preserve as much compatibility as we can, because there are some non-public tests that rely on these APIs as well and they're going to break if we do this. Would definitely like to see this merged though! :-)
Yes to preserving compatibility for existing users!
I think that means leaving flaky!/slow!/fast! exactly as is.
My first attempt was to create a package for flaky! and slow!, and it's not a good fit for a combined nemesis.
What a more general purpose packet nemesis wants is the ability to :start a netem (network emulation) configured with specified faults, and then to :stop, return to normal operation.
tc works with netem configs as a whole. You can't just add a bit 'o corruption to an existing delay. You add/del/repl/chg/etc the entire netem config.
So, how about adding a single new fn to the Net protocol?
(defn shape
"Shape packet egress traffic.
Configures a network emulator with the given faults and applies it to the test nodes.
To remove all faults and return to normal operation, use `faults = nil`."
[net test faults])
And renaming the new (in the PR) flaky!-(nemesis, package)
to packet!-(nemesis, package)
?
Would making these changes:
flaky!/slow!/fast!
packet-nemesis
combined nemesis packagebe helpful to move things forward?
Thanks!
I mean, I like your ideas here, and I'm willing to do API changes--I just want to offer users as much compatibility as possible, and if that's not possible we should give them some kind of polite error message so they know what's breaking. That could be adding a new shape!
protocol fn, but I'll leave that to your judgement--you probably understand this code better than me right now! I also think that having a nemesis.combined implementation like you've done here is a good idea. :-)
Not refactoring (flaky!, slow!, fast!)
, and instead focusing on an ergonomic packet-nemesis, turns out pretty good.
Language is now about packets and how to shape them with faults.
Combined nemesis package:
(nc/nemesis-package
{:packet {:targets (:db-targets opts),
:faults [{:delay {}, :loss {}},
{:corrupt {:percent :33%}},
{:duplicate {:percent :25%, :correlation :80%},
:reorder {:percent :30%, :correlation :80%}}]}})
Using the packet-nemesis
specifically in a test:
;; trying to test something specific
(->> (:generator workload)
(gen/nemesis
(gen/flip-flop
(repeat {:type :info
:f :start-packet
;; bug fixed?, be heavy handed
:value [:primaries {:corrupt :percent :80%}]})
(repeat {:type :info
:f :stop-packet
:value nil})))
...)
The logs:
:nemesis :info :start-packet [:minority {:delay {}, :loss {}}]
:nemesis :info :start-packet {"n5" :shaped, "n3" :shaped}
...
:nemesis :info :stop-packet nil
:nemesis :info :stop-packet {"n1" :reliable, "n2" :reliable, "n3" :reliable, "n4" :reliable, "n5" :reliable}
The Net protocol was extended by adding (shape! net test faults)
.
I think it now fits quite nicely with the existing combined nemesis packages and with no other impacts.
OK, this is progress! I think my earlier comment about :faults still seems to apply here--we already use the word :fault
to refer to a keyword naming a discrete class of failure, and here it seems like it's more of a... sub-fault, mode, scenario.... Can you think of a different thing to call it?
Oops, my :fault
:upside_down_face: .
How about a story-line of:
The Net protocol has learned how to shape network behavior to disrupt packets:
(shape! [net test behaviors] "Shapes network behavior, i.e. packet delay, loss, corruption, duplication, and reordering.")
(def all-packet-behaviors
"All of the available network packet behaviors, and their default option values.")
So a packet-nemesis
can :start/:stop behaviors that disrupts packets:
{:f :start-packet :value [:node-spec, {:delay {}, :loss {:percent :33%}}]}
{:f :stop-packet :value nil}
And a packet-package
can specify a collection of packet behaviors to apply to targets:
{:packet
{:targets ; A collection of node specs, e.g. [:one, :all]
:behaviors [ ; A collection of network behaviors that disrupt packets, e.g.:
{} ; no disruptions
{:delay {}} ; delay packets by default amount
{:corrupt {:percent :33%}} ; corrupt 33% of packets
;; delay packets by default values, plus duplicate 25% of packets
{:delay {},
:duplicate {:percent :25% :correlation :80%}}]}}
Appearing in the logs as:
:nemesis :info :start-packet [:minority {:delay {}, :loss {}}]
:nemesis :info :start-packet {"n3" :shaped, "n1" :shaped}
...
:nemesis :info :stop-packet nil
:nemesis :info :stop-packet {"n1" :reliable, "n2" :reliable, "n3" :reliable, "n4" :reliable, "n5" :reliable}
I think this is also easy to explain, teach: "Let's add a Nemesis that changes the network behavior to disrupts packets".
Other possibilities from reading the tc-netem
docs:
And from a glance at the thesaurus:
The latest commit went with behavior(s).
Does that read right to you, or should I try a different framing?
Yeah, behavior
works for me! This looks good, let's go ahead and merge!
Sorry it's taken a while to review code--this week has been absolutely brutal on my schedule. Thanks for bearing with me. :-)
Oh, and if you want, https://gist.github.com/aphyr/6d40e0febf25d12653b0abb868b6bf9e shows how to generate netem commands that don't break Jepsen's own connection to the node, which means we don't accidentally lock ourselves out of the cluster. :-)
Thanks for merging!
I always feel like I'm getting something for free using the nemesis.combined packages.
re net-flaky-nemesis.clj
and
which means we don't accidentally lock ourselves out of the cluster. :-)
Wow! I almost put in a mention that if your test crashes, you break-out, etc, that the db node can be hard to manage and it may be easier/necessary to restart the container.
I'll look at making Net/shape!
not shape src/dest control node traffic.
Jepsen takes care to not affect the control node's abilities and the packet-nemesis
should too.
Add the ability to delay, lose, corrupt, duplicate, and reorder packets by supporting more
tc netem
options in the Net protocol.Refactor Net/(
flaky!, !slow!, !fast) -> Net(flaky!, reliable!)tc
tc ... add netem {opts}
tc ... del
Faults, options, and values use
tc-netem
terms.A fault map describes the desired
tc-netem
(network emulator) state, e.g.:See
jepsen.net/flaky-network-emulations
for all faults and default option values.To use in your test:
Reads well in the log:
Affects db behavior:
Can elicit bugs:
Be observed from cli:
It's an API change.
I think that's Ok?
I tried using current
Net/(flaky!, slow!, fast!)
and it has issues:fast!
flaky!
andslow!
get in each other's waytc
Exceptionsfast!
resets the other's stateSo wondering how much of an impact an API change would actually have?
net/(noop, iptables, ipfilter)
have been updated for the new Net protocol.If the PR is accepted,
maelstrom.net/net
can also be updated.Developed and tested with LXC on Debian 11.
Please let me know if you want any renaming, code, doc, or any other changes.