jepsen-io / jepsen

A framework for distributed systems verification, with fault injection
6.68k stars 711 forks source link

nemesis.combined/packet-(nemesis, package) for packet delay, loss, corruption, duplication, and reordering. Adds packet shaping to Net protocol. #543

Closed nurturenature closed 1 year ago

nurturenature commented 1 year ago

Add the ability to delay, lose, corrupt, duplicate, and reorder packets by supporting more tc netem options in the Net protocol.

Refactor Net/(flaky!, !slow!, !fast) -> Net(flaky!, reliable!)

Faults, options, and values use tc-netem terms.


A fault map describes the desired tc-netem (network emulator) state, e.g.:

{}                         ; no faults
{:delay {}}                ; delay packets by default amount
{:corrupt {:percent :33%}} ; corrupt 33% of packets
;; delay packets by default values, plus duplicate 25% of packets
{:delay {},
 :duplicate {:percent :25% :correlation :80%}}

See jepsen.net/flaky-network-emulations for all faults and default option values.


To use in your test:

(nc/nemesis-package
 {...
  :flaky {:targets (:db-targets opts)
          :faults [; typical day
                   {:delay {},
                    :loss  {}}
                   ; ummm, about that firmware update...
                   {:corrupt {:percent :33%}}
                   ; are squirrels partying in the junction box?!?
                   {:delay {},
                    :loss  {},
                    :corrupt   {:percent :15%},
                    :duplicate {:percent :25% :correlation :80%},
                    :reorder   {:percent :30% :correlation :80%}}]}})

Reads well in the log:

:nemesis   :info   :start-flaky    [:minority {:delay {}}]
:nemesis   :info   :start-flaky    {"n1" :flaky, "n5" :flaky}
...
:nemesis   :info   :stop-flaky     nil
:nemesis   :info   :stop-flaky     {"n1" :reliable, "n2" :reliable, "n3" :reliable, "n4" :reliable, "n5" :reliable}

Affects db behavior:

flaky-nemesis-latency-raw

Can elicit bugs:

;; g-set
{:final-reads {:valid? false,
               :missing-count 11,
               :missing (2344
                         2346
                         2348
                         ...)}}

Be observed from cli:

root@n1:~# ping n2
64 bytes from n2 (192.168.122.102): icmp_seq=1068 ttl=64 time=0.194 ms
# on n2: delay 50ms 10ms 25% distribution normal
64 bytes from n2 (192.168.122.102): icmp_seq=1069 ttl=64 time=49.5 ms

# on n2: corrupt 50% 80%
64 bytes from n2 (192.168.122.102): icmp_seq=858 ttl=64 time=0.228 ms
ping: Warning: time of day goes back (-268435455999786us), taking countermeasures

# on n2: duplicate 33% 80%
64 bytes from n2 (192.168.122.102): icmp_seq=382 ttl=64 time=1.03 ms (DUP!)

It's an API change.

I think that's Ok?

I tried using current Net/(flaky!, slow!, fast!) and it has issues:

So wondering how much of an impact an API change would actually have?


net/(noop, iptables, ipfilter) have been updated for the new Net protocol.

If the PR is accepted, maelstrom.net/net can also be updated.

Developed and tested with LXC on Debian 11.

Please let me know if you want any renaming, code, doc, or any other changes.

aphyr commented 1 year ago

OOoooh, yes please, this is long overdue! I've got a few comments here--mostly around wanting to preserve as much compatibility as we can, because there are some non-public tests that rely on these APIs as well and they're going to break if we do this. Would definitely like to see this merged though! :-)

nurturenature commented 1 year ago

Yes to preserving compatibility for existing users!

I think that means leaving flaky!/slow!/fast! exactly as is.

My first attempt was to create a package for flaky! and slow!, and it's not a good fit for a combined nemesis.

What a more general purpose packet nemesis wants is the ability to :start a netem (network emulation) configured with specified faults, and then to :stop, return to normal operation.

tc works with netem configs as a whole. You can't just add a bit 'o corruption to an existing delay. You add/del/repl/chg/etc the entire netem config.

So, how about adding a single new fn to the Net protocol?

(defn shape
  "Shape packet egress traffic.
  Configures a network emulator with the given faults and applies it to the test nodes.

  To remove all faults and return to normal operation, use `faults = nil`."
  [net test faults])

And renaming the new (in the PR) flaky!-(nemesis, package) to packet!-(nemesis, package)?

Would making these changes:

be helpful to move things forward?

Thanks!

aphyr commented 1 year ago

I mean, I like your ideas here, and I'm willing to do API changes--I just want to offer users as much compatibility as possible, and if that's not possible we should give them some kind of polite error message so they know what's breaking. That could be adding a new shape! protocol fn, but I'll leave that to your judgement--you probably understand this code better than me right now! I also think that having a nemesis.combined implementation like you've done here is a good idea. :-)

nurturenature commented 1 year ago

Not refactoring (flaky!, slow!, fast!), and instead focusing on an ergonomic packet-nemesis, turns out pretty good.

Language is now about packets and how to shape them with faults.

Combined nemesis package:

(nc/nemesis-package
 {:packet {:targets (:db-targets opts),
           :faults  [{:delay {}, :loss  {}},
                     {:corrupt {:percent :33%}},
                     {:duplicate {:percent :25%, :correlation :80%},
                      :reorder    {:percent :30%, :correlation :80%}}]}})

Using the packet-nemesis specifically in a test:

;; trying to test something specific
(->> (:generator workload)
     (gen/nemesis
      (gen/flip-flop
       (repeat {:type  :info
                :f     :start-packet
                ;; bug fixed?, be heavy handed
                :value [:primaries {:corrupt :percent :80%}]})
       (repeat {:type  :info
                :f     :stop-packet
                :value nil})))
     ...)

The logs:

:nemesis   :info   :start-packet   [:minority {:delay {}, :loss {}}]
:nemesis   :info   :start-packet   {"n5" :shaped, "n3" :shaped}
...
:nemesis   :info   :stop-packet    nil
:nemesis   :info   :stop-packet    {"n1" :reliable, "n2" :reliable, "n3" :reliable, "n4" :reliable, "n5" :reliable}

The Net protocol was extended by adding (shape! net test faults).


I think it now fits quite nicely with the existing combined nemesis packages and with no other impacts.

aphyr commented 1 year ago

OK, this is progress! I think my earlier comment about :faults still seems to apply here--we already use the word :fault to refer to a keyword naming a discrete class of failure, and here it seems like it's more of a... sub-fault, mode, scenario.... Can you think of a different thing to call it?

nurturenature commented 1 year ago

Oops, my :fault :upside_down_face: .

How about a story-line of:

The Net protocol has learned how to shape network behavior to disrupt packets:

(shape! [net test behaviors] "Shapes network behavior, i.e. packet delay, loss, corruption, duplication, and reordering.")

(def all-packet-behaviors
  "All of the available network packet behaviors, and their default option values.")

So a packet-nemesis can :start/:stop behaviors that disrupts packets:

{:f :start-packet :value [:node-spec, {:delay {}, :loss {:percent :33%}}]}
{:f :stop-packet  :value nil}

And a packet-package can specify a collection of packet behaviors to apply to targets:

{:packet
  {:targets      ; A collection of node specs, e.g. [:one, :all]
   :behaviors [  ; A collection of network behaviors that disrupt packets, e.g.:
      {}                         ; no disruptions
      {:delay {}}                ; delay packets by default amount
      {:corrupt {:percent :33%}} ; corrupt 33% of packets
      ;; delay packets by default values, plus duplicate 25% of packets
      {:delay {},
       :duplicate {:percent :25% :correlation :80%}}]}}

Appearing in the logs as:

:nemesis  :info   :start-packet   [:minority {:delay {}, :loss {}}]
:nemesis  :info   :start-packet   {"n3" :shaped, "n1" :shaped}
...
:nemesis  :info   :stop-packet    nil
:nemesis  :info   :stop-packet    {"n1" :reliable, "n2" :reliable, "n3" :reliable, "n4" :reliable, "n5" :reliable}

I think this is also easy to explain, teach: "Let's add a Nemesis that changes the network behavior to disrupts packets".

Other possibilities from reading the tc-netem docs:

And from a glance at the thesaurus:

The latest commit went with behavior(s).

Does that read right to you, or should I try a different framing?

aphyr commented 1 year ago

Yeah, behavior works for me! This looks good, let's go ahead and merge!

Sorry it's taken a while to review code--this week has been absolutely brutal on my schedule. Thanks for bearing with me. :-)

Oh, and if you want, https://gist.github.com/aphyr/6d40e0febf25d12653b0abb868b6bf9e shows how to generate netem commands that don't break Jepsen's own connection to the node, which means we don't accidentally lock ourselves out of the cluster. :-)

nurturenature commented 1 year ago

Thanks for merging!

I always feel like I'm getting something for free using the nemesis.combined packages.

re net-flaky-nemesis.clj and

which means we don't accidentally lock ourselves out of the cluster. :-)

Wow! I almost put in a mention that if your test crashes, you break-out, etc, that the db node can be hard to manage and it may be easier/necessary to restart the container.

I'll look at making Net/shape! not shape src/dest control node traffic. Jepsen takes care to not affect the control node's abilities and the packet-nemesis should too.