jepsen-io / jepsen

A framework for distributed systems verification, with fault injection

Add wide-area network nemesis to nemesis.combined. #452

Closed: stevana closed this 3 years ago

stevana commented 4 years ago

I agree that WAN nemesis should operate on the level of datacenters.

Perhaps I've simply chosen a bad name for the nemesis. What if we call it bad-network-nemesis, or something less of a mouthful, instead and leave it targeting individual nodes?

Once you are done figuring out how datacenters work, it would make sense to lift both bad-network-nemesis and partition-nemesis to the level of datacenters. Perhaps even some clock skews (bad NTP server?) or kills (power outage?) might make sense to operate on datacenters?

Anyhow, it seems like at least two separate PRs (introduce datacenters and make nemesis.combined datacenter aware) that don't necessarily have to hold up this one from going in (because most of nemesis.combined will have to be refactored anyway)? :-)

aphyr commented 4 years ago

> Perhaps I've simply chosen a bad name for the nemesis. What if we call it bad-network-nemesis, or something less of a mouthful, instead and leave it targeting individual nodes?

Naw, it's that (part of) the WAN nemesis already exists in the form of the partition nemesis package, and it doesn't make sense to have two nemeses that do the same thing. Flaky and slow networks should be new, possibly separate nemesis packages, with basically the same structure.

> (because most of nemesis.combined will have to be refactored anyway)?

I don't think any of the existing nemesis.combined needs to be refactored. I'd start by adding some kind of datacenter target to the existing partition nemesis, relying on a :datacenters field in the test--I think once you've got that, everything else will basically fall into place.
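
To make the datacenter-target idea concrete, here is a rough, library-free sketch in Python rather than Jepsen's Clojure. It is not Jepsen's API. The shape of the `:datacenters` field (a map from datacenter name to its nodes) and the function name `datacenter_grudge` are assumptions made for illustration, since the thread does not define either. The result is a map from each node to the set of nodes whose traffic it should drop, which is one plausible way to describe a partition at datacenter granularity.

```python
from typing import Dict, List, Set

# Hypothetical shape for the test's :datacenters field:
# datacenter name -> list of node names. The thread does not specify this.
Datacenters = Dict[str, List[str]]


def datacenter_grudge(datacenters: Datacenters, isolated: Set[str]) -> Dict[str, Set[str]]:
    """Cut the `isolated` datacenters off from every other datacenter.

    Returns a map from node to the set of nodes whose traffic it should drop.
    Nodes on the same side of the partition can still talk to each other.
    """
    # Nodes living in the datacenters being cut off.
    inside = {node for dc in isolated for node in datacenters[dc]}
    # Nodes in every remaining datacenter.
    outside = {
        node
        for dc, nodes in datacenters.items()
        if dc not in isolated
        for node in nodes
    }
    grudge: Dict[str, Set[str]] = {}
    for node in inside:
        grudge[node] = set(outside)
    for node in outside:
        grudge[node] = set(inside)
    return grudge


# Example: isolate dc1 from dc2 and dc3.
dcs = {"dc1": ["n1", "n2"], "dc2": ["n3", "n4"], "dc3": ["n5"]}
grudge = datacenter_grudge(dcs, {"dc1"})
```

In this example `n1` and `n2` drop traffic from `n3`, `n4`, and `n5`, and vice versa, while links inside each side stay up. The same "pick datacenters, then expand to nodes" step could in principle serve other faults mentioned in the thread, such as clock skew or kills, but that is speculation rather than something the thread settles.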