A common fault that affects applications is the drop of open connections to services such as databases. The drops may be caused by network issues or due to saturation in the server side. In any case, the application should be prepared for handing these drops and reestablish the connections. This process can be particularly complex when connection pools are used, because the health of the available connections in the pool must be updated..

In Kubernetes, such services are deployed as Pods (for example, as a stateful set), therefore this fault should be supported by the PodDisruptor.

Possible implementations

There seems to be multiple ways to implement this. Perhaps the most prominent are:

(A) iptables -m statistic, which should be able to terminate connections with -j REJECT --with-tcp-reset.
(B) Capturing packets and forging a tcp RST when we see a connection that we want to kill. This is what tcpkill and tcpbutcher do.
(C) Register a userspace handler for netfilter (-j NFQUEUE), which replaces packet capture, and:
- (C1) Forging a tcp RST packet when we want to terminate a connection, then drop the packet
- (C2) Set a mark on the packet, which then iptables looks for and rejects it with --with-tcp-reset

Solutions that wouldn't work well for this are:

tc's netem, which supports probabilistic packet loss, but is not very useful for connection dropping as TCP will transparently recover from (non-extreme) packet loss.

Solution A (plain iptables)

This would be the easiest solution, as it only requires setting a netfilter rule. However, it is also the least flexible: We are limited to terminating connections using only the logic that is exposed in the iptables commands. This means that we would not be able to, for example, take random decisions based on connections, only on packets.

Solution B (capture packets, forge RST)

This is what tcpkill and tcpbutcher do, which does not require any interaction with netfilter. This approach works well but has two caveats.

The first caveat is that there are some nuances related to packet injection with libpcap (which tcpbutcher uses, but tcpkill doesn't), which in some experiments caused RSTs to not be correctly sent to the local end of the connection, causing only the remote end to be terminated. I haven't researched this deeply and it might be a solvable issue.

The second caveat is that packet capture and forging occur concurrently with the normal connection flow. This is important as, for forged RSTs to work, their SEQ number needs to land in the current TCP window. If the application is processing a large amount of traffic in a short amount of time, it is possible for the window to move past the forged RST's SEQ before it gets sent, rendering it useless.

tcpkill works around this by defining a "severity" (or "aggresiveness") parameter N, and sending not one but N RST packets for the current and next N windows:

https://github.com/ggreer/dsniff/blob/2598e49ab1272873e4ea71d9b3163ef7edcc40ea/tcpkill.c#L70-L71

This strategy is an improvement, but still not guaranteed to work.

Solution C (nfqueue)

The caveat above of injection being concurrent with normal flow of packets can be removed by replacing packet capture with NFQUEUE: Instead of asynchronously capturing traffic, we force netfilter to send every packet to us and wait for us to come to a decision. This way, we can guarantee that our packet is sent before the window moves.

This, however, comes at the cost of performance: Our userspace code can become a bottleneck as every potential packet will need to flow through it. We will need to do some experimentation to assess how fast we can process packets.

Solution C1 would use libpcap for injecting the RST packet. Solution C2 would, instead, put a flag on the packet so a subsequent iptables rule can -j REJECT --with-tcp-reset.

Terminating connections

Independently of the technical solution, it might also be worth discussing how we want to model connection termination. Apart from some matching criteria (e.g. for a given destination port), we will want to specify how many connections to kill.

Option 1: Pure random percentage

The simplest, non-fair approach would be for the user to specify a percentage of connections to be terminated. For each packet matching the criteria, we check whether a random number in [1, 100] is smaller than the percentage and if its, we terminate the connection.

This would most likely not be a good solution, as connections with high traffic would have a higher chance to be terminated than connections with lower traffic.

Option 2: Percentage using 4-tuple hash

The simplest approach that is fair could be for the user to specify a percentage of connections to be terminated (e.g. 10%), and then compare that with the modulus of the 4-tuple hash for the connection. That way, we can hash the 4-tuple for each packet (source IP, source port, destination IP, destination port), take its modulus 100, and compare if it is smaller than the target percentage. If it is, we terminate the connection. As TCP requires acknowledgements, we only need to do this for either ingress or egress packets.

By checking the 4-tuple, which is constant for a given connection, instead of a per-packet random number, we keep the same chance of termination regardless of throughput.

The downside, however, is that a given connection would either be killed instantly, or never killed. This does not map very well to real-world connection dropping cases, and on top of that, over time it will converge to a set of connections that are naturally selected to be never killed

Option 3: Percentage using 4-tuple hash and truncated time

To work around the issues of option 2, we can integrate a truncated timestamp into the 4-tuple hash. This, way, for each packet, we can compute the hash of: (source IP, source port, destination IP, destination port, truncated timestamp), where by truncated timestamp we mean a timestamp with a maximum resolution, e.g. 10s. By doing this, the hash will change for a given connection every time the maximum resolution passes, e.g. every 10s. Then, like option 2, we take the modulus of this hash and compare it with a user-defined percentage.

This strategy ensures that:

Connections are considered fairly regardless of traffic: More packets do not increase the chance of the connection being terminated.
Connections that were not selected to be terminated on a given time, will be re-checked periodically.

To the user, this would be exposed simply as a percentage and a time (our resolution), and documented as "Terminate % of the connections every T seconds".

Thanks for the detailed explanation of the alternatives and their tradeoffs.

First of all, I think it is important to contextualize this discussion around the API we what we wan to offer to the developers.

The way I see the fault injection API for dropping connections would be as shown below: drop a percentage of the connections towards a target port for a duration specified in the duration of the injectDropConnectionFault method. Notice that the IP address is not specified because it is dynamically set for each target pod.

fault = {
    port: <target port>,
    rate: <percentage of connections to drop>
}

disruptor.injectDropConnectionFault(fault, '10s')

Based on this requirement and your analysis, I would lean toward exploring the use of NFQUEUE as it seems to offer more flexibility.

As for the performance concerns, I have not investigated this in detail, but I think we can optimize this by using session marks. For example, instruct iptables to forward only non-marked packages to our program and once we process a package for a session (regardless of the decision) mark it to prevent the forwarding of further packages.

But definitely, we need to evaluate the overhead.

Regarding the mechanism used for deciding to terminate a connection, I think we should only consider two parameters: the source IP and port, as we are disrupting connections towards a fixed destination (IP and port).

I'm not sure about the difference between options 2 and 3 (consider the timestamp) I'm not sure if we need to reconsider the decision periodically because we are making this decision on each package. Therefore, if for a package we decide not to drop the connection, we will reevaluate this decision for the next package. Could you elaborate on the scenario you consider this periodic re-evaluation is needed?

I would lean toward exploring the use of NFQUEUE as it seems to offer more flexibility.

I agree, I think NFQUEUE + marking is the most interesting path to explore.

instruct iptables to forward only non-marked packages to our program and once we process a package for a session (regardless of the decision) mark it to prevent the forwarding of further packages.

This is very interesting, I didn't know iptables could "keep track" of per-session marks. However I think this would have the disadvantage mentioned in option 2 (more below)

I'm not sure if we need to reconsider the decision periodically because we are making this decision on each package.

We can do a dice roll per packet, but I'm not sure we should: If we do, high-throughput connections will get terminated way faster than low-throughput ones, simply because the former will roll the dice many more times than the latter. If we want to simulate a scenario where a server drops connections, I think we would want a behavior that is not sensible to throughput.

Option 2 aims to solve that by making the result of the dice roll the same for a given the connection (4-tuple), so it doesn't matter how many times you roll it. Option 3 improves on that by adding a timeframe to ensure that every N seconds, dies are rolled again for each connection.

This is very interesting, I didn't know iptables could "keep track" of per-session marks. However I think this would have the disadvantage mentioned in option 2 (more below)

I don't follow you here. The idea is to make the decision to drop the session once per session and then stop intercepting packages for that session, this is particularly important for those sessions we decide not to drop as we are not adding overhead.

because we are making this decision on each package. Therefore, if for a package we decide not to drop the connection, we will reevaluate this decision for the next package.

We can do a dice roll per packet, but I'm not sure we should: If we do, high-throughput connections will get terminated way faster than low-throughput ones

I agree. We should make make the decision per session not per package. However, If understood correctly, then option 2 fixes the problem of unfairness. Therefore, I still don't understand why we want to re-evaluate that decision.

What concerns me is that we can end up dropping more sessions than requested. That is, instead of dropping 10% of the sessions in a period of 10s we will drop 10% each second, so after 10s we can potentially drop 100% of the original sessions.

What we could do is use the requested duration for truncating the timestamp to ensure we don't drop more than the requested percentage over the duration of the fault injection, but I still need to simulate this scenario in my head.

The main problem I see is that for a given pool of sessions, it is very likely that only a small number of them are active and we won't see any traffic for the rest of them. Therefore it can happen we don't drop as many sessions as requested just because we don't see them.

As I said before, we probably should simulate the different options before making any decision.

That is, instead of dropping 10% of the sessions in a period of 10s we will drop 10% each second, so after 10s we can potentially drop 100% of the original sessions.

This is true, and might not be entirely expected, However I think the alternative is not ideal either:

Let's consider an scenario where an application has a pool of 10 connections, and the test defines a 20% connection drop. Statistically, 8 connections will be left untouched, and 2 will be terminated. However, as the application re-opens those connections, those 2 new connections will have a 20% chance of being terminated, so the most likely scenario is that both survive. We now have 10 healthy connections that will never get terminated, even if the duration of the test is several minutes.

I'm starting to think both scenarios can be valid:

Terminate 10% of connections once and for new connections (no timestamp)
Periodically terminate 10% of connections (timestamp)

As the difference in implementation between these two proposals is very small (either add or not add a truncated timestamp to the hash), I suggest we start with the simplest (not adding the truncated timestamp to the hash) and see what users think. Adding the second scenario or changing the behavior should be pretty easy.

I'm starting to think these two scenarios can be both valid:

Terminate 10% of connections existing at the start of the test

Terminate 10% of connections as they appear

yes, I think this distinction is important and also that both are potentially valid.

As the difference in implementation between these two proposals is very small

This is good, but I'm more concerned about the developer experience. Can we define them (and their differences) easily? Can we use one or the other using an option in the fault?

I suggest we start with the simplest (not adding the truncated timestamp to the hash)

What of the above two scenarios does this correspond to? the second one?

Can we define them (and their differences) easily? Can we use one or the other using an option in the fault?

I think that with some documentation effort and careful wording we should be able to differentiate them. For example, we can describe the first scenario as:

const networkFault = {
  dropRate = 0.1,
};

A non-recurrent NetworkFault will terminate dropRate% of active connections to the target, once. Subsequent connections made to the target while the fault is active will have a dropRate% chance of failing.

As for the recurrent case, we could model it as:

const networkFault = {
  dropRate = 0.1,
  dropEvery = "10s",
};

A recurrent NetworkFault (dropEvery != null) will periodically terminate dropRate% of active connections to the target as specified by dropEvery, while the fault is active. Subsequent connections made to the target while the fault is active will have a dropRate% chance of failing.

This description is probably not perfect (in particular I do not like the dropEvery name) but I think we can iterate on that and count on external feedback to help make it clear.