grafana / xk6-disruptor

Extension for injecting faults into k6 tests
https://k6.io/docs/javascript-api/xk6-disruptor/
GNU Affero General Public License v3.0

Dependency Disruptor #53

Open pablochacin opened 2 years ago

pablochacin commented 2 years ago

It is a common use case to test the effect of known patterns of behavior in external dependencies (services that are not under the control of the organization). Using the xk6-disruptor, this could be accomplished by implementing a Dependency Disruptor, which instead of disrupting a service (or a group of pods), disrupts the requests these pods make to other services.

This could be implemented using an approach similar to the one used by the disruptor: injecting a transparent proxy, but in this case for outgoing requests.

This approach will work well if the service is a dependency for a small set of pods (for example, the pods that back an internal service) but will not work well if many different pods (e.g. many different internal services) use this external dependency.

From the implementation perspective, the two main blockers for this functionality are:

  1. TLS termination. For external services, the most common scenario is to use encrypted communications using TLS. In this case, the disruptor cannot modify the response (e.g. the status code). Moreover, the traffic cannot be intercepted using a simple proxy because the handshaking would fail. Using eBPF may open some alternatives.

  2. How to identify the IP address(es) of the dependency. Currently, the disruptor uses iptables to redirect the traffic to the proxy that injects the faults. In the case of the dependency disruptor the traffic going to the external service is the one that must be intercepted. However, the IP address of this external dependency may not be known at the time the disruptor agent is installed, or it can change during the execution of the disruption (for example, if the external dependency uses DNS load balancing).

roobre commented 1 year ago

I think this would be super interesting to test cloud services that are notable for requiring retry-heavy code, for example S3.

Unfortunately the technical blockers seem hard to avoid. Dumping my thoughts here for future reference:

Identifying IP addresses

I don't think it's possible to do this statically, for the very reasons you mention. The naive solution would be to intercept all outgoing traffic and then decide at the application layer whether we want to do something with each request or not.
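As a rough illustration of the application-layer filter, here is a minimal Go sketch. It assumes the proxy can see the requested host name (from the HTTP Host header or the TLS SNI); the function name `isTarget` and the target domain are hypothetical, not part of the disruptor.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// isTarget reports whether hostport (a Host header or SNI value, with or
// without a port) names one of the target domains or a subdomain of one.
// Hypothetical helper, for illustration only.
func isTarget(hostport string, targets []string) bool {
	host := hostport
	// Strip the port if there is one; keep the value as-is otherwise.
	if h, _, err := net.SplitHostPort(hostport); err == nil {
		host = h
	}
	host = strings.ToLower(strings.TrimSuffix(host, "."))
	for _, t := range targets {
		t = strings.ToLower(t)
		if host == t || strings.HasSuffix(host, "."+t) {
			return true
		}
	}
	return false
}

func main() {
	targets := []string{"s3.amazonaws.com"} // assumed dependency
	hosts := []string{"s3.amazonaws.com:443", "bucket.s3.amazonaws.com", "example.com:443"}
	for _, h := range hosts {
		fmt.Printf("%s disrupt=%v\n", h, isTarget(h, targets))
	}
}
```

Traffic that does not match would be forwarded untouched, so the cost of this approach is that every outgoing connection goes through the proxy.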

A less naive solution would be to hijack DNS queries for the target load. I see three ways of doing this, the first two provided by Kubernetes:

  1. Exposing a custom nameserver and injecting it in the pods' dnsConfig.nameservers. This gives us high flexibility but it's more work (we need a nameserver).
  2. Statically injecting the domains in the pod's hostAliases. With this we need to know the target names in advance, but we don't need any nameserver (simpler and potentially more reliable).
  3. Using more aggressive techniques, like iptables/eBPF, to intercept DNS queries. iptables would again require us to ship a nameserver in the agent.

Options 1 and 2 would require an application restart, which is not very good.
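To make option 2 concrete, here is a minimal Go sketch that builds a merge patch adding a hostAliases entry to a Deployment's pod template. The `ip` and `hostnames` fields are the ones Kubernetes defines for hostAliases; the proxy address `10.0.0.10`, the domain, and the helper name are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// hostAlias mirrors one entry of the pod spec's hostAliases list.
type hostAlias struct {
	IP        string   `json:"ip"`
	Hostnames []string `json:"hostnames"`
}

// patch models only the part of a Deployment that we want to change.
type patch struct {
	Spec struct {
		Template struct {
			Spec struct {
				HostAliases []hostAlias `json:"hostAliases"`
			} `json:"spec"`
		} `json:"template"`
	} `json:"spec"`
}

// hostAliasesPatch returns a JSON patch body that makes the given hostnames
// resolve to ip inside the pods. Hypothetical helper, for illustration only.
func hostAliasesPatch(ip string, hostnames ...string) (string, error) {
	var p patch
	p.Spec.Template.Spec.HostAliases = []hostAlias{{IP: ip, Hostnames: hostnames}}
	b, err := json.Marshal(p)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// 10.0.0.10 stands in for wherever the fault-injecting proxy listens.
	body, err := hostAliasesPatch("10.0.0.10", "s3.amazonaws.com")
	if err != nil {
		panic(err)
	}
	fmt.Println(body)
}
```

Because the patch modifies the pod template, applying it rolls out new pods, which is the restart mentioned above.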

TLS termination

Doing this transparently will most likely be hard, because TLS is designed to make this hard. The first not-so-intrusive approach that comes to mind would be modifying the application's pod to use an HTTP proxy, which we would control. With this, the application would trust the proxy and we would have full control.

A way to do this could be setting the http_proxy (and/or https_proxy) env vars, which many applications and libraries understand by default. This could be done by patching the k8s object, but would require an application restart (again, not very clean).
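As an example of a library that understands these variables, Go's net/http resolves the proxy through `http.ProxyFromEnvironment`. The sketch below is only an illustration under assumptions: the proxy addresses are hypothetical, and no request is actually sent. It also shows that Go reads the variables once per process, which is another reason a running application would not notice a change.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
)

// proxyFor returns the proxy URL that net/http's environment-based
// configuration selects for target ("" means a direct connection).
func proxyFor(target string) (string, error) {
	req, err := http.NewRequest(http.MethodGet, target, nil)
	if err != nil {
		return "", err
	}
	u, err := http.ProxyFromEnvironment(req)
	if err != nil || u == nil {
		return "", err
	}
	return u.String(), nil
}

// demo sets HTTPS_PROXY, asks which proxy would be used, then changes the
// variable and asks again within the same process.
func demo() (first, second string, err error) {
	// Clear exclusions that might be inherited from the environment.
	os.Setenv("NO_PROXY", "")
	os.Setenv("no_proxy", "")
	os.Setenv("HTTPS_PROXY", "http://127.0.0.1:8080") // hypothetical proxy
	if first, err = proxyFor("https://s3.amazonaws.com/"); err != nil {
		return
	}
	os.Setenv("HTTPS_PROXY", "http://127.0.0.1:9090") // changed too late
	second, err = proxyFor("https://s3.amazonaws.com/")
	return
}

func main() {
	first, second, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println("first lookup: ", first)
	fmt.Println("second lookup:", second)
}
```

Both lookups report the first proxy, because the environment is consulted only on the first call.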

pablochacin commented 1 year ago

Apparently, squid proxy implements a mechanism called sslBump for intercepting HTTPS traffic:

   ssl-bump For each CONNECT request allowed by ssl_bump ACLs,
            establish secure connection with the client and with
            the server, decrypt HTTPS messages as they pass through
            Squid, and treat them as unencrypted HTTP messages,
            becoming the man-in-the-middle.

How this is implemented:

Establish a TLS connection with the server (using client SNI, if any) and establish a TLS connection with the client (using a mimicked server certificate).

According to this tutorial, this requires a self-signed CA root certificate to be deployed in the client's SSL configuration.
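To illustrate why the CA must be deployed on the client side, here is a minimal Go sketch of the "mimicked server certificate" idea. It is not how Squid implements it: it only creates a self-signed CA, issues a certificate for a host name with it, and checks that verification succeeds only when the CA is in the client's trusted roots. All names and the host are assumptions.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// issue creates a certificate from tmpl with a fresh key. If parent is nil
// the certificate is self-signed; otherwise it is signed by parent/parentKey.
func issue(tmpl, parent *x509.Certificate, parentKey *ecdsa.PrivateKey) (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	if parent == nil {
		parent, parentKey = tmpl, key
	}
	tmpl.NotBefore = time.Now().Add(-time.Hour)
	tmpl.NotAfter = time.Now().Add(24 * time.Hour)
	der, err := x509.CreateCertificate(rand.Reader, tmpl, parent, &key.PublicKey, parentKey)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// demoTrust reports whether a mimicked certificate for host verifies with
// and without the proxy's CA in the client's root pool.
func demoTrust(host string) (withCA, withoutCA bool, err error) {
	ca, caKey, err := issue(&x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "hypothetical proxy CA"},
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}, nil, nil)
	if err != nil {
		return false, false, err
	}
	leaf, _, err := issue(&x509.Certificate{
		SerialNumber: big.NewInt(2),
		Subject:      pkix.Name{CommonName: host},
		DNSNames:     []string{host},
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
	}, ca, caKey)
	if err != nil {
		return false, false, err
	}
	trusted := x509.NewCertPool()
	trusted.AddCert(ca)
	_, errWith := leaf.Verify(x509.VerifyOptions{Roots: trusted, DNSName: host})
	_, errWithout := leaf.Verify(x509.VerifyOptions{Roots: x509.NewCertPool(), DNSName: host})
	return errWith == nil, errWithout == nil, nil
}

func main() {
	withCA, withoutCA, err := demoTrust("s3.amazonaws.com")
	if err != nil {
		panic(err)
	}
	fmt.Println("verifies with proxy CA trusted:   ", withCA)
	fmt.Println("verifies without proxy CA trusted:", withoutCA)
}
```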

roobre commented 1 year ago

Dropping some thoughts I've been having as well. I think that if we want to intercept TLS connections to trusted sources, we will be forced to perform a (benign) man-in-the-middle attack, just as described above. I see only two ways we can make this:

  1. Change (patch) the code that checks for certificate validity
  2. Add our certificate to the list of valid certificates the code checks

Regarding route 1, for all libraries I know, the code that decides whether a certificate is trusted is part of a library used by the application, so there is no universal (or reasonably wide-scoped) way to do this. Some applications will link dynamically to this library, while others will link statically. The code would need to intrude into the application, like a debugger would, and intercept these points.

Route 2 raises similar compatibility concerns, as different systems and libraries pick this list from different places. However, these systems and libraries often offer mechanisms to perform this specific task of adding certificates to the trusted pool. OpenSSL and Go, for example, will trust any certificate present in a directory listed in the SSL_CERT_DIR environment variable.

An unfortunate requirement would be that this list is typically loaded once when the application starts, so to make any change to this list effective, one would need to start the application after changing these env vars. After it has been restarted, the disruption can still be turned on and off through the usual means (iptables rules).

Another option within route 2 would be to drop certificates in places where we already know the system will look. This could be useful if a library does not support specifying additional paths through environment variables.

To close the route 2 approach list, there's also the option of intercepting system calls for files that look like CAs, and appending ours to the result. This would be more cumbersome but possible to do if we find libraries that read certificates from unpredictable paths, and do not allow modifying those paths externally.

As a summary of my thoughts, the options I can think of for TLS proxying are poking into things with a debugger (hard, non-portable) or adding our CAs to the system's trust (less hard, requires restarting). We would need to assess whether, given who is going to use the disruptor and in which environment they will use it, restarting the application is a downside we can live with.

pablochacin commented 1 year ago

> To close the route 2 approach list, there's also the option of intercepting system calls for files that look like CAs, and appending ours to the result. This would be more cumbersome but possible to do if we find libraries that read certificates from unpredictable paths, and do not allow modifying those paths externally.

Which mechanism could be used for this purpose? eBPF, for instance, does not allow modifying the results of a syscall.

pablochacin commented 1 year ago

> Add our certificate to the list of valid certificates the code checks

This is the same approach described in the comment above.

> Route 2 offers similar compatibility concerns, as different systems and libraries pick this list from different places. However, these systems and libraries often offer mechanisms to perform this specific task of adding certificates to the trusted pool. OpenSSL and Go, for example, will trust any certificate present in a directory listed in the SSL_CERT_DIR environment variable.
>
> An unfortunate requirement would be that this list is typically loaded once when the application starts,

Could you elaborate on why this is the case? I would expect the CA root certificate to be loaded on demand when validating a certificate that refers to it.

roobre commented 1 year ago

> Which mechanism could be used for this purpose? eBPF, for instance, does not allow modifying the results of a syscall.

I think this could be done by hijacking libc's open() with a wrapping library that we then LD_PRELOAD. However, this is far from ideal, as it is very intrusive and poses potential compatibility problems. Programs that do not use libc would not go through this path, for example.

> Could you elaborate on why this is the case? I would expect the CA root certificate to be loaded on demand when validating a certificate that refers to it

This could certainly vary between libraries and languages. I believe Go does it only once, upon the first verification, as it is sync.Onced here. Other libraries may do it differently, although I would not expect them to do it very often, as reading potentially hundreds of TLS certificates from disk can be a performance hit.
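A minimal sketch of that load-once pattern, to show why a CA added later goes unnoticed. This is not Go's actual crypto/x509 code; `rootStore` and the certificate names are hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"sync"
)

// rootStore mimics a library that loads its trusted roots lazily, and only
// once, on first use. Hypothetical type, for illustration only.
type rootStore struct {
	once  sync.Once
	load  func() []string // stands in for reading certificates from disk
	roots []string
}

// Roots returns the trusted roots, loading them on the first call only.
func (s *rootStore) Roots() []string {
	s.once.Do(func() { s.roots = s.load() })
	return s.roots
}

func main() {
	onDisk := []string{"public-ca"}
	store := &rootStore{load: func() []string {
		// Copy, so later changes to onDisk are not visible through the result.
		return append([]string(nil), onDisk...)
	}}

	fmt.Println(store.Roots()) // first verification triggers the load

	// The CA is dropped in the certificate directory after the first load.
	onDisk = append(onDisk, "disruptor-ca")

	fmt.Println(store.Roots()) // still the roots from the first load
}
```

Until the process restarts, the store keeps answering from the first load, so the injected CA is never consulted.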

pablochacin commented 1 year ago

The test implemented in this POC exploits a not well-documented feature in Docker that allows attaching a container to the network stack (or network namespace) of another container.

I don't see how this can work without restarting the application, which is not an option in our scenario.