Add support to preserve source IP using `TPROXY`

mateiidavid commented 2 years ago

This issue outlines the concepts involved in adding support for TPROXY, and proposes changes that would enable this to work with Linkerd. This was initially requested in #4713, and I know some members of the community would be glad to have it in.

The proposal leverages TPROXY and the IP_TRANSPARENT socket option to preserve the client's IP for any TCP connection. A great advantage of doing this at the firewall level is that we do not have to set any special request headers (e.g. X-FORWARDED-FOR), so the approach is agnostic to the application protocol. To make this work, there are two parts:

- Support TPROXY: instead of redirecting at the nat level, we'd have to add support for tproxy and change the firewall rules we set up with iptables.
- Spoof the client IP on the proxy side: the proxy will in most cases (more on it below) have access to the peer address of a connection. Since the proxy establishes its own connection to the application process, the peer address from the application's point of view will be localhost. The proxy would have to bind to the original peer address before connecting; in effect, this lets the application think it is talking to the client and not the proxy.

Note: TCP connections are identified through a 5-tuple: (src_ip, src_port, protocol, dst_ip, dst_port). When relying on nat at the firewall level to rewrite the destination (to the proxy's port), the source IP will in general be preserved, since it is common to only do DNAT in the PREROUTING chain. However, if two connections to the same host share the same source IP and port, their tuples can collide once the destination is rewritten, and the source may then be rewritten as well (SNAT). Theoretically, we might be able to do this without tproxy; tproxy does, however, provide some guarantees that are good to have despite the added complexity.
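To make the collision case concrete, here is a small Python sketch. It is only a toy model of the 5-tuple bookkeeping, not of what the kernel actually does; the `FiveTuple` type, the `dnat` helper and the addresses are made up for illustration.

```python
from typing import NamedTuple


class FiveTuple(NamedTuple):
    src_ip: str
    src_port: int
    protocol: str
    dst_ip: str
    dst_port: int


def dnat(conn: FiveTuple, proxy_port: int) -> FiveTuple:
    """Rewrite only the destination port, the way a redirect to the proxy would."""
    return conn._replace(dst_port=proxy_port)


def needs_source_rewrite(conns: list, proxy_port: int) -> bool:
    """True if distinct connections end up with the same tuple after DNAT."""
    rewritten = {dnat(c, proxy_port) for c in conns}
    return len(rewritten) < len(set(conns))


# Same client IP and source port, two different application ports.
a = FiveTuple("10.0.0.7", 40000, "tcp", "10.0.0.1", 3000)
b = FiveTuple("10.0.0.7", 40000, "tcp", "10.0.0.1", 8080)
print(needs_source_rewrite([a, b], proxy_port=5000))  # True
```

Once both destinations are rewritten to the proxy port, the two tuples are indistinguishable, which is the situation in which the source would have to change too.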

Update: I created a proof of concept showing how this would work in k8s. It preserves the source IP of the client, and can do so either at the nat level (with the iptables rules we currently have in place) or at the mangle level using the TPROXY target. The proof of concept is linked under "Implementation and scoped work" below.

TPROXY support

Introduction. Tproxy refers to a kernel module that adds transparent proxy support[1]. In essence, it allows traffic from a client to a server to be proxied through a local socket; as far as the client is concerned, it connected successfully to the original target. The documentation outlines the steps to make this work[1]:

1. Identify packets with a dst matching a local socket on the box, and set the mark to a certain value.
2. Leverage policy routing to make sure that any packets with this mark will be processed locally.
3. Bind a local socket with the special IP_TRANSPARENT option (this lets the socket bind to a non-local address).
4. Make use of the special TPROXY target in iptables to have traffic routed through the socket mentioned in step (3) instead of going directly to the original dst.

Essentially, this was designed to intercept traffic on a router: even if the destination is not local, we can "impersonate" it through a local socket that has the IP_TRANSPARENT option set. This works in a similar way to the REDIRECT target, which we make use of at the moment.

Before we go on, there are some netfilter concepts to revisit.

This is a short catch-up on some iptables concepts. iptables has several predefined tables; the four that matter here, in traversal order, are raw, mangle, nat, and filter. There is also a set of predefined chains that a packet goes through, and these chains are associated with the tables: the same built-in chain (e.g. `PREROUTING`) can appear in several tables, with each table holding its own rules for it. For tproxy, we are concerned primarily with the `mangle` table (which is traversed before `nat`).

Mangle, as the name implies, allows us to mangle packets (this generally means changing the IP header), but it has another advantage: the mangle table can use the `MARK` target, which was briefly mentioned in the introduction. Simply put, we can mark certain packets with an arbitrary value (note, though, that the mark is something the kernel tracks internally; it's not _ON_ the packet). Marks are generally used with policy-based routing at the firewall level[2]: we can create our own routing table with our own routes and apply this "policy" only to packets that carry the mark. This is important to know for most of what follows to make sense. Setting marks is only supported in the mangle table[2]. Lastly, the mangle table includes the same two chains we have been operating on in nat: `PREROUTING` and `OUTPUT`.

Changes required. The set-up itself isn't very complicated: we need to set up new routing rules and intercept packets with a socket that has the IP_TRANSPARENT option. Since we are only concerned with this on the inbound side, my feeling is that we can still rely on nat to proxy requests on the outbound side. Inbound, we would have to add a few rules:

- First, we need to mark any incoming packets with an arbitrary mark, e.g. 1.
- Second, we need to create routing policies. We create a new routing rule whereby packets marked with 1 are looked up in a separate routing table of our choosing. We then add a route to this table that says any destination (0.0.0.0/0) is treated as local.
- In most examples, there is an additional optimization where a new DIVERT chain is created for diverted packets. Packets that already belong to an open local socket (matched with `-m socket`) jump to this chain, where they are marked and accepted, so that only packets without a matching socket go on to the TPROXY rule.
- Finally, we route everything to our own port using TPROXY: `iptables -t mangle -A PREROUTING -p tcp -j TPROXY --tproxy-mark <0x1, our mark> --on-port <port>`.
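Put together, the inbound rules could look roughly like the output of the sketch below. It only builds and prints the commands, it does not run them. The function name is hypothetical, the mark, table and port values are the example values used in this issue, and the optional DIVERT chain is written the way the kernel tproxy documentation writes it.

```python
import shlex


def tproxy_inbound_commands(mark: int, table: int, proxy_port: int) -> list:
    """Return the ip/iptables invocations for inbound TPROXY interception."""
    m = hex(mark)
    return [
        # Policy routing: marked packets use `table`, where everything is local.
        ["ip", "rule", "add", "fwmark", str(mark), "lookup", str(table)],
        ["ip", "route", "add", "local", "0.0.0.0/0", "dev", "lo",
         "table", str(table)],
        # Optional DIVERT chain: packets that match an open socket are
        # marked and accepted before they reach the TPROXY rule.
        ["iptables", "-t", "mangle", "-N", "DIVERT"],
        ["iptables", "-t", "mangle", "-A", "PREROUTING", "-p", "tcp",
         "-m", "socket", "-j", "DIVERT"],
        ["iptables", "-t", "mangle", "-A", "DIVERT", "-j", "MARK",
         "--set-mark", m],
        ["iptables", "-t", "mangle", "-A", "DIVERT", "-j", "ACCEPT"],
        # Everything else is handed to the proxy's transparent socket.
        ["iptables", "-t", "mangle", "-A", "PREROUTING", "-p", "tcp",
         "-j", "TPROXY", "--tproxy-mark", f"{m}/{m}",
         "--on-port", str(proxy_port)],
    ]


if __name__ == "__main__":
    for argv in tproxy_inbound_commands(mark=1, table=100, proxy_port=5000):
        print(shlex.join(argv))
```

Keeping the rules as data makes it easy to review them, or to feed them to whatever the init container uses to apply rules. A real rule set would also need to exclude traffic we do not want intercepted, which is left out here.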

TL;DR. Tproxy works like nat, but with some extra steps. Its advantage is that it does not rewrite the destination address, and it is also more reliable in the face of SNAT (which may happen if two connections use the same src port and ip, for whatever reason). To use the TPROXY target in iptables, we first need to mark inbound packets and then route them using policy routing; this ensures all packets are treated as local instead of being forwarded. I'd argue this is not strictly necessary in k8s, since we will never act as a router (dst will always be local to the proxy), but it seems to be the norm. In essence, all destination addresses will be treated as local (if you're confused, welcome to the club).

By re-using the mark, we can tell iptables to send packets to our port instead of their original destination. Neat. We are missing one step though: we have to actually impersonate the original destination.

If you're confused about routing tables, read this:

These are the two commands that the tproxy docs list:

```
# Run as root. Packets carrying fwmark 1 are looked up in routing table 100.
ip rule add fwmark 1 lookup 100
# Table 100 treats every IPv4 destination as local, delivered via loopback.
ip route add local 0.0.0.0/0 dev lo table 100
```

Like I said before, a mark is just a field maintained in the kernel and associated with a specific packet. When a packet comes in, after it traverses PREROUTING, a routing decision has to be made, and the kernel consults its routing policy database. The first command says: for any packet marked as 1, look up table 100 (the table number is arbitrary). The second command adds a new route to that table. It says: the scope of this route is local, it applies to any IPv4 address, and matching packets have to be handled on the loopback interface; i.e. they _never_ leave the host.
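If it helps, the lookup described above can be modelled in a few lines of Python. This is a deliberately simplified toy, not how the kernel implements the routing policy database; `RULES`, `TABLES` and `route_type` are made-up names that mirror the two commands.

```python
import ipaddress
from typing import Optional

# ip rule add fwmark 1 lookup 100
RULES = [(1, 100)]  # (fwmark, routing table)

# ip route add local 0.0.0.0/0 dev lo table 100
TABLES = {100: [(ipaddress.ip_network("0.0.0.0/0"), "local")]}


def route_type(dst: str, fwmark: int) -> Optional[str]:
    """Return the route type chosen for `dst`, or None if no rule matched."""
    addr = ipaddress.ip_address(dst)
    for rule_mark, table in RULES:
        if rule_mark != fwmark:
            continue
        for prefix, kind in TABLES[table]:
            if addr in prefix:
                return kind
    return None  # would fall through to the main table, not modelled here


print(route_type("203.0.113.9", fwmark=1))  # local
print(route_type("203.0.113.9", fwmark=0))  # None
```

The same non-local destination is delivered locally when the packet carries the mark, and is left to the normal routing tables when it does not.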

`IP_TRANSPARENT`

Introduction. To successfully route traffic using the tproxy target, we also need to set the IP_TRANSPARENT option on the socket. The documentation for it is quite confusing[3]; in short, it allows us to bind to a non-local IP address. As we will see, this goes two ways: inbound and outbound.

Inbound. We need to set the option on our server socket (not the app's, the proxy's server) so that we may receive traffic routed through tproxy. The local_addr on the accepted socket will be the application's IP:PORT! For example, say we have our server bound on 0.0.0.0:5000 and our app on 0.0.0.0:3000. We set tproxy to redirect all packets to 5000. When a client sends a request to 10.0.0.1:3000, our proxy will receive it. As far as all parties are concerned, the local address of the connection is 10.0.0.1:3000, even though the socket that accepted it is in fact listening on port 5000.
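A minimal sketch of the inbound side in Python, assuming Linux. The function names are hypothetical, and the numeric fallback for `IP_TRANSPARENT` (19, from `<linux/in.h>`) is only used if the `socket` module does not export the constant. Setting the option needs elevated privileges, so the `transparent` flag exists purely to let the rest be exercised without them.

```python
import socket

# Linux value from <linux/in.h>; used only if the socket module lacks it.
IP_TRANSPARENT = getattr(socket, "IP_TRANSPARENT", 19)


def open_transparent_listener(port: int, transparent: bool = True) -> socket.socket:
    """Listen on `port`; with IP_TRANSPARENT set, TPROXY can hand us traffic."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        if transparent:
            # Fails with PermissionError without the required capability.
            sock.setsockopt(socket.IPPROTO_IP, IP_TRANSPARENT, 1)
        sock.bind(("0.0.0.0", port))
        sock.listen()
    except OSError:
        sock.close()
        raise
    return sock


def original_destination(conn: socket.socket) -> tuple:
    """Local address of an accepted connection.

    Under TPROXY the packet is not rewritten, so this is the address the
    client dialled (10.0.0.1:3000 in the example), not the listener's port.
    """
    return conn.getsockname()
```

With the example above, a listener opened on port 5000 would accept the client's connection and `original_destination` would report port 3000.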

Outbound. When we open a connection from the proxy to the application, we need to use the IP_TRANSPARENT option again; the use case here is perhaps a bit more subtle. We have access to the peer address from our server socket, so we can re-use that and bind before connecting. The peer address is not local, but IP_TRANSPARENT will let us bind to it anyway.
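Bind-before-connect is short enough to sketch directly. This is an illustration in Python rather than the poc's code; `connect_as_client` is a hypothetical name, and the `transparent` flag is there so the same function can be tried with a local address and no special privileges.

```python
import socket

# Linux value from <linux/in.h>; used only if the socket module lacks it.
IP_TRANSPARENT = getattr(socket, "IP_TRANSPARENT", 19)


def connect_as_client(client_addr, app_addr, transparent=True) -> socket.socket:
    """Connect to the application with `client_addr` as our source address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        if transparent:
            # Lets bind() succeed even though client_addr is not a local address.
            sock.setsockopt(socket.IPPROTO_IP, IP_TRANSPARENT, 1)
        sock.bind(client_addr)   # the peer address taken from the inbound socket
        sock.connect(app_addr)   # the application now sees client_addr as its peer
    except OSError:
        sock.close()
        raise
    return sock
```

On the application side, the peer address of the accepted connection is whatever was passed as `client_addr`, which is the whole point of the exercise.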

The actual complication comes from what the application perceives as being the client. It thinks it is talking to the client, so any reply packets will be sent to the client. We need some additional rules in place to route replies back through our proxy. This is where the CONNMARK target comes into play. When we build the socket, aside from the transparent option, we also set the same mark on it as on all other packets. We then add a rule to each chain: PREROUTING and OUTPUT. On the PREROUTING side, we add a CONNMARK rule that saves the packet mark as a connection mark. On the OUTPUT side, we restore the connection mark back into a packet mark so policy routing can be applied.
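A hedged sketch of what those two pieces might look like, again in Python and again only building the commands instead of running them. The function names are hypothetical; the rules use the `mark` and `connmark` matches with `CONNMARK --save-mark` and `--restore-mark`, and the fallback value for `SO_MARK` (36) is the common Linux one.

```python
import socket

# Common Linux value; used only if the socket module does not export SO_MARK.
SO_MARK = getattr(socket, "SO_MARK", 36)


def reply_routing_commands(mark: int) -> list:
    """Rules that carry the mark from marked packets over to their replies."""
    m = hex(mark)
    return [
        # PREROUTING: remember the packet mark on the connection.
        ["iptables", "-t", "mangle", "-A", "PREROUTING",
         "-m", "mark", "--mark", m, "-j", "CONNMARK", "--save-mark"],
        # OUTPUT: copy the connection mark back onto locally generated
        # packets so the fwmark routing rule applies to them.
        ["iptables", "-t", "mangle", "-A", "OUTPUT",
         "-m", "connmark", "--mark", m, "-j", "CONNMARK", "--restore-mark"],
    ]


def mark_socket(sock: socket.socket, mark: int) -> None:
    """Set the firewall mark on the proxy's outbound socket (needs privileges)."""
    sock.setsockopt(socket.SOL_SOCKET, SO_MARK, mark)
```

These rules rely on the `ip rule` and `ip route` entries from the routing table note being in place, since the restored mark is what sends the application's replies back to the local proxy.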

Implementation and scoped work

I wanted to give a thorough introduction on how everything works conceptually. To illustrate all of this in practice, I created a proof of concept project: https://github.com/mateiidavid/linkerd-tproxy-poc.

In the poc, we have a server and a "proxy". The proxy sets up iptables rules, intercepts traffic from a client (nc or curl), and then sends requests to the server using a spoofed address. This should show how things come together. The proxy talks to the server over localhost for simplicity.

For Linkerd:

- We would need to send requests to the original address of the server, as opposed to localhost as done in the poc. I anticipate this would complicate our iptables rules slightly, but we can take advantage of connection tracking. The connection tracking mechanism records a connection as established as soon as it has seen packets in both directions, i.e. after the SYN and SYN-ACK, without waiting for the final ACK. It is also updated before the PREROUTING and OUTPUT chains are traversed, so we can most likely use it to avoid routing established connections through TPROXY (otherwise we'd loop endlessly).
- We would need to update the init container, support different modes of running (ideally this would be feature flagged for a while), and also test whether the CNI plugin would continue to function. I don't anticipate any breakage on the CNI side here. Aside from that, we'd have to update the helm charts.
- The proxy will need to be aware of what mode to run in. If it runs in tproxy mode, we can avoid querying SO_ORIGINAL_DST on the socket (we can get the address directly from the socket since the packet is not rewritten by nat); this is not compulsory, and we can still keep the logic in. We'd have to set different socket options though, most notably IP_TRANSPARENT. In the poc I have also used freebind and reuseaddr. This is a good article on the topic. Much of this set-up is actually inspired by mmproxy.
- We will need to tread carefully around images. We moved to minimal set-ups in the init container; we'll need to package a few more utilities, such as iproute2, to configure policy routing.

The surface area of the change is not excessively large; I think we can get away with only adding inbound rules (and some additional output rules for connmarking) on the iptables side. We can feature flag this in the helm charts with `--set initContainer.tproxy=true` or something similar. Setting this should be reflected in the proxy template partial; my idea is to have an environment variable that lets the proxy know it has to set up the connections differently. The code will make use of this env variable to add/remove socket options.
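As a rough illustration of the env variable idea, sketched in Python rather than in the proxy's own code: the variable name `PROXY_INTERCEPT_MODE` and the function are hypothetical, and the tproxy option list simply mirrors what the poc sets.

```python
import os
from typing import Mapping


def inbound_socket_options(env: Mapping = os.environ) -> list:
    """Pick the extra socket options from a (hypothetical) mode variable."""
    mode = env.get("PROXY_INTERCEPT_MODE", "nat").lower()
    if mode == "tproxy":
        # What the poc sets: transparent, plus freebind and reuseaddr.
        return ["IP_TRANSPARENT", "IP_FREEBIND", "SO_REUSEADDR"]
    if mode == "nat":
        # Current behaviour: no extra options.
        return []
    raise ValueError(f"unknown interception mode: {mode!r}")


print(inbound_socket_options({"PROXY_INTERCEPT_MODE": "tproxy"}))
```

Defaulting to the nat mode when the variable is unset keeps existing installs unchanged unless the flag is turned on.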

I'll continue updating this as discussions go on; the next step for me would be to add a checklist to this issue that will track all necessary work. I'd like to first socialize the idea and see what other people think about the proposal. I'll also update this post with more explanations if needed.

References

1. tproxy docs
2. Mark target
3. man ip

Further reading:

- Tproxy proof of concept in C
- Tproxy proof of concept in Rust, made with Linkerd in mind
- Cloudflare blog on tproxy
- mmproxy: preserving src IP

mateiidavid commented 2 years ago

As an update: when I worked on the proof of concept, I was curious whether we could preserve the source IP of the request by doing all packet routing at the NAT level, instead of relying on mangling headers and using the TPROXY module. Doing everything at NAT still works, provided the proxy still marks the packets and sets IP_TRANSPARENT on the socket.

The benefits of doing this at the NAT level:

- Fewer changes for proxy-init: we will only need to route marked packets to make sure they stay local (in other words, we need to add 2 rules to make sure response packets from proxy <--> local process will stay local instead of going to the source address).
- Packets only traverse the nat table once, when a connection is first established (only the first packet traverses it; the rest of the packets in the connection are modified in the same way).

The main disadvantage is that source nat can happen when we do destination nat, if and only if two connections to the same target end up using the same source IP and source port in the 5-tuple that identifies the TCP connection. This seems pretty improbable to me though, and in this case everything would still work; we just wouldn't have the source IP preserved. We can surface this case in the proxy through log messages.

The main disadvantages of relying on TPROXY and mangle:

- More changes to the init container, including the need to support different interception modes (nat & tproxy).
- We need to treat all packets as being local, which means the iptables logic will get slightly more complicated.
- Packets will always need to be checked in netfilter, since mangle is traversed for every packet, as opposed to just the first one in the connection. I'm not sure if there will be a performance impact here, but it's worth noting.

Would love some opinions on this if you have the time.

mateiidavid commented 2 years ago

Update:

We had some discussions off GitHub about this recently. The team decided not to move forward with implementing this; while the research proves this can be done, it comes at the cost of a more permissive model for the proxy.

Put plainly, in order to set the IP_TRANSPARENT option on the socket (proxy <> app), we'd have to allow the proxy to run with the NET_ADMIN capability. We don't particularly feel this is a good model, but it is unfortunately required by the kernel. Without IP_TRANSPARENT, we can't "spoof" the client IP on the proxy side by binding to the original address before connecting to the application.

This solution might work in other use cases; it would make sense to have it on ingress traffic, but as it stands, it's not feasible to have for inter-cluster communication.

I'll be closing the issue; the research is there if circumstances change in the future.
