Closed mateiidavid closed 2 years ago
As an update: when I worked on the proof of concept, I was curious whether we could preserve the source IP of the request by doing all packet routing at the NAT level, instead of relying on mangling headers and using the `TPROXY` module. Doing everything at NAT still works, provided the proxy still marks the packets and sets `IP_TRANSPARENT` on the socket.
The benefits of doing this at the NAT level:

- `proxy-init`: we will only need to route marked packets to make sure they stay local (in other words, we need to add 2 rules to make sure response packets from proxy <--> local process will stay local instead of going to the source address).

The main disadvantage is that source NAT can happen when we do destination NAT if and only if two connections to the same target end up using the same source IP and source port in the 5-tuple that identifies the TCP connection. This seems pretty improbable to me, though, and in this case everything would still work; we just wouldn't have the source IP preserved. We can handle this case in the proxy through log messages.
The main disadvantages of relying on `TPROXY` and mangle:
Would love some opinions on this if you have the time.
Update:
We had some discussions off GitHub about this recently. The team decided not to move forward with implementing this; while the research proves this can be done, it comes at the cost of a more permissive model for the proxy.
Put plainly, in order to set the `IP_TRANSPARENT` option on the socket (proxy <> app), we'd have to allow the proxy to run with `NET_ADMIN` capabilities. We don't particularly feel this is a good model, and it's unfortunately required by the kernel. Without `IP_TRANSPARENT`, we can't "spoof" the client IP on the proxy side by binding to the original address before connecting to the application.
This solution might work in other use cases; it would make sense to have it on ingress traffic, but as it stands, it's not feasible to have for inter-cluster communication.
I'll be closing the issue; the research is there if circumstances change in the future.
This issue outlines the concepts involved in adding support for `TPROXY`, and proposes changes that would enable this to work with Linkerd. This was initially requested in #4713, and I know some members of the community would be glad to have it in.

The proposal leverages `TPROXY` and the socket option `IP_TRANSPARENT` to preserve the IP of a client for any TCP connections. A great advantage of doing this at the firewall level is that we do not have to set any special request headers (e.g. `X-FORWARDED-FOR`); this would be application protocol agnostic. To make this work, there are two parts:

`iptables`.

Note: TCP connections are identified through a 5-tuple: `(src_ip, src_port, protocol, dst_ip, dst_port)`. When relying on nat at a firewall level to re-write the destination (to the proxy's port), in general the source IP will be preserved. In the `PREROUTING` chain, it is common to only do DNAT. However, if two connections to the same host share the same IP and port, it is possible to have SNAT done in the prerouting step. Theoretically, we might be able to do this without tproxy; tproxy does, however, introduce some guarantees that are good to have despite the complexity.

Update: I created a proof of concept with how this would work in k8s. The proof of concept will preserve the src IP of the client; it can do so at a `nat` level (with the current ipt rules we have in place) or at a `mangle` level using the tproxy target. You can find the proof of concept here.

TPROXY support
Introduction. Tproxy refers to a module that adds transparent proxy support to the kernel[1]. In essence, it allows us to proxy traffic from a client to a server through a local socket; as far as the client is concerned, it connected successfully to the original target. The documentation outlines the steps to make this work[1]:

- `IP_TRANSPARENT` option (this lets the socket bind to a non-local address).
- `TPROXY` target in iptables to have traffic routed through the socket mentioned in step (3) instead of going directly to the original dst.

Essentially, this was designed to intercept traffic on a router: even if the destination is not local, we can "impersonate it" through a local socket that has the `IP_TRANSPARENT` option set. This works in a similar way to the `REDIRECT` target, which we make use of at the moment.

Before we go on, there are some netfilter concepts to revisit.
This is a short catch-up on some iptables concepts. The four tables that matter here, traversed in order, are raw, mangle, nat, and filter. There is also a set of predefined chains that a packet will go through; these chains are associated with the tables. The same built-in chain can appear in multiple tables (for example, `PREROUTING` exists in both `mangle` and `nat`), but not every table contains every chain. For tproxy, we are concerned primarily with the `mangle` table (which will be traversed before `nat`). Mangle, as the name implies, allows us to mangle packets -- this generally means changing the IP header -- but it has another advantage. The mangle table can use the `MARK` target, which was briefly mentioned in the introduction. Simply put, we can mark certain packets with an arbitrary value (note, though, that the mark is something the kernel tracks internally; it's not _ON_ the packet). Marks are generally used with policy-based routing at a firewall level[2]. We can essentially create our own routes and our own routing table and apply this "policy" only to certain packets that are marked. This is important to know for most of this to make sense. Setting marks is only supported in the mangle table[2]. Lastly, the mangle table has the same two chains we have been operating on using nat: `PREROUTING` and `OUTPUT`.

Changes required. The set-up itself isn't very complicated: we need to set up new routing rules and intercept packets with a socket that has the `IP_TRANSPARENT` option. Since we are only concerned with this on the inbound side, my feeling is that we can still rely on nat to proxy requests through the outbound side. Inbound, we would have to add a few rules:

- `1`.
- `0.0.0.0/0`) will be treated as local.
- `iptables -t mangle -A PREROUTING -p tcp -j TPROXY --tproxy-mark <0x1, our mark> --on-port <port>`.

TL;DR. Tproxy works like nat, but with some extra steps. Its advantage is that it does not re-write the destination address, and it is also more reliable in the face of SNAT (which may happen if 2 connections use the same src port and ip, for whatever reason). To use the `TPROXY` target in iptables, we first need to mark inbound packets and then route them using policy routing -- this will ensure all packets are treated as local, instead of being forwarded. I'd argue this is not strictly necessary in k8s, since we will never act as a router (dst will always be local to proxy), but it seems to be the norm. In essence, all destination addresses will be treated as local (if you're confused, welcome to the club).

By re-using the mark, we can tell iptables to send packets to our port instead of their original destination. Neat. We are missing one step though: we have to actually impersonate the original destination.
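The inbound set-up described above can be sketched as a small shell script. This is a minimal sketch, not the actual `proxy-init` change: the mark `1` and routing table `100` follow the tproxy docs, while the port `4143`, the `DRY_RUN` switch, and the `run` and `setup_inbound_tproxy` helpers are assumptions made for illustration. Because `--tproxy-mark` sets the mark on matching packets, no separate `MARK` rule is shown. By default the script only prints the commands; executing them for real requires root (or `NET_ADMIN`).

```shell
#!/bin/sh
# Sketch only: mark 1 and table 100 follow the tproxy docs; the port and
# the DRY_RUN switch are assumptions made for illustration.
MARK="${MARK:-1}"
TABLE="${TABLE:-100}"
PROXY_PORT="${PROXY_PORT:-4143}"   # assumed inbound proxy port
DRY_RUN="${DRY_RUN:-1}"            # 1 = print commands, 0 = execute (needs root)

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

setup_inbound_tproxy() {
    # Policy routing: packets carrying our mark consult table $TABLE ...
    run ip rule add fwmark "$MARK" lookup "$TABLE"
    # ... where every IPv4 destination is treated as local (delivered via lo).
    run ip route add local 0.0.0.0/0 dev lo table "$TABLE"
    # Hand inbound TCP to the proxy's socket without rewriting the destination;
    # --tproxy-mark also sets the mark that the ip rule above matches on.
    run iptables -t mangle -A PREROUTING -p tcp -j TPROXY \
        --tproxy-mark "$MARK" --on-port "$PROXY_PORT"
}

setup_inbound_tproxy
```

Running it with `DRY_RUN=0` would apply the rules; keeping the default makes it safe to inspect what would be installed first.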
If you're confused about routing tables, click me.
These are the two commands that the tproxy docs list:

```
# ip rule add fwmark 1 lookup 100
# ip route add local 0.0.0.0/0 dev lo table 100
```

Like I said before, a mark is just a field maintained in the kernel and associated with a specific packet. When a packet comes in, after it traverses PREROUTING, a routing decision has to be made. The kernel consults its routing policy database. The first command says: for any packet marked as 1, look up table 100 (it can be any name). The second command adds a new route to that table; it says: the scope of this route is local, it applies to any IPv4 address, and a matching packet has to be handled on the loopback interface, i.e. it _never_ leaves the host.

IP_TRANSPARENT
Introduction. To successfully route traffic using the tproxy target, we also need to set the `IP_TRANSPARENT` option on the socket. The documentation for it is quite confusing[3]; in short, it allows us to bind to a non-local IP address. As we will see, this goes two ways: inbound and outbound.

Inbound. We need to set the option on our server socket (not the app, the proxy's server) so that we may receive traffic routed through tproxy. The `local_addr` on this socket will be the application's IP:PORT! For example, say we have our server bound on `0.0.0.0:5000` and our app on `0.0.0.0:3000`. We set tproxy to redirect all packets to 5000. When a client sends a request to `10.0.0.1:3000`, our proxy will receive it. As far as all parties are concerned, however, they think the local address is `10.0.0.1:3000` but it is in fact `10.0.0.1:5000`.

Outbound. When we open a connection from the proxy to the application, we need to again use this `IP_TRANSPARENT` option; the use case here is a bit more subtle perhaps. We have access to the peer address from our server socket; we can re-use that and bind before connect. The peer address is not local, but `IP_TRANSPARENT` will let us bind to it anyway.

The actual complication comes from what the application perceives as being the client. It thinks it talks to the client, so any reply packets will be sent to the client. We need some additional rules in place to route replies back through our proxy. This is where the `CONNMARK` target comes into play. When we build the socket, aside from the transparent option, we also set the same mark on it as all other packets. We then add a rule to each chain: PREROUTING and OUTPUT. On the PREROUTING side, we add a CONNMARK rule whereby packets turn their mark into a connection mark. On the OUTPUT side, we restore the connection mark back into a packet mark so policy routing can be applied.

Implementation and scoped work
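The two `CONNMARK` rules described above could be sketched as follows. This is a minimal sketch modeled on the set-up that mmproxy documents, not a tested Linkerd configuration: the mark value `1`, the `DRY_RUN` switch, and the `run` and `setup_connmark_rules` helpers are assumptions made for illustration, and the proxy is assumed to set the same mark on its own socket (for example through `SO_MARK`). By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch only: the mark value and the DRY_RUN switch are assumptions.
MARK="${MARK:-1}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands, 0 = execute (needs root)

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

setup_connmark_rules() {
    # PREROUTING: packets that carry our mark save it as a connection mark,
    # so the kernel remembers which connections belong to the proxy.
    run iptables -t mangle -A PREROUTING -m mark --mark "$MARK" \
        -j CONNMARK --save-mark
    # OUTPUT: replies on those connections get the packet mark restored,
    # so the policy routing rule (fwmark lookup) applies to them again.
    run iptables -t mangle -A OUTPUT -m connmark --mark "$MARK" \
        -j CONNMARK --restore-mark
}

setup_connmark_rules
```

These rules only make sense together with the policy routing entries for the same mark; without them, restoring the mark on reply packets has no effect on where they are routed.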
I wanted to give a thorough introduction on how everything works conceptually. To illustrate all of this in practice, I created a proof of concept project: https://github.com/mateiidavid/linkerd-tproxy-poc.
In the poc, we have a server and a "proxy". The proxy will set up iptables rules and intercept traffic from a client (`nc` or `curl`) and then send requests to the server using a spoofed address. This should show how things come together. The proxy talks to the server over localhost for simplicity.

For Linkerd:

- `SO_ORIGINAL_DST` on the socket (we can get the address directly from the socket since the packet is not re-written by nat); not compulsory, we can still keep the logic in. We'd have to set different socket options though, most notably `IP_TRANSPARENT`. In the poc I have also used `freebind` and `reuseaddr`. This is a good article on the topic. Much of this set-up is actually inspired by mmproxy.
- `iproute2` to configure policy routing.

The surface area of the change is not excessively large; I think we can get away with only adding inbound rules (and some additional output rules for connmarking) on the iptables side. We can feature flag this in the helm charts with `--set initContainer.tproxy=true` or something similar. Setting this should be reflected in the proxy template partial; my idea is to have an environment variable that lets the proxy know it has to set up the connections differently. The code will make use of this env variable to add/remove socket options.

I'll continue updating this as discussions go on; the next step for me would be to add a checklist to this issue that will track all necessary work. I'd like to first socialize the idea and see what other people think about the proposal. I'll also update this post with more explanations if needed.
References
1: tproxy docs
2: Mark target
3: man ip

Further reading:

- Tproxy proof of concept in C
- Tproxy proof of concept in Rust, made with Linkerd in mind
- Cloudflare blog on tproxy
- mmproxy: preserving src IP