h2o / h2o

H2O - the optimized HTTP/1, HTTP/2, HTTP/3 server
https://h2o.examp1e.net
MIT License

RFC: Transparent Reverse Proxy #2416

Open cwyang opened 4 years ago

cwyang commented 4 years ago

Hi, all.

I'm working on the following issue, and a PR is not far away. I'd be happy to get any comments from you.

Best regards, Chul-Woong

--

tproxy: Transparent Reverse Proxy

A transparent proxy accepts non-local connections and serves HTTP requests from the corresponding upstream server.

This patch brings a transparent proxy feature that keeps the source IP address untouched on upstream connections.
Using the source IP address to identify users for auditing and analysis (for business and security)
is preferable to relying on X-Forwarded-For or the PROXY protocol when an operator does not own
the implementation of the participating appliances and services, which I believe is true for most enterprise use cases.

The second feature of this patch is proxying connections to multiple services.
That is, H2O can not only reverse-proxy to predetermined servers, but also proxy all
incoming sessions with the proposed `tproxy` target.
H2O can sit in the middle of a network and perform protocol conversion or header modification.
Content caching is a future extension we can think of: say, Squid speaking HTTP/2.

(1) Configuration
listen:tproxy:    H2O listens on a socket with IP_TRANSPARENT set and accepts non-local connections. (def:OFF)
proxy.tproxy:     The source IP address of proxy connections is spoofed to the client IP address.  (def:OFF)
proxy.reverse.url `tproxy` target:
                  The original destination IP{/port} is used for proxy connections. For example:
                  proxy.reverse.url: "https://tproxy/" connects to the original destination over https.
                  proxy.reverse.url: "http://tproxy:80/" connects to the original destination IP address on port 80 over http.

Use `proxy.preserve-host: ON` with the `tproxy` target.

IP_TRANSPARENT needs CAP_NET_ADMIN capability to work.
Run as root or grant the capability with `setcap cap_net_admin+ep h2o`.
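
For readers unfamiliar with IP_TRANSPARENT, the listening side can be sketched roughly as follows. This is a minimal, Linux-only illustration and not the code from the patch; the helper name `open_transparent_listener` is made up for this example. Without CAP_NET_ADMIN the `setsockopt` call is expected to fail with EPERM.

```c
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef IP_TRANSPARENT
#define IP_TRANSPARENT 19 /* Linux value, for libcs that do not expose it */
#endif

/* Hypothetical helper, not an H2O API: open a TCP listener that may accept
 * connections addressed to non-local IPs. Returns the fd, or -1 with errno
 * set (typically EPERM when the process lacks CAP_NET_ADMIN). */
int open_transparent_listener(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd == -1)
        return -1;

    int on = 1;
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) != 0 ||
        /* the option that corresponds to `tproxy: ON` in this proposal */
        setsockopt(fd, IPPROTO_IP, IP_TRANSPARENT, &on, sizeof(on)) != 0 ||
        bind(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0 ||
        listen(fd, SOMAXCONN) != 0) {
        int saved_errno = errno;
        close(fd);
        errno = saved_errno;
        return -1;
    }
    return fd;
}
```

The socket option alone is not enough: traffic only reaches such a listener once TPROXY rules like the ones in (5) are installed.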

(2) Connection Pool / Socket Pool Management. (`lib/handler/proxy/tproxy.c`)
We keep per-client connpools and sockpools.
We use h2o_cache to store the connpools and sockpools for each session (client IP, dst IP/port).
The connpool/sockpool semantics are left untouched. Connpools are kept per thread, while
sockpools are shared between threads, so the httpclient code is mostly unchanged.

Unused connpools are checked for removal on cache access (`purge()@cache.c`).
Connpools in use are given a fresh lifetime, set with `proxy.pool_duration`,
because the cache is built with `H2O_CACHE_FLAG_AGE_UPDATE` set in cache->flags.

Note that the tproxy socketpool_target is a shallow copy of the parent socketpool target.
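
To illustrate the lifetime rule described above (purge on access, refresh on use), here is a small standalone sketch. It deliberately does not use the h2o_cache API; every name in it (`pool_cache`, `session_key`, and so on) is hypothetical, the key is IPv4-only, and a real implementation would also release the pool when an entry is purged.

```c
#include <stddef.h>
#include <stdint.h>

#define POOL_SLOTS 64

/* Session key as described above: client IP plus original destination IP/port. */
struct session_key {
    uint32_t client_ip;
    uint32_t dst_ip;
    uint16_t dst_port;
};

struct pool_entry {
    int in_use;
    struct session_key key;
    uint64_t expires_at; /* milliseconds */
    void *pool;          /* stand-in for a connpool or sockpool */
};

struct pool_cache {
    uint64_t duration; /* plays the role of proxy.pool_duration */
    struct pool_entry slots[POOL_SLOTS];
};

static int key_equal(const struct session_key *a, const struct session_key *b)
{
    return a->client_ip == b->client_ip && a->dst_ip == b->dst_ip && a->dst_port == b->dst_port;
}

/* Drop entries whose lifetime has elapsed; runs on every cache access. */
static void pool_cache_purge(struct pool_cache *c, uint64_t now)
{
    for (size_t i = 0; i < POOL_SLOTS; ++i)
        if (c->slots[i].in_use && c->slots[i].expires_at <= now)
            c->slots[i].in_use = 0;
}

/* Look up a pool; a hit refreshes the entry's lifetime ("age update"). */
void *pool_cache_fetch(struct pool_cache *c, const struct session_key *key, uint64_t now)
{
    pool_cache_purge(c, now);
    for (size_t i = 0; i < POOL_SLOTS; ++i) {
        struct pool_entry *e = &c->slots[i];
        if (e->in_use && key_equal(&e->key, key)) {
            e->expires_at = now + c->duration;
            return e->pool;
        }
    }
    return NULL;
}

/* Store a pool after a failed fetch; returns 0, or -1 when the table is full. */
int pool_cache_store(struct pool_cache *c, const struct session_key *key, void *pool, uint64_t now)
{
    pool_cache_purge(c, now);
    for (size_t i = 0; i < POOL_SLOTS; ++i) {
        struct pool_entry *e = &c->slots[i];
        if (!e->in_use) {
            e->in_use = 1;
            e->key = *key;
            e->expires_at = now + c->duration;
            e->pool = pool;
            return 0;
        }
    }
    return -1;
}
```

With this rule, a busy client keeps reusing the same pool indefinitely, while the pool of an idle client disappears one duration after its last use.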

(3) A note on `tproxy` target location
Since `proxy.reverse.url` is a path-level configuration (pathconf), we put the `tproxy` target under hosts:paths.
For transparent proxy operation, the proxy does not need to take the received `Host` header into account.
We just connect to the original destination and pass the received `Host` through with `proxy.preserve-host: ON`.

(4) Sample Configuration

hosts:
  "master":
    listen:
      port: 8081
      tproxy: ON
      ssl:
        certificate-file: etc/wildcard-cert.pem
        key-file: etc/wildcard-cert.pem
    paths:
      /:
        proxy.reverse.url: "https://tproxy/"         # HTTPS - HTTPS tproxying
        proxy.tproxy: ON
        proxy.ssl.verify-peer: OFF
        proxy.preserve-host: ON

(5) Sample iptables configuration to redirect port 443 traffic to port 8081

root $ iptables -t mangle -L
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
DIVERT     tcp  --  anywhere             anywhere             socket
TPROXY     tcp  --  anywhere             anywhere             tcp dpt:https TPROXY redirect 0.0.0.0:8081 mark 0x1/0x1

...

Chain DIVERT (1 references)
target     prot opt source               destination
MARK       all  --  anywhere             anywhere             MARK set 0x1
ACCEPT     all  --  anywhere             anywhere

kazuho commented 4 years ago

This is an interesting topic. Thank you for opening the issue.

Before going into the details, I would like to first understand what the scope and the intended user of the proposal are.

IIUC, IP transparency is a transport concept, meaning that it would be a hop-by-hop concept in HTTP. That means that h2o can be transparent to either the backend server, the client, or both.

I can see your argument that some legacy HTTP servers, deployed as backend servers, might want to see the client's address. If that is the intended use case, I am not against having such a capability, assuming that the changes would be small and easy to maintain. Though honestly speaking, I tend to think that it would be easier to replace those outdated backend servers (or fix the outdated application logic running there).

OTOH, I'm not sure there is an argument for why h2o should be transparent to the client. Unlike squid, h2o is not a forward proxy. It does not have the capability of sending requests to arbitrary servers around the internet. Besides, I am not sure we would be interested in implementing an HTTP-level forward proxy, as the primary use case of such middleboxes is to invade the privacy of users. We support the industry-wide effort to make security an end-to-end feature; we have argued against something like ELTS.

If the intent is the latter, I would appreciate it if you could provide a compelling use case that does not have a negative impact on privacy.

cwyang commented 4 years ago

I'm glad you are interested in this topic.

If we give H2O transport-layer (L4) transparency, it is either transparent to both ends or to neither. It's hard to run one-sided transparency. When a client connects to H2O, he (or she) knows he is connecting to H2O, and the upstream server knows H2O is connecting to it. The upstream server finds out who the original client is from X-Forwarded-For or other L7 information. When a client connects to a server and H2O intercepts the connection transparently, he does not know that H2O took his connection at L4, and the server does not know whether H2O is involved. Of course, the client can detect that a MITM happened through the TLS certificate mechanism at L7. That's what TLS is made for!

The main use case proposed is to preserve client IP addresses across the end-to-end session and let devices use that information. Security appliances in particular, such as IPS/IDS, firewalls, and many others, do not deal with L7 information. For example, a security team wants its IPS to block malware traffic using a threat intelligence DB, and for performance that is mostly done by matching transport-level IP addresses.

To summarize, my intention is the former. When H2O does reverse proxying transparently, one can extend it with more useful features, like caching or edge computing, which is a hot topic in 5G/6G.

kazuho commented 4 years ago

> If we give H2O transport-layer (L4) transparency, it is either transparent to both ends or to neither. It's hard to run one-sided transparency.

I'm not sure if that's correct.

Assuming that I understand correctly, you can create the client side of a connection (one that goes from h2o to the backend server) by following step 1 of https://www.kernel.org/doc/Documentation/networking/tproxy.txt and then calling connect(2). Separately, h2o can accept a connection (one that originates from the client) by following steps 1 and 2, then calling accept.
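
The backend-facing half of that description might look like the sketch below: set IP_TRANSPARENT, bind to the client's address, and leave connect(2) to the caller. It is Linux-only and illustrative; `open_spoofed_socket` is a hypothetical name, not an H2O function, and the policy routing from step 1 of the kernel document is assumed to be in place so that the backend's replies come back to this host.

```c
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef IP_TRANSPARENT
#define IP_TRANSPARENT 19 /* Linux value, for libcs that do not expose it */
#endif

/* Hypothetical helper, not an H2O API: returns a TCP socket bound to the
 * (possibly non-local) client address, or -1 with errno set. Passing
 * sin_port = 0 keeps the client's IP but lets the kernel pick the port. */
int open_spoofed_socket(const struct sockaddr_in *client)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd == -1)
        return -1;

    int on = 1;
    /* IP_TRANSPARENT lets bind() accept an address this host does not own */
    if (setsockopt(fd, IPPROTO_IP, IP_TRANSPARENT, &on, sizeof(on)) != 0 ||
        bind(fd, (const struct sockaddr *)client, sizeof(*client)) != 0) {
        int saved_errno = errno;
        close(fd);
        errno = saved_errno;
        return -1;
    }
    return fd; /* the caller then calls connect(2) with the backend address */
}
```

Nothing here depends on how the client connection was accepted, which is why the two directions can be offered as separate options.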

> The main use case proposed is to preserve client IP addresses across the end-to-end session and let devices use that information. Security appliances in particular, such as IPS/IDS, firewalls, and many others, do not deal with L7 information. For example, a security team wants its IPS to block malware traffic using a threat intelligence DB, and for performance that is mostly done by matching transport-level IP addresses.

I do not follow this argument. As we are talking about reverse proxying, it is my understanding that the IPS/IDS and firewall that you are describing will be used for blocking traffic from malicious clients. As H2O is deployed as a reverse proxy and terminates the TLS connection, the administrator would intentionally set up such a deployment and create a mapping in h2o.conf that ties the certificate to the IP address of the backend server. Then, the question is why the DNS entry of that application can't point to the address of H2O, with H2O being the router. There is no benefit in exposing the IP address of the backend server (that you are trying to protect) to the public.

cwyang commented 4 years ago

> > If we give H2O transport-layer (L4) transparency, it is either transparent to both ends or to neither. It's hard to run one-sided transparency.

> I'm not sure if that's correct.

> Assuming that I understand correctly, you can create the client side of a connection (one that goes from h2o to the backend server) by following step 1 of https://www.kernel.org/doc/Documentation/networking/tproxy.txt and then calling connect(2). Separately, h2o can accept a connection (one that originates from the client) by following steps 1 and 2, then calling accept.

Your words made me realize that I had never thought about that combination. The benefit of transparent deployment is "just plug and play; you need not change anything", so I took two-way transparency for granted. And that is my main use case for this patch.

However, as you said, transparency to only the "server" seems feasible using source IP spoofing alone. Proxying needs 3 parts to operate "fully" transparently: (1) listening on a non-local socket, (2) source IP address spoofing, and (3) using the original destination IP address. Each part can be independent of the others. In my implementation, I've done (1) and (2, 3) as separate features. But you are saying that providing just (2) can be a feature. I hope I'm following you correctly. If so, I can keep all the features separate, and operators can choose and combine them as they need.
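
Part (3) is the cheapest of the three to illustrate. For a TCP connection accepted from an IP_TRANSPARENT listener behind a TPROXY rule, the socket's local address is the address the client dialed, so plain getsockname(2) yields the original destination. A minimal sketch, with the hypothetical helper name `get_original_dst` and IPv4 only:

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper, not an H2O API: fetch the address the client dialed.
 * On a TPROXY-intercepted connection this is the original destination; on an
 * ordinary connection it is simply the listener's own address.
 * Returns 0 on success, -1 on error or for a non-IPv4 socket. */
int get_original_dst(int accepted_fd, struct sockaddr_in *dst)
{
    socklen_t len = sizeof(*dst);
    if (getsockname(accepted_fd, (struct sockaddr *)dst, &len) != 0)
        return -1;
    return dst->sin_family == AF_INET ? 0 : -1;
}
```

Note that this applies to the TPROXY target shown in (5); setups based on the REDIRECT target rewrite the destination and have to query SO_ORIGINAL_DST instead.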

You can see WIP patch here: https://github.com/cwyang/h2o/commit/1be6097e3ccaefd3a5d98eb556579f2f9510e94e

> > The main use case proposed is to preserve client IP addresses across the end-to-end session and let devices use that information. Security appliances in particular, such as IPS/IDS, firewalls, and many others, do not deal with L7 information. For example, a security team wants its IPS to block malware traffic using a threat intelligence DB, and for performance that is mostly done by matching transport-level IP addresses.

> I do not follow this argument. As we are talking about reverse proxying, it is my understanding that the IPS/IDS and firewall that you are describing will be used for blocking traffic from malicious clients. As H2O is deployed as a reverse proxy and terminates the TLS connection, the administrator would intentionally set up such a deployment and create a mapping in h2o.conf that ties the certificate to the IP address of the backend server. Then, the question is why the DNS entry of that application can't point to the address of H2O, with H2O being the router. There is no benefit in exposing the IP address of the backend server (that you are trying to protect) to the public.

Oh, I fear my explanation was not enough. I totally agree with you that keeping the server private is beneficial. That's one of the main reasons we deploy a reverse proxy up front and make it a frontline defense. By the way, many enterprise customers already deploy a frontline defense. What I want to say is not "exposing the backend server IP is needed", but "customers (operators) favor transparent deployment, and exposing the upstream IP address to the client is usually not a problem, since it has already been exposed to the client with or without transparent deployment of the server we talked about".

Let's look at the following picture:

(user) -- (internet) --- (dmz) -- (fw) --(a)-- (reverse proxy) -- (backend server)

When some function, whatever it is, is called for at point (a), we can give users more value if we can deploy the function transparently, since it is just plug-and-play. That does not mean that we put the backend server up front. Users just want to add a function to an already settled environment with little or no change to that environment.

To summarize:

Warm regards,

kazuho commented 4 years ago

> When some function, whatever it is, is called for at point (a), we can give users more value if we can deploy the function transparently, since it is just plug-and-play. That does not mean that we put the backend server up front. Users just want to add a function to an already settled environment with little or no change to that environment.

Yeah, my point is that it does require a configuration change, because H2O is not a forward proxy. Because H2O is a reverse proxy, you have to add a mapping to H2O, that maps a specific hostname to a specific backend. I am not sure how much I buy the argument that keeping the original IP address of the backend server is a benefit.

All that said, I would not be against adding code that allows H2O to accept connections to a non-local address if the change is going to be tiny and isolated.

> However, as you said, transparency to only the "server" seems feasible using source IP spoofing alone. Proxying needs 3 parts to operate "fully" transparently: (1) listening on a non-local socket, (2) source IP address spoofing, and (3) using the original destination IP address. Each part can be independent of the others. In my implementation, I've done (1) and (2, 3) as separate features. But you are saying that providing just (2) can be a feature. I hope I'm following you correctly. If so, I can keep all the features separate, and operators can choose and combine them as they need.

I think my point is that the following two are separate:

  • connecting to backend servers from a specific source address
  • accepting connections on a non-local address

These two are transport-level concepts, and they can be added to our socket layer. Then, we can build HTTP-level features on top of it.

cwyang commented 4 years ago

> Yeah, my point is that it does require a configuration change, because H2O is not a forward proxy. Because H2O is a reverse proxy, you have to add a mapping to H2O, that maps a specific hostname to a specific backend. I am not sure how much I buy the argument that keeping the original IP address of the backend server is a benefit.

When the H2O-based transparent reverse proxy that I'm proposing goes into operation, we don't have to change the configuration of already deployed systems. Yes, we have to set up the transparent reverse proxy, for example by adding mappings. My proposed `tproxy` target for proxy.reverse.url, such as https://tproxy/, which uses the original destination IP address to connect to the backend, can be used to reduce the setup overhead on the proxy. I'm afraid it's somewhat weird syntax, though.

> All that said, I would not be against adding code that allows H2O to accept connections to a non-local address if the change is going to be tiny and isolated.

> I think my point is that the following two are separate:

>   • connecting to backend servers from a specific source address
>   • accepting connections on a non-local address

> These two are transport-level concepts, and they can be added to our socket layer. Then, we can build HTTP-level features on top of it.

OK. I'll prepare a PR.

Accepting connections to a non-local address is simple. However, connecting to a backend server with source IP spoofing is not that straightforward, since the connection pool and socket pool must be managed not only per backend but also per client IP address. I think I can manage it, but to meet the established code quality, your guidance will be invaluable.

Originally I was preparing the feature as one PR, since I thought of it as a single feature. But I think it's better to split it into logical sub-features. How should I prepare the PR? Should I split it into several PRs, or keep one PR but split the commits by logical sub-feature?

Thank you for your kind comments. Best regards,

kazuho commented 4 years ago

Regarding configuration, I think it can be an option of the reverse proxy configuration, much like how we support the PROXY protocol.

As a way of supporting the PROXY protocol, we already have a mechanism for passing the 4-tuple through the proxy handler to httpclient, which relies on the expectation that a connection will be opened for each HTTP request (see the last lines of lib/core/proxy.c).

I tend to think that the only exception would be when to pass the 4-tuple to the http client (and to the socket layer that would use those values to call bind, connect).