python-web-sig / wsgi-ng

Working group for wsgi-ng

REMOTE_ADDR invalid when running a domain socket server #11

Open rbtcollins opened 9 years ago

rbtcollins commented 9 years ago

It's legitimate to offer HTTP over e.g. domain sockets, but REMOTE_ADDR is not currently defined for that situation.

unbit commented 9 years ago

Technically (barring some very specific scenario that I do not think matters here), when you have an HTTP server bound on UNIX sockets, you have a frontend proxy/router bound on AF_INET that has the REMOTE_ADDR information/value.

There are various ways used by proxies to pass this value: the classic X-Forwarded-For, RFC 7239, and other custom headers.

My proposal is to "suggest" that server developers respect those headers (at least RFC 7239). If those known headers are missing, we could give a "default rule" of leaving that value empty (so the server can eventually reject it or assign a fake value like 0.0.0.0).
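A minimal sketch of that default rule, assuming a hypothetical helper named `client_addr_from_headers` (it is not part of any WSGI server). It prefers the RFC 7239 `Forwarded` header, falls back to `X-Forwarded-For`, and returns an empty string when neither is present. The parsing is deliberately simplified, and the function trusts the headers blindly, which is only reasonable when a known front-end proxy sets them.

```python
def client_addr_from_headers(environ):
    """Return the client address advertised by a front-end proxy, or ''.

    Hypothetical helper: simplified parsing, no trust checks.
    """
    forwarded = environ.get('HTTP_FORWARDED', '')
    if forwarded:
        # Use the first element, which describes the hop nearest the client.
        first_hop = forwarded.split(',')[0]
        for pair in first_hop.split(';'):
            name, _, value = pair.strip().partition('=')
            if name.lower() == 'for' and value:
                value = value.strip('"')
                if value.startswith('['):
                    # Bracketed IPv6 form, e.g. "[2001:db8::1]:4711"
                    return value[1:].split(']')[0]
                # IPv4 form, optionally followed by a port
                return value.split(':')[0]

    xff = environ.get('HTTP_X_FORWARDED_FOR', '')
    if xff:
        return xff.split(',')[0].strip()

    # Neither header present: the "default rule" of an empty value.
    return ''
```

A server could call this only for connections accepted on a UNIX socket and treat the empty result as "reject or substitute a placeholder".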

rbtcollins commented 9 years ago

http://tools.ietf.org/html/rfc7239 - the standardised replacement for X-Forwarded-For offers a clean interface we could use to address this.

rbtcollins commented 9 years ago

http://www.ietf.org/rfc/rfc3875 is fairly clear that REMOTE_ADDR is the directly connected client; I think we should make sure there is a library available to use RFC 7239 and not do anything special in WSGI for it, other than noting that REMOTE_ADDR may be blank to indicate 'no client address' (or perhaps we should use unix: to indicate that it's a unix process?)

rbtcollins commented 9 years ago

Oh, an alternative would be to set REMOTE_ADDR=127.0.0.1 which for most OS's is arguably correct: domain sockets are localhost only. OTOH I think clarity here is better than guessing.

GrahamDumpleton commented 9 years ago

I have always felt that using REMOTE_ADDR=127.0.0.1 for a request that came over a UNIX socket connection was the safest option if the WSGI server didn't have other trusted information, available from the front end on the same system which is proxying to it, that would allow the value to be overridden with the true one.

Although 0.0.0.0 is attractive as being distinguishable, I am concerned about what people may be doing with this value beyond straight comparison against white/black lists.

Reverse lookup on 0.0.0.0 will generally fail, whereas 127.0.0.1 will at least return something.

$ host 127.0.0.1
1.0.0.127.in-addr.arpa domain name pointer localhost.
$ host 0.0.0.0
0.0.0.0.in-addr.arpa has no PTR record

I have noted that use of 0.0.0.0 in other situations doesn't always work the same across system types as well.

Do we have a good idea what people do with REMOTE_ADDR beyond literal comparison or pattern matching of some sort, such as glob or subnet class matching?

In mod_wsgi daemon mode, where requests are accepted over a UNIX socket connection from the Apache worker processes with its own protocol, because mod_wsgi handles both ends of the connection it is able to ensure that the value is set correctly as the value of REMOTE_ADDR will be that from the Apache worker process which originally accepted the original request.

Other systems using FASTCGI and SCGI would be the same and should preserve REMOTE_ADDR. I expect that uWSGI would be the same when using the uwsgi protocol (yes/no).

So I assume the specific problem case is restricted to the feature of some servers to transfer the original HTTP request to a secondary process via a UNIX socket. For example mod_proxy_fdpass in Apache 2.4 and similar features in nginx, or using something like Circus.

Can we list some more specific examples/configuration scenarios, with names of servers and protocols involved, where this is actually an issue, so we know for sure which cases we are trying to address?

unbit commented 9 years ago

@GrahamDumpleton yes, any CGI-based network protocol (SCGI, FastCGI, uwsgi) should be safe, as the REMOTE_ADDR is already built by the webserver. My only fear with mapping REMOTE_ADDR to 127.0.0.1 is that a lot of users use REMOTE_ADDR as a first-level authorization. We should signal to them in some way that this value should not be trusted.

GrahamDumpleton commented 9 years ago

@unbit But do you actually know of real-world examples of what people do with REMOTE_ADDR beyond literal comparison or pattern matching of some sort, such as glob or subnet class matching?

We are discussing a solution to a problem for which, so far as I can see, we don't have documented examples of what people do with the value, examples that would allow us to best know what the implications may be of either:

Does anyone know what major web frameworks like Django, Werkzeug/Flask, Pyramid etc do with REMOTE_ADDR if anything?

Does anyone have actual examples of what custom user WSGI applications may do?

I would like to see examples so I can gauge the impact of any suggestion. You may have seen such examples, but I haven't and others may not have either, so having them will help various people I am sure.

If you can also provide a simple example of how nginx/uWSGI might be used where this is all an issue that would also be useful. This way anyone not familiar with scenarios where it can be a problem can experiment with a setup if they feel like it.

rbtcollins commented 9 years ago

I don't have examples to hand. The nginx thing is this: you can configure nginx to backend to a unix domain socket for its various things: http://nginx.org/en/docs/http/ngx_http_upstream_module.html - and when this is done there's no signalling at that layer of the remote_addr - but protocols like fastcgi and uwsgi can pass arbitrary cgi variables across, and nginx has a syntax for specifying them. So it may well depend on the very specific backend that is in use - http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_pass for instance I'd expect to fall down in this regard, but uwsgi or fastcgi should be fine.

unbit commented 9 years ago

Honestly, it looks to me like the use of REMOTE_ADDR is everywhere in all of the major frameworks. Logging, security and accounting policies can all be based on it. As an example, Django exposes it as request.META['REMOTE_ADDR'], and the value is used by the debug toolbar to authorize its use. Frankly, it is one of those fields for which security (IMHO) must be the principal factor.

GrahamDumpleton commented 9 years ago

So when I say examples, I mean actual explicit examples of how it is used in the code. So, for example, Django debug toolbar uses:

def show_toolbar(request):
    """
    Default function to determine whether to show the toolbar on a given page.
    """
    if request.META.get('REMOTE_ADDR', None) not in settings.INTERNAL_IPS:
        return False

    if request.is_ajax():
        return False

    return bool(settings.DEBUG)

where it is checking membership in a sequence of some sort.

In this case leaving out REMOTE_ADDR or setting it to None would be a totally fine solution.

What would be bad, though, in the case of Django debug toolbar would be for the server to use 127.0.0.1 as a fallback, because 127.0.0.1 is the exact IP that would be in the toolbar's allowed client list. If REMOTE_ADDR was set to 127.0.0.1 for a UNIX socket, you have then inadvertently allowed anyone to access the debug toolbar.

In other words, this rules out using 127.0.0.1.
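To make the failure mode concrete, here is a small stand-alone sketch that mimics the membership test above without Django. `INTERNAL_IPS` and `toolbar_allowed` are stand-ins invented for illustration; only the membership check itself mirrors the toolbar code.

```python
# Stand-in for settings.INTERNAL_IPS in a typical development setup.
INTERNAL_IPS = ('127.0.0.1',)


def toolbar_allowed(environ):
    """Mimic the toolbar's membership check on REMOTE_ADDR."""
    return environ.get('REMOTE_ADDR', None) in INTERNAL_IPS


# Candidate values a server might use for a UNIX socket connection.
candidates = {
    'key omitted': {},
    'None': {'REMOTE_ADDR': None},
    'empty string': {'REMOTE_ADDR': ''},
    '0.0.0.0': {'REMOTE_ADDR': '0.0.0.0'},
    '127.0.0.1': {'REMOTE_ADDR': '127.0.0.1'},
}

results = {label: toolbar_allowed(env) for label, env in candidates.items()}
for label, allowed in results.items():
    print(label, '->', 'toolbar shown' if allowed else 'toolbar hidden')
```

Of these candidates, only 127.0.0.1 passes the check, which is exactly the case where a proxied remote client would be mistaken for a local one.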

Now can you see why I want to see actual code examples of how it is used? Vague statements about how it might be used, such as for logging, aren't quite enough.

So can anyone else think of any other actual packages which use REMOTE_ADDR, so we can look at some more examples?

gvanrossum commented 9 years ago

If you query stackoverflow for REMOTE_ADDR you will find a lot of relevant questions (just add 'python' to the query, otherwise most of the results are about PHP :-).

--Guido van Rossum (python.org/~guido)

GrahamDumpleton commented 9 years ago

Just for reference, this ticket was spawned from discussion on WEB-SIG list at root messages of:

Part of that discussion discusses use of X-Forwarded-* and Forwarded headers as means to override REMOTE_ADDR.

GrahamDumpleton commented 9 years ago

In the WEB-SIG mailing list discussion I mentioned the need for any WSGI middleware to be configurable if considering using X-Forwarded-* headers to override REMOTE_ADDR. As an example to compare against, one can look at what mod_rpaf for Apache does:

For mod_wsgi at least, using mod_rpaf is certainly one option if one wants a solution which a system administrator can implement and manage themselves without requiring a WSGI application developer to embed something in their own application.

GrahamDumpleton commented 9 years ago

As to consulting StackOverflow, the bulk of the questions are about how to access REMOTE_ADDR for a request in a particular web framework or at the WSGI level, not what I am looking for, which is how people are then using it. I am after the latter to understand what the implications may be of selecting some stand-in value, be it None, an empty string or some other value which doesn't look like an IP.

After going through over 50 items from the search, the only example I have found so far is in using it for geo location.

from django.contrib.gis.utils import GeoIP

g = GeoIP()
ip = request.META.get('REMOTE_ADDR', None)
if ip:
    # Only look up and report the city when an address is available.
    city = g.city(ip)['city']
    print("ip:", ip, " city:", city)

In the GeoIP library there is one interesting check:

        ipv = 6 if addr.find(':') >= 0 else 4

This potentially tells us some things will actually assume the format is that of an IPV4 or IPV6 address.

Take now the prior suggestion of using 'unix:'.

Now I am not sure if the ':' in that was intentional, but in that check in GeoIP, it would cause the value to be treated as an IPV6 address.

So using a non IPV6 or IPV4 value may be an issue.
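The effect of a placeholder such as 'unix:' on that kind of check can be sketched as follows. `guess_ip_version` copies the one-line test quoted above; `parse_ip_version` is a hypothetical stricter alternative built on the standard library `ipaddress` module (Python 3), not something GeoIP itself does.

```python
import ipaddress


def guess_ip_version(addr):
    """Same heuristic as the GeoIP check quoted above."""
    return 6 if addr.find(':') >= 0 else 4


def parse_ip_version(addr):
    """Return 4 or 6 for a real IP address, or None for anything else."""
    try:
        return ipaddress.ip_address(addr).version
    except ValueError:
        return None


for value in ('192.0.2.1', '2001:db8::1', 'unix:', ''):
    print(repr(value), guess_ip_version(value), parse_ip_version(value))
```

The heuristic classifies 'unix:' as IPv6 and an empty string as IPv4, while the stricter parse rejects both.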

The geo location services are actually interesting. Consider:

One would have thought that 0.0.0.0 wouldn't map to anything, but the service does actually still return data for it. Not sure if this is them just being smart in the web based interface and displaying their office location or something.

Another one wouldn't expect to get data for but do is:

When you use:

though, it doesn't display anything. Similarly:

doesn't display anything either.

This case is an interesting data point around whether one could use 0.0.0.0 in that the geo location data still shows some data.

unbit commented 9 years ago

Ok, as I was the first one to propose 0.0.0.0, I completely step back, as it is a truly bad idea. But we need a way to signal an "untrusted" REMOTE_ADDR.

GrahamDumpleton commented 9 years ago

Am not sure that 0.0.0.0 is entirely ruled out at this point because I don't believe the geo location service should be returning anything. Looks more like they have crap data, which could always happen.

The IP address assignments are:

0.0.0.0/8 - Used for broadcast messages to the current ("this") network as specified by RFC 1700, page 4.

So 0.0.0.0 should not be bound to a specific geographic location.

Another good candidate might have been:

255.255.255.255/32 - Reserved for the "limited broadcast" destination address, as specified by RFC 6890.

That yields results from the geolocation services as well.
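Whether a value is a real, routable client address can also be checked locally instead of relying on a geolocation service's data. A minimal sketch, assuming Python 3's standard `ipaddress` module; `is_plausible_client` is a made-up helper name.

```python
import ipaddress


def is_plausible_client(addr):
    """True only for a syntactically valid, globally routable IP address."""
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        # '', 'unix:' and other non-IP placeholders end up here.
        return False
    # Special-purpose addresses such as 0.0.0.0, 127.0.0.1 and
    # 255.255.255.255 are not global.
    return ip.is_global
```

Code that feeds REMOTE_ADDR to a backend could use a guard like this to skip the lookup for any of the stand-in values discussed here.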

GrahamDumpleton commented 9 years ago

FWIW, just because 127.0.0.1 is a common one to be used in the allowed client host list for Django debug toolbar doesn't mean it is an entirely bad one to use either, especially considering that if you have a local proxy on the same host using HTTP over a normal INET socket connection, that is exactly what you will get for that as well.

So using Django debug toolbar on a machine with a local proxy is just bad news altogether if that host is also public facing.

This is where a middleware or server module which can be configured to actually override REMOTE_ADDR may well end up still being desirable either way. You just need the ability to say which proxy IPs you trust when allowing headers to override it. With multiple hops you need such a mechanism, with trust built in at each hop, and with each hop clearing out headers from sources it doesn't trust so they aren't proxied through and then appear to come from a trusted proxy.
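A rough sketch of such a configurable middleware follows. `TrustedProxyMiddleware` and its `trusted_proxies` argument are hypothetical names chosen for illustration, not an existing library's API, and only `X-Forwarded-For` is handled to keep the example short.

```python
class TrustedProxyMiddleware:
    """Override REMOTE_ADDR from X-Forwarded-For, but only via trusted proxies.

    Hypothetical sketch: names and behaviour are assumptions.
    """

    def __init__(self, app, trusted_proxies):
        self.app = app
        self.trusted = frozenset(trusted_proxies)

    def __call__(self, environ, start_response):
        peer = environ.get('REMOTE_ADDR')
        # Always remove the header so the application only ever sees the
        # REMOTE_ADDR value decided here.
        forwarded = environ.pop('HTTP_X_FORWARDED_FOR', '')
        if peer in self.trusted and forwarded:
            hops = [h.strip() for h in forwarded.split(',') if h.strip()]
            # Walk from the nearest hop outwards; the first address that is
            # not one of our own proxies is the best available client address.
            while hops and hops[-1] in self.trusted:
                hops.pop()
            if hops:
                environ['REMOTE_ADDR'] = hops[-1]
        return self.app(environ, start_response)
```

Wrapping an application as `TrustedProxyMiddleware(app, ['10.0.0.1'])` would then accept the header only when the direct peer is 10.0.0.1. Entries further left in the header than the first untrusted hop are ignored, since a client can put anything there.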

All quite a pain.

tilgovi commented 9 years ago

Came over here from the benoitc/gunicorn#797 thread. I'll add what I found when looking at Pyramid.

The debug toolbar ignores the request if REMOTE_ADDR is None. The request object has a client_addr property that's explicitly warned (in a big, red box) as potentially dangerous in the documentation. The remote_addr property gets and sets only the REMOTE_ADDR key in the environment.

gvanrossum commented 9 years ago

So how do developers typically use this? And how do they connect to the app? Via the front end or from an internal IP or logged in to the box running the app?

tilgovi commented 9 years ago

I think the typical use would be during development, when the server is only bound on localhost anyway. A check against a list of hosts is a paranoia feature for those who don't disable the debugger in production or who need specifically to opt in to debugging of production.

gvanrossum commented 9 years ago

So what I'm trying to get at is whether (in this use case) it would be more useful if REMOTE_ADDR was set to the IP address that's directly at the other end of the socket (which in production would be some kind of reverse proxy) or whether it would be more useful if it was set to the IP address that the reverse proxy saw at the other end of its socket.

In the first case, using REMOTE_ADDR to determine whether a request is coming from a legitimate developer's machine would require the developer to contact the app directly (bypassing the reverse proxy), while in the latter case the developer would have to connect to the service via the reverse proxy.

Translating this to the way we typically set up special access at Dropbox, we would almost always prefer the latter setup, since most developers don't have direct access to the prod internal network, but they do get granted privileges based (in some cases) on the originating IP address. Also, our reverse proxies do more than just load balancing, and the service would not function without the reverse proxy at all.

(And no, the way this works at Dropbox does not use CGI. :-)

--Guido van Rossum (python.org/~guido)

rbtcollins commented 9 years ago

I think there is a tension here between our adoption of the CGI variable definitions and the realities of production deployments. Unless we drop our inheritance of CGI's definitions, I think REMOTE_ADDR being overridden to be something other than the remote end would be confusing at best, and a security issue at worst. REMOTE_ADDR exists to provide data that cannot be signalled via the regular protocol headers, because it's not part of the protocol. Overriding it as a site-specific choice in a server stack is of course possible, and I don't see any harm in endorsing that.

But as there are (now) standards for communicating the complex setups that production environments (like Dropbox's, or python-infra's, or ...) have through the various tiers, we can defer to those and don't need to reinvent them or specify a specific algorithm for servers. I hesitate to roll them into the spec, because that becomes a hard requirement for every server developer, when one could write middleware once and reuse it across servers without any impact on the protocol itself.

The key aspect that remains is what to set REMOTE_ADDR to when there is legitimately no network connection involved. Note that a multiplexed domain socket like uwsgi uses is not the same as no TCP connection: there is a TCP connection to nginx in that case, and the uwsgi module is multiplexing data through to the backends over domain sockets, but it's the job of the server to pass the endpoint details through appropriately. A server running http/1.x or h2 directly on a domain socket, however, legitimately has no TCP details to include.

Options so far: None, '0.0.0.0', '127.0.0.1', '', ':unix:' or similar, '255.255.255.255'.

I'd like to add another: remove the variable entirely if there is no well defined value for it.
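On the server side, that option could look roughly like the sketch below. `remote_environ` is a hypothetical helper, not taken from any existing server; it assumes the server knows the address family of the listening socket and the result of `getpeername()` on the accepted connection.

```python
import socket


def remote_environ(family, peername):
    """Build the REMOTE_* part of a WSGI environ for one connection.

    Hypothetical helper: `family` is the socket's address family and
    `peername` is what getpeername() returned for the accepted socket.
    """
    if family in (socket.AF_INET, socket.AF_INET6):
        host, port = peername[0], peername[1]
        return {'REMOTE_ADDR': host, 'REMOTE_PORT': str(port)}
    # Anything else (e.g. a UNIX domain socket) has no IP endpoint, so the
    # keys are left out instead of being filled with a guessed value.
    return {}
```

With the key absent, application code that reads `environ['REMOTE_ADDR']` raises `KeyError`, while code that uses `environ.get('REMOTE_ADDR')` sees `None`.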

I think we want confidence that, whatever we choose, existing code written to the previously published spec will fail safely (that is, it will abort requests rather than act as a security hole).

Ideally we don't want the result to be interpretable as a legitimate network address: the example of geolocation services giving results for both 0.0.0.0 and 255.255.255.255 shows that passing bogus values onto backends may give bogus answers - and such bogus answers getting into a pipeline is often the lead-in to a security issue.

The risk of 127.0.0.1 is that naive code may assume that it is trusted. A buggy server (that is, one where the domain socket aspect is really part of an implementation that's failing to forward the actual endpoint details in the environment) that triggers the use of 127.0.0.1 is then liable to expose scripts to remote endpoints that they think are local.

OTOH that's exactly the same risk we're going to introduce if we choose any magic value: once code adapts to that magic value, a misconfigured environment that claims 'no network details' when the reality is 'tunnelled over domain sockets' or similar will create the identical security issue, if code has assumed that no details == must be local [which is a reasonable assumption].

This all leads me to say that 127.0.0.1 is an appropriate answer: if there truly are no network address details then either the connection is local (e.g. unix socket) or the server is misconfigured and failing to pass the details on (in which case the server / server glue is at fault).

benoitc commented 9 years ago

This all leads me to say that 127.0.0.1 is an appropriate answer: if there truly are no network address details then either the connection is local (e.g. unix socket) or the server is misconfigured and failing to pass the details on (in which case the server / server glue is at fault).

No, not really. A unix socket is local, but the request to the server may not be if it comes from a proxy. If 127.0.0.1 is specified, applications won't know for sure that the connection is local and could allow unwanted things. This was discussed in benoitc/gunicorn#633 when the issue was raised, for example.

In gunicorn the REMOTE_ADDR is never None if you connect to it directly using TCP. It's None if you connect to it directly via a unix socket (no client socket address) or when you connect via a proxy that doesn't provide the REMOTE_ADDR header.

Personally I don't think it should be the role of a WSGI server to add information based on an assumption. Even if you base it on the Forwarded header, the server will need to be cautious, since this header can be forced by the client.

Maybe we could add the transport information to the environment? If REMOTE_ADDR is None and SERVER_TRANSPORT is unix, then the application could check other information in the request. Thoughts?
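How an application might consume such a key can be sketched as below. `SERVER_TRANSPORT` is only the name floated in this comment; it is not defined by WSGI, and `describe_peer` is a made-up helper for illustration.

```python
def describe_peer(environ):
    """Classify the connection using REMOTE_ADDR plus a transport hint.

    SERVER_TRANSPORT is a proposed, hypothetical environ key.
    """
    addr = environ.get('REMOTE_ADDR')
    transport = environ.get('SERVER_TRANSPORT')
    if addr:
        return ('address', addr)
    if transport == 'unix':
        # No IP endpoint: the peer is a local process or a proxy, so other
        # request information (e.g. forwarding headers) must be consulted.
        return ('unix-socket', None)
    # No address and no transport hint: treat as unknown and untrusted.
    return ('unknown', None)
```

The point of the extra key is that "no address" stops being ambiguous: the application can tell a UNIX socket apart from a server that simply failed to pass the address on.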