wsgiref simple_server PATH_INFO treats slashes and %2F the same

BPO	28355
Nosy	@pjeby

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

```python
assignee = None
closed_at = None
created_at =
labels = ['type-bug']
title = 'wsgiref simple_server PATH_INFO treats slashes and %2F the same'
updated_at =
user = 'https://bugs.python.org/tdammers'
```

bugs.python.org fields:

```python
activity =
actor = 'ned.deily'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = []
creation =
creator = 'tdammers'
dependencies = []
files = []
hgrepos = []
issue_num = 28355
keywords = []
message_count = 1.0
messages = ['278032']
nosy_count = 2.0
nosy_names = ['pje', 'tdammers']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue28355'
versions = ['Python 3.4']
```

The WSGI reference implementation does not provide any means for application code to distinguish between the following request lines:

GET /foo/bar HTTP/1.1

GET /foo%2Fbar HTTP/1.1

Now, the relevant RFC 1945 (https://tools.ietf.org/html/rfc1945#section-3.2) does not explicitly state how application code should handle these, but its BNF clearly distinguishes encoded from unencoded forward slashes. This suggests that a percent-encoded slash should be considered part of a path segment, while an unencoded slash should be considered a segment separator, and thus that the first URL is supposed to be interpreted as ['foo', 'bar'], but the second one as ['foo/bar']. However, the 'PATH_INFO' WSGI environ variable contains the same string, '/foo/bar', in both cases, making it impossible for application code to tell the two requests apart.

I believe the underlying issue is that percent-decoding (and decoding URLs into UTF-8) happens before 'PATH_INFO' is interpreted, which is unavoidable because of the design decision to present PATH_INFO as a unicode string. If it were kept as a bytestring, then interpreting it would remain the sole responsibility of the application code; if it were a fully parsed list of unicode path segments, then the splitting could be implemented correctly.
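To make the ordering problem concrete, here is a minimal sketch using only `urllib.parse.unquote`. The two helper functions are hypothetical names invented for this illustration and are not part of wsgiref; the first mimics decoding before splitting, the second splits on literal slashes before decoding each segment. The example paths are pure ASCII, so the choice of character encoding does not affect the result.

```python
from urllib.parse import unquote


def decode_then_split(raw_path):
    """Percent-decode the whole path first, then split on '/'.

    After decoding, an encoded slash (%2F) is indistinguishable
    from a literal one, so both become segment separators.
    """
    decoded = unquote(raw_path)
    return [segment for segment in decoded.split("/") if segment]


def split_then_decode(raw_path):
    """Split on literal '/' first, then percent-decode each segment.

    An encoded slash survives the split and ends up inside a segment.
    """
    raw_segments = [segment for segment in raw_path.split("/") if segment]
    return [unquote(segment) for segment in raw_segments]


for path in ("/foo/bar", "/foo%2Fbar"):
    print(path, decode_then_split(path), split_then_decode(path))
```

With decode-then-split, both request paths yield `['foo', 'bar']`; with split-then-decode, the second path yields `['foo/bar']`, which is the distinction the application currently cannot recover from 'PATH_INFO'.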

Unfortunately, I cannot see a pleasant way of fixing this without breaking a whole lot of stuff, but maybe someone else does.
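For what it's worth, one possible workaround (not a fix, and not something wsgiref provides today) might be to subclass `WSGIRequestHandler` and pass the undecoded path along under an extra environ key. In the sketch below, the class name `RawPathRequestHandler` and the key `example.raw_path` are assumptions made up for this illustration; only applications that know about the key would benefit, and 'PATH_INFO' itself is left untouched.

```python
from wsgiref.simple_server import WSGIRequestHandler


class RawPathRequestHandler(WSGIRequestHandler):
    """Hypothetical handler that also exposes the undecoded request path."""

    def get_environ(self):
        env = super().get_environ()
        # self.path is the request target exactly as it appeared on the
        # request line; drop the query string but do not percent-decode.
        # 'example.raw_path' is a made-up, non-standard key.
        env["example.raw_path"] = self.path.split("?", 1)[0]
        return env
```

The handler could then be passed to `wsgiref.simple_server.make_server(..., handler_class=RawPathRequestHandler)`, and an application could split `environ['example.raw_path']` on '/' before decoding each segment.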

It's also very possible that I interpret the RFC incorrectly, in which case please enlighten me.

wsgiref simple_server PATH_INFO treats slashes and %2F the same #72541