How to handle incorrectly-encoded headers, like HTTP_REFERER?

skalee commented 7 years ago

At the moment we care about parameters encoding only, and the middleware is expected to return HTTP 404 when params have encoding broken. But in Rack::Cleanser we cleanse other headers as well, the complete list is HTTP_REFERER, PATH_INFO, QUERY_STRING, REQUEST_PATH, REQUEST_URI, HTTP_X_FORWARDED_HOST. So, what should happen if, for instance, HTTP_REFERER has encoding broken?

Render 400, 404 or some other client error response?
Nullify it?
Leave it as it is?
Fix it by replacing problematic characters with some predefined replacement string?

Also, I guess, invalid encoding may be resulting from:

Unescaped unicode character
Incomplete byte sequence (like "\xc2\xa1" is a valid UTF-8 string whereas "\xc2" is not, nor "\xa1")

Should I treat such cases differently?

ronaldtse commented 7 years ago

Great observation @skalee ! I think we should take the rack-attack approach to allow these cases and their responses (1-4) to be configurable.

skalee commented 7 years ago

But which one is most desired for now? I suppose 4.

ribose-jeffreylau commented 7 years ago

Some questions:

How does 4 work? e.g. in the case of "incomplete byte sequence". I'd imagine there'd be multiple scenarios where different replacement strings would be more suitable. Should we provide a set of default replacement strings?
Also, by nullifying, do you mean replacing the offending characters with empty string?

skalee commented 7 years ago

@ribose-jeffreylau

How does 4 work?

For example:

enc = 'US-ASCII'
some_string.encode(enc, enc, invalid: :replace, undef: :replace, replace: "%25")

But I can't say I've thought about it thoroughly already. Here I assume headers should be encoded in US-ASCII which I suppose is standard, though I need to check that later. Anyway, I am open for comments. I agree that perhaps in some situations different replacement strings are suitable. Fortunately, the String#encode method is quite elastic.

Also, by nullifying, do you mean replacing the offending characters with empty string?

I mean assigning nil or removing that key from env (if possible), but replacing with empty string might be a better idea, need to think about it.

ribose-jeffreylau commented 7 years ago

Thanks @skalee .

Here I assume headers should be encoded in US-ASCII which I suppose is standard

I just went and checked --- It seems the way to do it is to leave them alone, if I understand correctly what "opaque data" means in RFC 7230 section 3.2.6:

Newly defined header fields SHOULD limit their field values to US-ASCII octets. A recipient SHOULD treat other octets in field content (obs-text) as opaque data.

🤔

skalee commented 7 years ago

AFAIU it's about newly defined (that is, non-standard) fields, as HTTP 1.0 was more elastic in this matter. Broader quote from page 25:

Historically, HTTP has allowed field content with text in the ISO-8859-1 charset [ISO-8859-1], supporting other charsets only through use of [RFC2047] encoding. In practice, most HTTP header field values use only a subset of the US-ASCII charset [USASCII]. Newly defined header fields SHOULD limit their field values to US-ASCII octets. A recipient SHOULD treat other octets in field content (obs-text) as opaque data.

skalee commented 7 years ago

Also, p. 19,

The normal procedure for parsing an HTTP message is to read the start-line into a structure, read each header field into a hash table by field name until the empty line, and then use the parsed data to determine if a message body is expected. If a message body has been indicated, then it is read as a stream until an amount of octets equal to the message body length is read or the connection is closed.

So that other comment basically extends what's on p. 19. First parse the message, build a hash table which associates header names with their values, then interpret parsed strings. And if header value contains some characters disallowed in HTTP 1.1 (but allowed in 1.0), treat them as opaque data (do not interpret at all).

And here we're talking about several standard header fields, each containing an URL, or segment of it.

skalee commented 7 years ago

@ronaldtse @ribose-jeffreylau At the moment, the Rack::Cleanser does:

Render 400, 404 or some other client error response

And for now I'm going to replace it with:

Fix it by replacing problematic characters with some predefined replacement string

I don't think it's a good idea to return HTTP 404 when, for instance, HTTP_REFERER has some percent-encoded characters from some charset other than UTF-8.

ronaldtse commented 7 years ago

Sure. It would be even better if we make it customizable with some DSL 👍

skalee commented 7 years ago

Obviously, but not now.

skalee commented 7 years ago

@ronaldtse could you push v0.2.0 to Rubygems? I don't have privileges to do so. GitHub release already created.

Sorry, I meant the other gem.

ronaldtse commented 7 years ago

Ahh I tried to push but the version was unchanged. Now I know why 😉

riboseinc / rack-cleanser

How to handle incorrectly-encoded headers, like HTTP_REFERER? #20