Great observation @skalee! I think we should take the rack-attack approach to allow these cases and their responses (1-4) to be configurable. But which one is most desired for now? I suppose 4.
Some questions:
@ribose-jeffreylau
How does 4 work?
For example:
```ruby
enc = 'US-ASCII'
some_string.encode(enc, enc, invalid: :replace, undef: :replace, replace: "%25")
```
But I can't say I've thought it through thoroughly yet. Here I assume headers should be encoded in US-ASCII, which I suppose is standard, though I need to check that later. Anyway, I am open to comments. I agree that in some situations different replacement strings may be more suitable. Fortunately, the String#encode method is quite flexible.
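To show what those replacement options do in practice, here is a minimal sketch (my own example, not from the gem; the literal strings are made up). Transcoding to US-ASCII replaces characters undefined in that charset, and String#scrub (Ruby 2.1+) handles bytes that are invalid in the string's own encoding:

```ruby
# Characters undefined in the target charset are swapped for the replacement string:
"café".encode('US-ASCII', invalid: :replace, undef: :replace, replace: '%25')
# => "caf%25"

# For bytes that are invalid in the string's own encoding, String#scrub (Ruby 2.1+)
# replaces them without transcoding:
"caf\xC3".scrub('%25')
# => "caf%25"

# NB (worth double-checking): IIRC MRI treats transcoding to the same encoding as a
# no-op, which is why #scrub may be the safer tool when source and destination match.
```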
Also, by nullifying, do you mean replacing the offending characters with an empty string?
I mean assigning nil or removing that key from env (if possible), but replacing with an empty string might be a better idea; I need to think about it.
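For concreteness, the three alternatives could look like this on a Rack env hash (just an illustrative sketch; the header key is picked arbitrarily):

```ruby
env['HTTP_REFERER'] = nil    # nullify the value
env.delete('HTTP_REFERER')   # remove the key from env entirely
env['HTTP_REFERER'] = ''     # replace the value with an empty string
```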
Thanks @skalee.
Here I assume headers should be encoded in US-ASCII which I suppose is standard
I just went and checked. It seems the way to do it is to leave them alone, if I understand correctly what "opaque data" means in RFC 7230, section 3.2.6:
Newly defined header fields SHOULD limit their field values to US-ASCII octets. A recipient SHOULD treat other octets in field content (obs-text) as opaque data.
🤔
AFAIU it's about newly defined (that is, non-standard) fields, as HTTP/1.0 was more permissive in this matter. A broader quote from page 25:
Historically, HTTP has allowed field content with text in the ISO-8859-1 charset [ISO-8859-1], supporting other charsets only through use of [RFC2047] encoding. In practice, most HTTP header field values use only a subset of the US-ASCII charset [USASCII]. Newly defined header fields SHOULD limit their field values to US-ASCII octets. A recipient SHOULD treat other octets in field content (obs-text) as opaque data.
Also, from p. 19:
The normal procedure for parsing an HTTP message is to read the start-line into a structure, read each header field into a hash table by field name until the empty line, and then use the parsed data to determine if a message body is expected. If a message body has been indicated, then it is read as a stream until an amount of octets equal to the message body length is read or the connection is closed.
So that other quote, from section 3.2.6, basically extends what's on p. 19: first parse the message and build a hash table which associates header names with their values, then interpret the parsed strings. And if a header value contains characters disallowed in HTTP/1.1 (but allowed in 1.0), treat them as opaque data (do not interpret them at all).
And here we're talking about several standard header fields, each containing a URL or a segment of one.
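To make the "opaque data" point concrete, a check like the following could tell whether a header value contains obs-text octets, i.e. bytes outside US-ASCII (the helper name is mine, not from any library):

```ruby
# Hypothetical helper: true if the value contains obs-text octets (0x80-0xFF).
def contains_obs_text?(value)
  value.b.each_byte.any? { |byte| byte >= 0x80 }
end

contains_obs_text?("https://example.com/path")  # => false
contains_obs_text?("na\xC3\xAFve".b)            # => true
```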
@ronaldtse @ribose-jeffreylau At the moment, Rack::Cleanser does:
- Render 400, 404 or some other client error response
And for now I'm going to replace it with:
- Fix it by replacing problematic characters with some predefined replacement string
I don't think it's a good idea to return HTTP 404 when, for instance, HTTP_REFERER has some percent-encoded characters from some charset other than UTF-8.
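As a rough sketch of that replacement approach (entirely hypothetical, not the actual Rack::Cleanser code; the key list, the UTF-8 assumption, and the replacement string are all my own choices), a middleware could scrub the affected env keys instead of rejecting the request:

```ruby
# Hypothetical middleware: scrub selected env keys instead of returning a client error.
class EncodingScrubber
  KEYS = %w[HTTP_REFERER PATH_INFO QUERY_STRING REQUEST_PATH REQUEST_URI
            HTTP_X_FORWARDED_HOST].freeze
  REPLACEMENT = '%25'.freeze

  def initialize(app)
    @app = app
  end

  def call(env)
    KEYS.each do |key|
      value = env[key]
      next if value.nil?
      # Treat the raw bytes as UTF-8 (an assumption) and replace anything
      # that is not a valid UTF-8 sequence with REPLACEMENT.
      utf8 = value.dup.force_encoding(Encoding::UTF_8)
      env[key] = utf8.scrub(REPLACEMENT) unless utf8.valid_encoding?
    end
    @app.call(env)
  end
end
```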
Sure. It would be even better if we made it customizable with some DSL 👍
Obviously, but not now.
@ronaldtse could you push v0.2.0 to RubyGems? I don't have the privileges to do so. The GitHub release is already created.
Sorry, I meant the other gem.
Ahh I tried to push but the version was unchanged. Now I know why 😉
At the moment we care about parameter encoding only, and the middleware is expected to return HTTP 404 when params have broken encoding. But in Rack::Cleanser we cleanse other headers as well; the complete list is HTTP_REFERER, PATH_INFO, QUERY_STRING, REQUEST_PATH, REQUEST_URI, HTTP_X_FORWARDED_HOST. So, what should happen if, for instance, HTTP_REFERER has broken encoding?
Also, I guess, invalid encoding may result from different causes ("\xc2\xa1" is a valid UTF-8 string, whereas "\xc2" is not, nor is "\xa1"). Should I treat such cases differently?
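For reference, those three cases can be checked directly with String#valid_encoding? (assuming the literals carry Ruby's default UTF-8 source encoding):

```ruby
"\xc2\xa1".valid_encoding?  # => true  ("¡", a complete two-byte UTF-8 sequence)
"\xc2".valid_encoding?      # => false (a lead byte missing its continuation byte)
"\xa1".valid_encoding?      # => false (a stray continuation byte)
```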