HTTP 1 request headers decoded using default encoding instead of ISO-8859-1

relud commented 1 year ago

Is there an existing issue for this?

[X] I have searched the existing issues

Describe the bug

headers are decoded here without specifying their encoding:

https://github.com/sanic-org/sanic/blob/ad4e526c775fc3ce950503d6476d9d344492b0dd/sanic/http/http1.py#L205

On my system (osx using python 3.10.8 installed via homebrew) this causes bytes that are valid characters in ISO-8859-1 but not in UTF-8 to be decoded as surrogate escape characters, e.g. b"\x80" becomes "\udf80" instead of "\x80"

Code snippet

No response

Expected Behavior

headers encoded as ISO-8859-1 with no MIME type to be decoded correctly without using UTF-8 surrogate escape characters.

How do you run Sanic?

As a script (app.run or Sanic.serve)

Operating System

linux

Sanic Version

22.9.1

Additional context

this used to work as expected in Sanic<=20.12.7

Tronic commented 1 year ago

Which client are you using? Browsers should generally stick to ASCII in request headers but will recognize UTF-8 in responses. Using anything other than ASCII is discouraged, and servers should handle it as opaque binary (not ISO-8859-1, and you cannot specify the charset used in headers). IIRC, you can indeed supply binary values in browsers using Javascript fetch API and there is nothing to say which charset it is in.

Sanic uses surrogate escapes so that you can encode it back to original binary, and then decode as ISO-8859-1 if you wish, but I would strongly advice updating your systems to use UTF-8 or ASCII if in any way possible.

relud commented 1 year ago

I would strongly advice updating your systems to use UTF-8 or ASCII if in any way possible.

my clients code is supposed to only send ASCII, but unfortunately I don't own the client code, in practice my application receives garbage that needs to be preserved, and changing how it handles that garbage is undesirable.

you can encode it back to original binary, and then decode as ISO-8859-1 if you wish

I had not figured this out, and that's sufficient for my use case, thank you.

Tronic commented 1 year ago

Earlier Sanic versions handled headers as ISO-8859-1, which was causing trouble when they actually were in UTF-8 (more common nowadays). I had to put a lot of thought into this while reimplementing the HTTP parser code as leaving them as bytes wouldn't be practical either. The surrogate escape coding is WTF-8 which indeed is meant for preserving garbage, being able to restore original bytes of what might be ill-formed UTF-8. I'm glad you found use for this detail of Sanic's implementation, being able to restore those bytes instead of simply showing "replacement character" as a naive implementation might.

relud commented 1 year ago

ASGI is decoding headers with latin-1 https://github.com/sanic-org/sanic/blob/ad4e526c775fc3ce950503d6476d9d344492b0dd/sanic/asgi.py#L123

Tronic commented 1 year ago

@ahopkins Might wish to have a consistent implementation here?

ahopkins commented 1 year ago

Yes. @ChihweiLHBird expressed interest in taking this one on.

ChihweiLHBird commented 1 year ago

ASGI is decoding headers with latin-1

I am a little confused about this... If it is encoded by utf-8, which is the general way, can it still be decoded by latin-1? Is latin-1 a superset of utf8? Or are we going to also decode header as utf-8 in asgi?

ChihweiLHBird commented 1 year ago

Should we also change this? https://github.com/sanic-org/sanic/blob/ad4e526c775fc3ce950503d6476d9d344492b0dd/sanic/asgi.py#L138

gnat commented 1 year ago

This likely has security implications outside of our scope of knowledge. The reason latin-1 is used is because it is a stable, non-changing character set. Unicode can and will change, and also be encoded/decoded/case folded in different ways.

See: https://engineering.atspotify.com/2013/06/creative-usernames/

For reference, both Django and Starlette encode/decode headers, cookies and urls to latin-1. Sanic would be "pioneering" encoding/decoding these to unicode-- this is a high risk change.

Tronic commented 1 year ago

I believe ASGI decodes urls in Latin-1, so better keep that as is (ideally it would also be bytes and not needed to be encoded but can't have everything I suppose).

Unicode to codepoints, which is what encoding/decoding does, is perfectly stable. What the codepoints mean variates slightly from version to another but none of that affects Sanic directly (we do no normalization between NFC and NFD, for instance, even though that might be useful/necessary for supporting filenames coming from MacOS).

Latin-1/ISO-8859-1 itself comes in two different versions. One makes characters 0x80-0x9F illegal, while in the other all 0x00-0xFF are legal. Python's latin-1 and ISO-8859-1 both allow these characters despite them not originally being part of ISO-8859-1 (in Unicode U+0000 to U+001F and U+007F to U+009F are control characters, and the non-printable / control character ranges of ISO-8859-1 directly map to these).

gnat commented 1 year ago

So with latin-1 there's of course no normalisation, and it doesn't have characters added and re-organized with nearly every release of Python, unlike Unicode.

If people are pulling any sort of logical identifier out of their cookies, urls, or headers, it's a problem (in Spotify's case, a username used for auth).

The place where unicode is not dangerous is when it's non-logical content such as message strings. I doubt it's an issue for MacOS because: a) local file system b) filenames are unrelated to file modes and ownership on unix.

Tronic commented 1 year ago

@relud After a discussion we are now planning to use ISO-8859-1 for request headers within the Sanic built-in server as well, to make it match ASGI and other frameworks that behave this way, also matching the behavior of older Sanic releases. It is noted that ISO-8859-1 can also be encoded back to original bytes, from which one can obtain UTF-8 or other decoding if needed.

As your bug report was apparently the first we've received on this, the issue is probably not affecting many at all, but at least this should make your implementation a bit easier. This is a breaking change, so no promises yet on when it will be released, even if everyone else is using ASCII headers and thus isn't affected by it.

Tronic commented 1 year ago

Taking that back. We did some testing and browsers appear to be sending cookie headers in UTF-8. This would break if headers were decoded as ISO-8859-1. Similarly, set-cookie response header can only be in UTF-8, browsers do not accept ISO-8859-1.

It is a mess that may need a bit more time to sort out.

sanic-org / sanic