Closed relud closed 1 year ago
Which client are you using? Browsers should generally stick to ASCII in request headers but will recognize UTF-8 in responses. Using anything other than ASCII is discouraged, and servers should handle it as opaque binary (not ISO-8859-1, and you cannot specify the charset used in headers). IIRC, you can indeed supply binary values in browsers using Javascript fetch
API and there is nothing to say which charset it is in.
Sanic uses surrogate escapes so that you can encode it back to original binary, and then decode as ISO-8859-1 if you wish, but I would strongly advice updating your systems to use UTF-8 or ASCII if in any way possible.
I would strongly advice updating your systems to use UTF-8 or ASCII if in any way possible.
my clients code is supposed to only send ASCII, but unfortunately I don't own the client code, in practice my application receives garbage that needs to be preserved, and changing how it handles that garbage is undesirable.
you can encode it back to original binary, and then decode as ISO-8859-1 if you wish
I had not figured this out, and that's sufficient for my use case, thank you.
Earlier Sanic versions handled headers as ISO-8859-1, which was causing trouble when they actually were in UTF-8 (more common nowadays). I had to put a lot of thought into this while reimplementing the HTTP parser code as leaving them as bytes
wouldn't be practical either. The surrogate escape coding is WTF-8 which indeed is meant for preserving garbage, being able to restore original bytes of what might be ill-formed UTF-8. I'm glad you found use for this detail of Sanic's implementation, being able to restore those bytes instead of simply showing "replacement character" as a naive implementation might.
ASGI is decoding headers with latin-1 https://github.com/sanic-org/sanic/blob/ad4e526c775fc3ce950503d6476d9d344492b0dd/sanic/asgi.py#L123
@ahopkins Might wish to have a consistent implementation here?
Yes. @ChihweiLHBird expressed interest in taking this one on.
ASGI is decoding headers with latin-1
I am a little confused about this... If it is encoded by utf-8
, which is the general way, can it still be decoded by latin-1
? Is latin-1
a superset of utf8
? Or are we going to also decode header as utf-8
in asgi?
Should we also change this? https://github.com/sanic-org/sanic/blob/ad4e526c775fc3ce950503d6476d9d344492b0dd/sanic/asgi.py#L138
This likely has security implications outside of our scope of knowledge. The reason latin-1 is used is because it is a stable, non-changing character set. Unicode can and will change, and also be encoded/decoded/case folded in different ways.
See: https://engineering.atspotify.com/2013/06/creative-usernames/
For reference, both Django and Starlette encode/decode headers, cookies and urls to latin-1. Sanic would be "pioneering" encoding/decoding these to unicode-- this is a high risk change.
I believe ASGI decodes urls in Latin-1, so better keep that as is (ideally it would also be bytes and not needed to be encoded but can't have everything I suppose).
Unicode to codepoints, which is what encoding/decoding does, is perfectly stable. What the codepoints mean variates slightly from version to another but none of that affects Sanic directly (we do no normalization between NFC and NFD, for instance, even though that might be useful/necessary for supporting filenames coming from MacOS).
Latin-1/ISO-8859-1 itself comes in two different versions. One makes characters 0x80-0x9F illegal, while in the other all 0x00-0xFF are legal. Python's latin-1
and ISO-8859-1
both allow these characters despite them not originally being part of ISO-8859-1 (in Unicode U+0000 to U+001F and U+007F to U+009F are control characters, and the non-printable / control character ranges of ISO-8859-1 directly map to these).
So with latin-1 there's of course no normalisation, and it doesn't have characters added and re-organized with nearly every release of Python, unlike Unicode.
If people are pulling any sort of logical identifier out of their cookies, urls, or headers, it's a problem (in Spotify's case, a username used for auth).
The place where unicode is not dangerous is when it's non-logical content such as message strings. I doubt it's an issue for MacOS because: a) local file system b) filenames are unrelated to file modes and ownership on unix.
@relud After a discussion we are now planning to use ISO-8859-1 for request headers within the Sanic built-in server as well, to make it match ASGI and other frameworks that behave this way, also matching the behavior of older Sanic releases. It is noted that ISO-8859-1 can also be encoded back to original bytes, from which one can obtain UTF-8 or other decoding if needed.
As your bug report was apparently the first we've received on this, the issue is probably not affecting many at all, but at least this should make your implementation a bit easier. This is a breaking change, so no promises yet on when it will be released, even if everyone else is using ASCII headers and thus isn't affected by it.
Taking that back. We did some testing and browsers appear to be sending cookie
headers in UTF-8. This would break if headers were decoded as ISO-8859-1. Similarly, set-cookie
response header can only be in UTF-8, browsers do not accept ISO-8859-1.
It is a mess that may need a bit more time to sort out.
Is there an existing issue for this?
Describe the bug
headers are decoded here without specifying their encoding:
https://github.com/sanic-org/sanic/blob/ad4e526c775fc3ce950503d6476d9d344492b0dd/sanic/http/http1.py#L205
On my system (osx using python 3.10.8 installed via homebrew) this causes bytes that are valid characters in ISO-8859-1 but not in UTF-8 to be decoded as surrogate escape characters, e.g.
b"\x80"
becomes"\udf80"
instead of"\x80"
Code snippet
No response
Expected Behavior
headers encoded as ISO-8859-1 with no MIME type to be decoded correctly without using UTF-8 surrogate escape characters.
How do you run Sanic?
As a script (
app.run
orSanic.serve
)Operating System
linux
Sanic Version
22.9.1
Additional context
this used to work as expected in Sanic<=20.12.7