Closed jerith closed 3 years ago
Sorry for the slow response here.
I didn't realize that arbitrary `bytes` were ever supported as part of `params`. I don't think that's really desirable, but I didn't mean to break it. I will have to add test coverage for this.
It looks like there is a similar issue if an oddly-encoded query string is passed as part of the `url` parameter. You'll get a `UnicodeDecodeError` when passing a URL like:
treq.get('http://requestbin.net/r/1l2gc1o1?foo=%FF%FE%00%00H%00%00%00e%00%00%00l%00%00%00l%00%00%00o%00%00%00')
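To see why this fails, it may help to look at the bytes behind that query string. The sketch below uses only the standard library (not treq or Hyperlink, so it illustrates the underlying decode failure rather than the exact code path): it percent-decodes the `foo` value from the URL above and tries to interpret the result as UTF-8.

```python
from urllib.parse import unquote_to_bytes

# The percent-encoded value of `foo` from the URL above.
encoded = "%FF%FE%00%00H%00%00%00e%00%00%00l%00%00%00l%00%00%00o%00%00%00"

raw = unquote_to_bytes(encoded)

# The bytes are "Hello" in UTF-32 (little-endian, with a BOM) ...
assert raw.decode("utf-32") == "Hello"

# ... but they are not valid UTF-8: the byte 0xFF never appears in UTF-8.
error = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
    print("cannot decode as UTF-8:", exc.reason)
```

Any layer that eagerly percent-decodes the query and assumes UTF-8 will hit the same exception.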
It looks like we can avoid this by passing `lazy=True` when decoding the URL. Then Hyperlink won't try to decode the URL-encoded segments. treq should definitely do this. That doesn't help with `params`, though.
Unfortunately, https://github.com/twisted/treq/pull/265#discussion_r396922382 isn't the solution either. Decoding UTF-32 `bytes` to Unicode using charmap (latin-1) and then encoding as UTF-8 gives mojibake rather than the same UTF-32 bytes.
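The mojibake is easy to reproduce with the standard library alone. This is a minimal sketch, not treq's code: it builds the UTF-32 bytes for "Hello", round-trips them through latin-1 and UTF-8 as described above, and compares the result with the original bytes.

```python
import codecs

# "Hello" as UTF-32 little-endian with a BOM.
original = codecs.BOM_UTF32_LE + "Hello".encode("utf-32-le")

# latin-1 maps every byte 0x00-0xFF to the code point of the same value,
# so this decode never fails ...
as_text = original.decode("latin-1")

# ... but UTF-8 encodes code points above 0x7F as two bytes each, so the
# 0xFF 0xFE at the start of the BOM comes out as 0xC3 0xBF 0xC3 0xBE.
round_tripped = as_text.encode("utf-8")

print(original[:4])
print(round_tripped[:6])
print(round_tripped == original)
```

Only re-encoding with latin-1 would give back the original bytes; the UTF-8 step is what changes them.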
I see two options:

1. Use `hyperlink.URL` rather than `hyperlink.DecodedURL`.
2. Have `hyperlink.DecodedURL` accept `bytes` in addition to `text`.

Could you provide a little color on what you're trying to do to help with this?
I work with a wide variety of third-party messaging APIs, many of which are poorly designed and require all the message fields to be in URL query parameters. This is fine when we're sending English-language text with no special characters (which fits into 7-bit ASCII), but can be a problem for languages like Swahili (often written in Arabic script) or Amharic when the API we're talking to expects UTF-16 or some weirdly-packed GSM encoding.
Starting from 20.4.1, I can no longer make requests with weirdly-encoded query parameters.
Here's a small example program that makes such a request:
When run with treq 20.3.0:
And with treq 20.4.1:
https://github.com/twisted/treq/pull/265#discussion_r396922382 seems like relevant context here.