Can't make request with non-ascii/utf8 query params

jerith commented 3 years ago

Starting from 20.4.1, I can no longer make requests with weirdly-encoded query parameters.

Here's a small example program that makes such a request:

import treq
from twisted.internet.task import react

URL = 'http://requestbin.net/r/1l2gc1o1'

def main(reactor):
    d = treq.get(URL, params={b'foo': 'Hello'.encode('utf-32')})
    return d.addCallback(print)

react(main)

When run with treq 20.3.0:

(tmp-a84c01591caa48e) [16:54|jerith@meklon] ~/.virtualenvs/tmp-a84c01591caa48e% python foo.py
<treq.response._Response 200 'text/html; charset=utf-8' unknown size>

And with treq 20.4.1:

(tmp-a84c01591caa48e) [16:56|jerith@meklon] ~/.virtualenvs/tmp-a84c01591caa48e% python foo.py
Traceback (most recent call last):
  File "foo.py", line 10, in <module>
    react(main)
  File "/Users/jerith/.virtualenvs/tmp-a84c01591caa48e/lib/python3.8/site-packages/twisted/internet/task.py", line 909, in react
    finished = main(_reactor, *argv)
  File "foo.py", line 7, in main
    d = treq.get(URL, params={b'foo': 'Hello'.encode('utf-32')})
  File "/Users/jerith/.virtualenvs/tmp-a84c01591caa48e/lib/python3.8/site-packages/treq/api.py", line 24, in get
    return _client(**kwargs).get(url, headers=headers, **kwargs)
  File "/Users/jerith/.virtualenvs/tmp-a84c01591caa48e/lib/python3.8/site-packages/treq/client.py", line 119, in get
    return self.request('GET', url, **kwargs)
  File "/Users/jerith/.virtualenvs/tmp-a84c01591caa48e/lib/python3.8/site-packages/treq/client.py", line 167, in request
    query=parsed_url.query + tuple(_coerced_query_params(params))
  File "/Users/jerith/.virtualenvs/tmp-a84c01591caa48e/lib/python3.8/site-packages/treq/client.py", line 351, in _coerced_query_params
    value = value.decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

https://github.com/twisted/treq/pull/265#discussion_r396922382 seems like relevant context here.

twm commented 3 years ago

Sorry for the slow response here.

I didn't realize that arbitrary bytes were ever supported as part of params. I don't think that's really desirable, but I didn't mean to break it. I will have to add test coverage for this.

It looks like there is a similar issue if an oddly-encoded querystring is passed as part of the url parameter. You'll get a UnicodeDecodeError when passing a URL like:

treq.get('http://requestbin.net/r/1l2gc1o1?foo=%FF%FE%00%00H%00%00%00e%00%00%00l%00%00%00l%00%00%00o%00%00%00')

It looks like we can avoid this by passing lazy=True when decoding the URL. Then Hyperlink won't try to decode the URL-encoded segments. treq should definitely do this. That doesn't help with params, though.

Unfortunately, https://github.com/twisted/treq/pull/265#discussion_r396922382 isn't the solution either. Decoding UTF-32 bytes to unicode using charmap (latin-1) and then encoding as UTF-8 gives mojibake rather than the same UTF-32 bytes.

I see two options:

Rewrite treq's URL-generation routines to use the low-level hyperlink.URL rather than hyperlink.DecodedURL.
Make hyperlink.DecodedURL accept bytes in addition to text.

Could you provide a little color on what you're trying to do to help with this?

jerith commented 3 years ago

I work with a wide variety of third-party messaging APIs, many of which are poorly designed and require all the message fields to be in url query parameters. This is fine when we're sending English-language text with no special characters (which fits into 7-bit ASCII), but can be a problem for languages like Swahili (often written in Arabic script) or Amharic when the API we're talking to expects UTF-16 or some weirdly-packed GSM encoding.

twisted / treq

Can't make request with non-ascii/utf8 query params #303