psf / requests

A simple, yet elegant, HTTP library.
https://requests.readthedocs.io/en/latest/
Apache License 2.0

UnicodeDecodeError after following a chain of redirects #6026

Open wodim opened 2 years ago

wodim commented 2 years ago

6006

Something confuses requests (or urllib3?) along the way

Actual Result

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 92: invalid continuation byte

Reproduction Steps

import requests
requests.get("https://www.lavozdegalicia.es/noticia/deportes/2021/12/13/psg-juve-united-nuevos-rivales-espa%C3%B1oles-champions/00031639396272418389372.htm")

System Information

$ python -m requests.help
{
  "chardet": {
    "version": "4.0.0"
  },
  "charset_normalizer": {
    "version": "2.0.9"
  },
  "cryptography": {
    "version": "36.0.0"
  },
  "idna": {
    "version": "3.3"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.8.10"
  },
  "platform": {
    "release": "4.4.0-17763-Microsoft",
    "system": "Linux"
  },
  "pyOpenSSL": {
    "openssl_version": "101010cf",
    "version": "21.0.0"
  },
  "requests": {
    "version": "2.26.0"
  },
  "system_ssl": {
    "version": "1010106f"
  },
  "urllib3": {
    "version": "1.26.7"
  },
  "using_charset_normalizer": false,
  "using_pyopenssl": true
}

@sigmavirus24 you have been too harsh on this one.

Traceback (most recent call last):
  File "", line 1, in <module>
  File "C:\Users\Ahmed\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\Ahmed\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\Ahmed\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Users\Ahmed\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\sessions.py", line 677, in send
    history = [resp for resp in gen]
  File "C:\Users\Ahmed\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\sessions.py", line 677, in <listcomp>
    history = [resp for resp in gen]
  File "C:\Users\Ahmed\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\sessions.py", line 150, in resolve_redirects
    url = self.get_redirect_target(resp)
  File "C:\Users\Ahmed\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\sessions.py", line 116, in get_redirect_target
    return to_native_string(location, 'utf8')
  File "C:\Users\Ahmed\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\_internal_utils.py", line 25, in to_native_string
    out = string.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 92: invalid continuation byte

The guilty part is `return to_native_string(location, 'utf8')`: it tries to decode the URL as UTF-8 when it should rather percent-encode ("URL encode") it. I am not an HTTP expert, but this exception should be handled more gracefully anyway.

The Location header sent by the remote server for the redirect is as follows:

b'http://www.lavozdegalicia.es/noticia/deportes/2021/12/13/psg-juve-united-nuevos-rivales-espa\xf1oles-champions/00031639396272418389372.htm'

Should the \xf1 instead be percent-encoded as %F1?

I can see that this behavior is already followed by Chrome.
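To make the failure concrete, here is a minimal sketch that reproduces the decode error on the Location bytes quoted above and shows what percent-encoding the stray byte would produce. It uses only the standard library's `urllib.parse.quote`; the variable names and the `safe=":/"` choice are illustrative assumptions, not what Requests or Chrome actually do internally.

```python
from urllib.parse import quote

# Location header bytes as quoted in the comment above.
# 0xF1 is "ñ" in Latin-1, but it is not a valid UTF-8 sequence here.
location = (
    b"http://www.lavozdegalicia.es/noticia/deportes/2021/12/13/"
    b"psg-juve-united-nuevos-rivales-espa\xf1oles-champions/"
    b"00031639396272418389372.htm"
)

decode_error = None
try:
    location.decode("utf8")
except UnicodeDecodeError as exc:
    # Same class of failure as in the traceback.
    decode_error = exc
    print(exc)

# Percent-encode the raw bytes instead of decoding them.
# safe=":/" keeps the scheme separator and path slashes intact (fine for this
# URL; a general-purpose fix would need a wider safe set).
encoded = quote(location, safe=":/")
print(encoded)
```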

nateprewitt commented 2 years ago

Hi @wodim,

To actually address your issue, the URI being returned by this redirect is invalid. All byte sequences that aren't listed as unreserved or sub-delim MUST be percent encoded in the path (ref). The website is doing this correctly for other paths, so it appears this is a defect in this specific resource.

For Requests, we may be able to be more tolerant of this behavior. Any solution we implement is purely guessing and trying to support broken behavior though, which we typically avoid. We could potentially try to percent-encode this when the UnicodeDecodeError is raised. Given this is the first report of the issue in at least the last 5 years, I'm not sure it's a common enough defect to special case.
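The "percent-encode when the UnicodeDecodeError is raised" idea could look roughly like the sketch below. This is not code from Requests: `tolerant_location` and `SAFE` are hypothetical names, and the fallback is exactly the kind of guess described above, since the server never told us which encoding it meant.

```python
from urllib.parse import quote

# Reserved and unreserved punctuation plus "%", so that escapes already
# present in the URL are left alone rather than double-encoded.
SAFE = "!#$%&'()*+,/:;=?@[]~"


def tolerant_location(raw: bytes) -> str:
    """Hypothetical helper: decode a Location header, repairing raw bytes."""
    try:
        # Normal path: the header is valid UTF-8 (or plain ASCII).
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Guesswork path: keep the bytes exactly as sent and percent-encode
        # everything that is not ASCII-safe, e.g. b"\xf1" becomes "%F1".
        return quote(raw, safe=SAFE)


print(tolerant_location(b"/noticia/espa\xf1oles-champions/?page=1"))
```

Passing the bytes through unchanged (rather than transcoding them to UTF-8 first) means the origin server receives the same byte sequence it sent, which is the least surprising option when the intended encoding is unknown.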

rhettlunn commented 1 year ago

I'm encountering this error as well for https://www.liveinternet.ru/tags/%EF%F0%E5%E7%E8%E4%E5%ED%F2%FB%2B%D1%D8%C0/ (and various other pages on that site)

I think a good fix would be to catch the malformed redirect URL (or any similar invalid header) and raise something that inherits from RequestException, rather than raising a UnicodeDecodeError.
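A rough sketch of that suggestion, assuming a new exception type: `InvalidRedirectLocation` and `decode_location_or_raise` are hypothetical names that do not exist in Requests. `RequestException` is the real base class from `requests.exceptions`; the fallback definition is only a stand-in so the sketch also runs where Requests is not installed.

```python
try:
    from requests.exceptions import RequestException
except ImportError:
    # Stand-in so the sketch runs without Requests installed.
    class RequestException(IOError):
        pass


class InvalidRedirectLocation(RequestException):
    """Hypothetical exception: the redirect Location header is not decodable."""


def decode_location_or_raise(raw: bytes) -> str:
    """Hypothetical helper: decode a Location header or raise a Requests-style error."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Chain the original error so the offending byte offset is not lost.
        raise InvalidRedirectLocation(
            f"Malformed redirect Location header: {raw!r}"
        ) from exc


try:
    decode_location_or_raise(b"/noticia/espa\xf1oles-champions/")
except RequestException as exc:
    # Callers that already catch RequestException now see this failure too.
    print(type(exc).__name__, exc)
```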

vkruoso commented 1 year ago

This is happening to me as well. One source site is now returning these malformed URLs during redirects, and there is no way to bypass it as far as I understand.
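One possible user-side workaround, offered as an untested-in-production sketch: the traceback shows the error coming from `Session.get_redirect_target`, so a `Session` subclass can override that method and repair the header itself. `TolerantRedirectSession` and `SAFE` are hypothetical names; the sketch assumes, as in the Requests versions discussed here, that `resp.headers["location"]` holds the header as a Latin-1 decoded string.

```python
from urllib.parse import quote

import requests

# Reserved and unreserved punctuation plus "%", so existing escapes survive.
SAFE = "!#$%&'()*+,/:;=?@[]~"


class TolerantRedirectSession(requests.Session):
    """Hypothetical subclass that repairs redirect targets that fail to decode."""

    def get_redirect_target(self, resp):
        try:
            return super().get_redirect_target(resp)
        except UnicodeDecodeError:
            # Recover the raw header bytes, then percent-encode the bytes
            # that are not ASCII-safe instead of decoding them as UTF-8.
            raw = resp.headers["location"].encode("latin1")
            return quote(raw, safe=SAFE)


# Usage: use this session in place of requests.get(...).
# session = TolerantRedirectSession()
# response = session.get("https://www.lavozdegalicia.es/...")
```

This only changes how the redirect target is decoded; whether the repaired URL resolves still depends on how the server interprets the percent-encoded bytes.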

SamStephens commented 1 week ago

I have another instance of this issue, which unfortunately I cannot share as it's private to my employer. Firefox and Chrome both handle the URL I've encountered cleanly, as per the behavior @wodim describes.

I see that [the fork niquests has a bugfix for this issue](https://github.com/jawah/niquests/pull/20); it could be worth borrowing their change. Alternatively, a more meaningful exception than UnicodeDecodeError would be useful.