[BUG] URLs being url decoded before request is sent

duncanlutz commented 2 months ago

Describe the bug

We have an endpoint that uses IDs containing URL-encoded special characters. After updating to curl_cffi 0.7.2, these requests began failing. After some investigation, we found that the package had started URL-decoding the URL before sending the request. In our case the special character is %2f (an encoded /); decoding it malforms the URL and causes the request to fail.

To Reproduce

from curl_cffi import requests

# used my own website because I know it doesn't redirect on 404
url = 'https://duncanlutz.dev/example/%2f%2f%2f'

session = requests.Session()
response = session.get(url)

# This assertion currently fails on 0.7.2
assert response.url == url

I've also put a repo up with the example: https://github.com/duncanlutz/curl_cffi_issue

Expected behavior

In previous versions, the URL was not decoded before the request was made. We would expect curl_cffi either to leave the URL undecoded or to provide a way to opt out of decoding.

Versions

OS: macOS Sonoma 14.1.1
curl_cffi version 0.7.2

pip freeze dump:

certifi==2024.8.30
cffi==1.17.1
curl_cffi==0.7.2
pycparser==2.22
typing_extensions==4.12.2

Additional context

Which session are you using? async or sync? We are using sync.
If using async session, which loop implementation are you using?

coletdjnz commented 2 months ago

+1, our test suite has also picked this up.

Might be due to https://github.com/lexiforest/curl_cffi/commit/9c13b830f378687900ddbb953ae8edb9998b3b1d

As a side note, I don't think you should be url-decoding and then re-encoding the query component, as it may not produce the same result. https://datatracker.ietf.org/doc/html/rfc3986#section-2.4
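To illustrate the round-trip problem, here is a minimal sketch using only the standard library. The helper name `decode_then_encode` and the sample query string are made up for illustration, and this is not necessarily how curl_cffi processes URLs internally; it only shows that unquoting and re-quoting a component can lose the distinction between an escaped delimiter and a literal one.

```python
from urllib.parse import quote, unquote


def decode_then_encode(component: str, safe: str = "/") -> str:
    """Unquote a URL component, then quote it again (the lossy pattern)."""
    return quote(unquote(component), safe=safe)


# Escaped slashes in the path come back as literal path separators.
path = "/example/%2f%2f%2f"
print(decode_then_encode(path))

# An escaped "&" inside a value becomes a real parameter separator.
query = "next=a%26b&page=1"
print(decode_then_encode(query, safe="=&"))
```

In both cases the result is a different URL from the one the caller supplied, because the information about which delimiters were escaped is gone after the first step.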

Here's what urllib3 does, for example: https://github.com/urllib3/urllib3/blob/main/src/urllib3/util/url.py#L227

lexiforest commented 2 months ago

Thanks, my bad. I should have really thought this through.

vevv commented 2 months ago

I can confirm this causes issues on real-life sites. It's very annoying to debug too; I almost started pulling my hair out before I found this issue.

Kartatz commented 2 months ago

@lexiforest.

Since curl_cffi aims to be API-compatible with the requests library, may I suggest using requests' requote_uri()? It is the standard way the library deals with URL-encoded strings.

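from requests.utils import requote_uri
# Existing percent-escapes are preserved; only unsafe characters (here, spaces) get quoted.
assert requote_uri("https://duncanlutz.dev/example/%2f%2f%2f") == "https://duncanlutz.dev/example/%2f%2f%2f"
assert requote_uri("https://duncanlutz.dev/e x a m p l e") == "https://duncanlutz.dev/e%20x%20a%20m%20p%20l%20e"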

It covers #333 while also fixing this current issue.

punksnotbread commented 2 months ago

Hi, we are seeing the reverse problem as well: characters are being encoded where they should not be (due to this change), which breaks some sites:

from curl_cffi.requests import request

url = 'https://example.com/imaginary-pagination:7'

print(url)
print(request("GET", url).request.url)

https://example.com/imaginary-pagination:7
https://example.com/imaginary-pagination%3A7

It would be great to have an option to control the encoding of the URL for a request.
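For reference, the same kind of over-encoding can be reproduced with the standard library alone. This is only a sketch of how a colon can end up escaped when a quoting function is called with a narrow safe set; it is an assumption that something similar happens inside curl_cffi, not a confirmed description of its code.

```python
from urllib.parse import quote

path = "/imaginary-pagination:7"

# quote() treats only "/" as safe by default, so ":" is percent-encoded.
default_quoted = quote(path)

# Adding ":" to the safe set leaves the path exactly as written.
colon_safe = quote(path, safe="/:")

print(default_quoted)
print(colon_safe)
```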

lexiforest commented 1 month ago

Hi, folks. Please check out #405 and let me know if it fixes your problems.

About the urllib3 and requests solution, I did experiment with them. However, I feel that we should give users more control over whether some characters, like the :, should be quoted or not.

lexiforest commented 1 month ago

Should be fixed in v0.7.3.

vevv commented 1 month ago

This is still an issue on 0.7.3 (particularly + and =). You should just stop modifying URLs! It's always going to lead to trouble, and having to manually test and change quote values for every request is not viable.

lexiforest commented 1 month ago

> This is still an issue (particularly + and =). You should just stop modifying URLs! It's always going to lead to trouble, and having to manually test and change quote values for every request is not viable.

Hi, could you please add a few examples? Some characters DO need to be quoted, like spaces, otherwise libcurl will throw an error. As for + and =, I guess they are being mistakenly unquoted from %3D to =, right?
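One way to reconcile the two requirements (spaces must be quoted, existing escapes such as %3D must stay untouched) is to quote only characters that are not already legal, while treating "%" as safe. The sketch below uses the standard library; `quote_unsafe_only` and `SAFE_CHARS` are hypothetical names, and this is not curl_cffi's implementation. A known limitation of this approach is that a stray literal "%" that is not part of an escape is passed through unchanged.

```python
from urllib.parse import quote

# Hypothetical safe set: reserved characters plus "%", so that
# percent-escapes already present in the URL are never touched.
SAFE_CHARS = "!#$%&'()*+,/:;=?@[]~"


def quote_unsafe_only(url: str) -> str:
    """Percent-encode only characters that cannot appear literally in a URL."""
    return quote(url, safe=SAFE_CHARS)


# Spaces are quoted, so libcurl would receive a valid URL.
print(quote_unsafe_only("https://duncanlutz.dev/e x a m p l e"))

# Existing escapes and literal "+" / "=" are left exactly as supplied.
print(quote_unsafe_only("https://example.com/path?token=a%2Bb%3D&x=1+2"))
```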

vevv commented 1 month ago

Yes, it is a URL being unquoted.

Here is an example URL:

Input URL: https://example.com/path?token=example%7C2024-10-20T10%3A00%3A00Z%7ZYJkEtJQoGNQ3lyQRSnYbWLXUCUNVPQrBDW3VDEBWd1CIrShUzWBQTvzwXEtLZwy8uAxIM%2B3ke%2BQW%2F%2FkyJzGGogANuv5rw%2FXXp%2B5hZz2RW28%3D%7C8bd02e990e29ec76b54cec894e1470b4157fc1ed
curl_cffi: https://example.com/path?token=example%7C2024-10-20T10:00:00Z%7ZYJkEtJQoGNQ3lyQRSnYbWLXUCUNVPQrBDW3VDEBWd1CIrShUzWBQTvzwXEtLZwy8uAxIM+3ke+QW//kyJzGGogANuv5rw/XXp+5hZz2RW28=%7C8bd02e990e29ec76b54cec894e1470b4157fc1ed
requests: same as input
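When comparing long URLs like these by eye, it can help to list which percent-escapes disappeared between the input and what was actually sent. The helper below is a hypothetical debugging aid (the name `lost_escapes` and the shortened sample URLs are invented for this sketch), not part of curl_cffi or requests.

```python
import re
from collections import Counter

_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def lost_escapes(original: str, sent: str) -> dict:
    """Count percent-escapes present in `original` but missing from `sent`."""
    before = Counter(m.upper() for m in _ESCAPE.findall(original))
    after = Counter(m.upper() for m in _ESCAPE.findall(sent))
    # Counter subtraction keeps only escapes that occur less often in `sent`.
    return dict(before - after)


# Shortened, made-up URLs in the same shape as the example above.
original = "https://example.com/path?token=example%7C10%3A00%3A00Z%2Bsig%3D"
sent = "https://example.com/path?token=example%7C10:00:00Z+sig="

print(lost_escapes(original, sent))
```

An empty result means every escape in the input survived; any entries point at the characters that were unquoted along the way.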

lexiforest commented 1 month ago

I see, this is not what I would expect either. Sorry for the mess; it will be fixed in the next minor version.

vevv commented 1 month ago

The same happens with encoded commas as well, and probably with all encoded characters.

lexiforest / curl_cffi

[BUG] URLs being url decoded before request is sent #394