lexiforest / curl_cffi

Python binding for the curl-impersonate fork via cffi. An HTTP client that can impersonate browser TLS/JA3/HTTP2 fingerprints.
https://curl-cffi.readthedocs.io/
MIT License
2.51k stars, 266 forks

[BUG] URLs being url decoded before request is sent #394

Open duncanlutz opened 2 months ago

duncanlutz commented 2 months ago

Describe the bug

We have an endpoint whose IDs contain URL-encoded special characters. After updating to curl_cffi 0.7.2, requests to it began failing. On investigation, we found that the package had started URL-decoding the URL before sending the request. In our case the encoded character is %2f (/); decoding it to a literal / changes the path, so the URL is malformed and the request fails.

To Reproduce

from curl_cffi import requests

# used my own website because I know it doesn't redirect on 404
url = 'https://duncanlutz.dev/example/%2f%2f%2f'

session = requests.Session()
response = session.get(url)

# This assertion currently fails
assert response.url == url

I've also put a repo up with the example: https://github.com/duncanlutz/curl_cffi_issue

Expected behavior

In previous versions, the URL was not decoded before the request was made. We would expect curl_cffi either to leave the URL undecoded or to provide a way to opt out of decoding.

Versions

Additional context

coletdjnz commented 2 months ago

+1, our test suite has also picked this up.

Might be due to https://github.com/lexiforest/curl_cffi/commit/9c13b830f378687900ddbb953ae8edb9998b3b1d


As a side note, I don't think you should be url-decoding and then re-encoding the query component, as it may not produce the same result. https://datatracker.ietf.org/doc/html/rfc3986#section-2.4

Here's what urllib3 does, for example: https://github.com/urllib3/urllib3/blob/main/src/urllib3/util/url.py#L227
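To make the round-trip concern concrete, here is a minimal sketch using only Python's standard library. It is not curl_cffi's code; `decode_then_encode` is a hypothetical helper that stands in for any "unquote, then quote again" normalisation step, and the query string is an invented example.

```python
from urllib.parse import parse_qsl, quote, unquote


def decode_then_encode(component: str, safe: str = "/") -> str:
    """Hypothetical helper: unquote a URL component, then quote it again."""
    return quote(unquote(component), safe=safe)


# Path from the report: the encoded slashes become real path separators.
path = "/example/%2f%2f%2f"
path_after = decode_then_encode(path)

# Invented query: one parameter whose value contains an encoded "&".
query = "name=a%26b"
query_after = decode_then_encode(query, safe="=&")

print(path, "->", path_after)
print(parse_qsl(query), "->", parse_qsl(query_after, keep_blank_values=True))
```

In both cases the re-encoded string is a valid URL component, but it no longer means the same thing as the original, which is the point RFC 3986 section 2.4 makes.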

lexiforest commented 2 months ago

Thanks, my bad. I should really think this through.

vevv commented 2 months ago

I can confirm this causes issues on real-life sites. It is very annoying to debug, too; I almost started pulling my hair out before I found this issue.

Kartatz commented 2 months ago

@lexiforest.

Since curl_cffi aims to be API-compatible with the requests library, may I suggest using requests' requote_uri()? It is the standard way the library deals with URL-encoded strings.

from requests.utils import requote_uri

assert requote_uri("https://duncanlutz.dev/example/%2f%2f%2f") == "https://duncanlutz.dev/example/%2f%2f%2f"
assert requote_uri("https://duncanlutz.dev/e x a m p l e") == "https://duncanlutz.dev/e%20x%20a%20m%20p%20l%20e"

It covers #333 while also fixing this current issue.
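For readers who want to see the idea without installing requests, the following is a rough standard-library sketch loosely modelled on that approach: unquote only escapes that encode unreserved characters, then quote whatever is still illegal while leaving reserved characters and "%" alone. `requote_sketch` and `_unquote_unreserved` are hypothetical names, and this is not the code that requests or curl_cffi ships.

```python
import re
import string
from urllib.parse import quote

# RFC 3986 unreserved characters: safe to decode without changing meaning.
UNRESERVED = frozenset(string.ascii_letters + string.digits + "-._~")
_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")

# Reserved characters plus "%" itself, so existing escapes are never re-quoted.
SAFE_WITH_PERCENT = "!#$%&'()*+,/:;=?@[]~"


def _unquote_unreserved(url: str) -> str:
    """Decode only the escapes that stand for unreserved characters."""
    def repl(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else match.group(0)

    return _ESCAPE.sub(repl, url)


def requote_sketch(url: str) -> str:
    """Quote illegal characters (e.g. spaces) and keep other escapes as written.

    Simplification: a stray "%" that is not followed by two hex digits is
    left untouched rather than being escaped.
    """
    return quote(_unquote_unreserved(url), safe=SAFE_WITH_PERCENT)


print(requote_sketch("https://duncanlutz.dev/example/%2f%2f%2f"))
print(requote_sketch("https://duncanlutz.dev/e x a m p l e"))
```

Under these assumptions the encoded slashes survive unchanged, spaces become %20, and a literal colon in the path is left alone.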

punksnotbread commented 2 months ago

Hi, we are seeing the opposite problem from the same change: characters are being encoded where they should not be, which breaks some sites:

from curl_cffi.requests import request

url = 'https://example.com/imaginary-pagination:7'

print(url)
print(request("GET", url).request.url)

Output:

https://example.com/imaginary-pagination:7
https://example.com/imaginary-pagination%3A7

It would be great to have an option to control how the URL is encoded for a request.

lexiforest commented 1 month ago

Hi, folks. Please check out #405 and let me know if it fixes your problems.

About the urllib3 and requests solutions: I did experiment with them. However, I feel that we should give users more control over whether some characters, like the :, should be quoted or not.
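As an illustration of what per-character control over quoting can look like, here is a small sketch built on `urllib.parse.quote` and its `safe` argument. `quote_path` and its `keep` parameter are hypothetical names chosen for this example; they are not the option added in #405.

```python
from urllib.parse import quote


def quote_path(path: str, keep: str = "") -> str:
    """Quote a URL path, leaving "/" and any characters in `keep` unquoted.

    "%" is not in the safe set here, so this sketch expects a path that has
    not been percent-encoded yet.
    """
    return quote(path, safe="/" + keep)


path = "/imaginary-pagination:7"

print(quote_path(path))            # colon is percent-encoded
print(quote_path(path, keep=":"))  # colon is left as written
```

With the default arguments the colon is encoded as %3A, which matches the behaviour reported above; passing `keep=":"` leaves the path exactly as the caller wrote it.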

lexiforest commented 1 month ago

Should be fixed in v0.7.3.

vevv commented 1 month ago

This is still an issue on 0.7.3 (particularly + and =). You should just stop modifying URLs! It's always going to lead to trouble, and having to manually test and change quote values for every request is not viable.

lexiforest commented 1 month ago

> This is still an issue (particularly + and =). You should just stop modifying URLs! It's always going to lead to trouble, and having to manually test and change quote values for every request is not viable.

Hi, could you please add a few examples? Some characters DO need to be quoted, like spaces, otherwise libcurl will throw an error. As for + and =, I guess they are being mistakenly unquoted from %3D to =, right?
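A short sketch of why unquoting + and = in a query string matters. The query string below is invented for illustration, and the server is assumed to parse queries the way Python's `urllib.parse.parse_qsl` does, where a literal + means a space.

```python
from urllib.parse import parse_qsl, unquote

# Invented example: a signature value containing an encoded "+" and "=".
as_written = "sig=a%2Bb%3D%3D"

# What goes on the wire if the client unquotes the URL before sending it.
after_unquote = unquote(as_written)

print(as_written, "->", parse_qsl(as_written))
print(after_unquote, "->", parse_qsl(after_unquote))
```

Under that assumption the server sees "a+b==" for the URL as written but "a b==" for the unquoted one, so a signature or token check would fail even though both URLs look nearly the same.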

vevv commented 1 month ago

Yes, it is a URL being unquoted.

Here is an example URL:

lexiforest commented 1 month ago

I see, this is not what I would expect either. Sorry for the mess; it will be fixed in the next minor version.

vevv commented 1 month ago

The same happens with encoded commas as well, and probably with all encoded characters.