scrapy / w3lib

Python library of web-related functions
BSD 3-Clause "New" or "Revised" License
390 stars 104 forks source link

Issue in safe_url_encoding #188

Closed azierariel closed 2 years ago

azierariel commented 2 years ago

Hi,

I hope I find you well.

I found the following issue, not sure if it's a bug, but it didn't behave like this in the previous version (1.22).

Before on 1.22.0:

safe_url_string("foo://BAR/foobar") == "foo://BAR/foobar"

But now on 2.0.1:

safe_url_string("foo://BAR/foobar") == "foo://bar/foobar"

I understand that the domain is not case sensitive, but is this the desired behavior of the function?

Cheers,

wRAR commented 2 years ago

This is caused by #174, but I have no idea which line would do this.

azierariel commented 2 years ago

@wRAR I found why is this happening. The hostname is set to lower cases directly from urllib.urlsplit (here), when this function is applied in the code here.

I could make a workaround I guess, but it would be an awful hack imo. Do you think it worth to change the behavior? I pass parameters to download_handlers in scrapy and right now i had to freeze the previous version.

cc @Gallaecio @Laerte

wRAR commented 2 years ago

Hmm I was afraid urllib does that, but I only tested urlunsplit, not urlsplit.

To be honest I don't think we should work around this in the w3lib code but I'm open to suggestions.

kmike commented 2 years ago

@azierariel I'm curious how have you found this issue - what does this change break for you? Why do you need to freeze w3lib to a previous version?

azierariel commented 2 years ago

@kmike I pass parameters to the download_handlers in scrapy. For example: captcha://file_path.png, here if the file path has any upper case it will be transformed to lower and my code will fail.

wRAR commented 2 years ago

That just looks like an invalid URI. You should put paths into the path section, not the host one.

Gallaecio commented 2 years ago

@azierariel Does it work if you use an extra / after the protocol? (i.e. captcha:///file_path.png) That is what the file:// scheme does to omit the host part, which in the case of file:// implies localhost.

azierariel commented 2 years ago

@Gallaecio It works perfectly, thanks!

I think I can close the issue right?