scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

LinkExtractor changing case of URL (but didn't used to) #6329

Open mohmad-null opened 2 weeks ago

mohmad-null commented 2 weeks ago

Regression? I have an HTML file that contains a link like:

<a target="_blank" href="http://MYURL/SomePath/services/words/MorePath?abc">Words</a>

I'm extracting with code that looks like this:

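    from scrapy.linkextractors import LinkExtractor

    # `xpath` and `response` come from the surrounding spider code
    link_extractor = LinkExtractor(restrict_xpaths=xpath)

    tmp_links = link_extractor.extract_links(response)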

But my URL comes back as: http://myurl/SomePath/services/words/MorePath?abc

Note that MYURL has become myurl. I've just upgraded from Scrapy 1.7.x to 2.11.1. In 1.7 and earlier it would come out as MYURL. There's nothing in the LinkExtractor docs about changing case, nor can I see anything in the changelogs (though I may have missed it).

This may or may not be intentional behaviour, but if it is intended, the docs should probably be updated to say that the case will change.

kumar-sanchay commented 2 weeks ago

On it. There may be URLs, or parts of URLs, where case doesn't matter, but identifying these may not be easy. Users should always consider that URLs are case-sensitive.
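To make "parts of URLs where case doesn't matter" concrete: under RFC 3986 the scheme and host are case-insensitive, while the path and query are not. The following is a minimal sketch using only the standard library; the helper name `same_url_ignoring_host_case` is hypothetical (not part of Scrapy or w3lib), and it ignores userinfo and fragments for simplicity.

```python
from urllib.parse import urlsplit


def same_url_ignoring_host_case(a: str, b: str) -> bool:
    """Compare two URLs, ignoring case only in the scheme and host.

    Hypothetical helper for illustration; userinfo and fragments are ignored.
    """
    pa, pb = urlsplit(a), urlsplit(b)
    return (
        pa.scheme == pb.scheme  # urlsplit lowercases the scheme
        and (pa.hostname or "") == (pb.hostname or "")  # .hostname is lowercased
        and pa.port == pb.port
        and pa.path == pb.path  # compared case-sensitively
        and pa.query == pb.query  # compared case-sensitively
    )


original = "http://MYURL/SomePath/services/words/MorePath?abc"
extracted = "http://myurl/SomePath/services/words/MorePath?abc"

print(same_url_ignoring_host_case(original, extracted))  # True
print(same_url_ignoring_host_case(original, extracted.replace("SomePath", "somepath")))  # False
```

A comparison like this would let code that matches extracted links against known URLs tolerate a lowercased host, without treating the rest of the URL as case-insensitive.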

kumar-sanchay commented 2 weeks ago

After investigating, I found that the case above is due to the use of canonicalize_url. This is an important function that helps with finding duplicates, etc. We can definitely document this so that it helps users.

Gallaecio commented 1 week ago

There is a canonicalize parameter that is False by default, so I’m not so sure this is about canonicalize_url. Maybe it is lxml’s behavior? It may be worth looking into, and adding a note about it to the reference docs for the canonicalize parameter.
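As a debugging aid, here is a small standard-library sketch showing that a host can end up lowercased without any explicit canonicalization step: `urllib.parse.urlsplit` preserves case in `netloc` but returns a lowercased `hostname`. Whether this is what happens inside Scrapy, w3lib, or lxml is not established in this thread; it is only one plausible place to look.

```python
from urllib.parse import urlsplit

# URL from the original report
parts = urlsplit("http://MYURL/SomePath/services/words/MorePath?abc")

print(parts.netloc)    # MYURL  (case preserved)
print(parts.hostname)  # myurl  (lowercased by the accessor)
print(parts.path)      # /SomePath/services/words/MorePath  (case preserved)
```

Any code that rebuilds a URL from `hostname` rather than `netloc` would therefore produce the lowercase form seen in the report, while leaving the path and query untouched.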