Open mohmad-null opened 2 weeks ago
On it. There may be URLs, or parts of URLs, where case doesn't matter, but identifying these may not be easy. Users should always consider that URLs are case-sensitive.
After investigating, I found that the case above is caused by the use of canonicalize_url. This is an important function that helps with finding duplicates, among other things. We can definitely document this so that users know about it.
There is a canonicalize parameter that is False by default, so I'm not so sure this is about canonicalize_url. Maybe it is lxml's behavior? It may be worth looking into, and adding a note about it to the reference docs for the canonicalize parameter.
Regression? I have an HTML file that contains a link like:
<a target="_blank" href="http://MYURL/SomePath/services/words/MorePath?abc">Words</a>
I'm extracting with code that looks like this:
But my URL comes back as:
http://myurl/SomePath/services/words/MorePath?abc
Note that MYURL has become myurl. I've just upgraded from Scrapy 1.7.x to 2.11.1. In 1.7 and earlier it would come out as MYURL.
There's nothing in the LinkExtractor docs about changing case, nor can I see anything in the changelogs (but I may be missing it). This may or may not be intentional behaviour, but if it is intended, the docs should probably be updated to say that the case will change.
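For anyone comparing old and new output: only the host part changed case here, and hosts are case-insensitive under RFC 3986, while the path and query are not. The snippet below is a minimal sketch using only the Python standard library, not Scrapy's or w3lib's actual code path. It shows which parts of the example URL keep their case, and a hypothetical helper (same_url_ignoring_host_case, my own name, not a Scrapy API) for comparing URLs extracted by the two versions.

```python
from urllib.parse import urlsplit

url = "http://MYURL/SomePath/services/words/MorePath?abc"
parts = urlsplit(url)

raw_netloc = parts.netloc   # host as written in the HTML
host = parts.hostname       # urllib always returns the hostname lowercased
path = parts.path           # path case is preserved
query = parts.query         # query case is preserved


def same_url_ignoring_host_case(a: str, b: str) -> bool:
    """Hypothetical helper: compare two URLs, ignoring case only in the host."""
    pa, pb = urlsplit(a), urlsplit(b)
    return (
        pa.scheme == pb.scheme          # urlsplit already lowercases the scheme
        and pa.hostname == pb.hostname  # lowercased, so host case is ignored
        and pa.port == pb.port
        and pa.path == pb.path          # still case-sensitive
        and pa.query == pb.query        # still case-sensitive
    )


old_style = "http://MYURL/SomePath/services/words/MorePath?abc"
new_style = "http://myurl/SomePath/services/words/MorePath?abc"
matches = same_url_ignoring_host_case(old_style, new_style)
```

With this comparison the 1.7.x and 2.11.1 results are treated as the same URL, whereas a difference in the path (for example SomePath versus somepath) would still count as a different URL.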