scrapinghub / web-poet

Web scraping Page Objects core library
https://web-poet.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

`_Url` to inherit from `str` #187

Open BurnzZ opened 1 year ago

BurnzZ commented 1 year ago

This was discussed previously in one of the PRs.

I'm reopening this for tracking because the type check in `w3lib.util.to_unicode` fails: https://github.com/scrapy/w3lib/blob/master/w3lib/util.py#L46-L49

In particular, doing something like:

```python
from scrapy.linkextractors import LinkExtractor

link_extractor = LinkExtractor()
link_extractor.extract_links(response)
```

where `response` is a `web_poet.page_inputs.http.HttpResponse` instance and not a `scrapy.http.Response`.

The full stack trace is:

```
  File "/usr/local/lib/python3.10/site-packages/scrapy/linkextractors/lxmlhtml.py", line 239, in extract_links
    base_url = get_base_url(response)
  File "/usr/local/lib/python3.10/site-packages/scrapy/utils/response.py", line 27, in get_base_url
    _baseurl_cache[response] = html.get_base_url(
  File "/usr/local/lib/python3.10/site-packages/w3lib/html.py", line 323, in get_base_url
    return safe_url_string(baseurl)
  File "/usr/local/lib/python3.10/site-packages/w3lib/url.py", line 141, in safe_url_string
    decoded = to_unicode(url, encoding=encoding, errors="percentencode")
  File "/usr/local/lib/python3.10/site-packages/w3lib/util.py", line 47, in to_unicode
    raise TypeError(
TypeError: to_unicode must receive bytes or str, got ResponseUrl
```
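
To illustrate why the check fails, and why inheriting from `str` would avoid it, here is a minimal sketch. It makes several assumptions: `to_unicode_check` is a simplified stand-in for the type check in `w3lib.util.to_unicode`, not the real function, and `WrapperUrl` and `StrUrl` are hypothetical classes rather than web-poet's actual `_Url` implementation.

```python
def to_unicode_check(text):
    """Simplified stand-in for the type check in w3lib.util.to_unicode."""
    if isinstance(text, str):
        return text
    if isinstance(text, bytes):
        return text.decode("utf-8")
    raise TypeError(
        f"to_unicode must receive bytes or str, got {type(text).__name__}"
    )


class WrapperUrl:
    """Hypothetical URL class that wraps a string without subclassing str."""

    def __init__(self, url):
        self._url = str(url)

    def __str__(self):
        return self._url


class StrUrl(str):
    """Hypothetical URL class that inherits from str."""


url = "https://example.com/page"

# The wrapper is neither bytes nor str, so the isinstance check rejects it.
try:
    to_unicode_check(WrapperUrl(url))
except TypeError as exc:
    wrapper_error = str(exc)

# A str subclass passes the isinstance check unchanged.
str_result = to_unicode_check(StrUrl(url))

print(wrapper_error)
print(str_result)
```

Under these assumptions, any code that guards on `isinstance(url, str)` would accept the subclass without changes on the caller's side.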

An alternative would be to adjust the Scrapy code to cast `str(response.url)` at every use.
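
The casting alternative could look roughly like the sketch below. Everything here is hypothetical: `WrapperUrl` stands in for a URL class that does not subclass `str`, `requires_str` stands in for a library function that accepts only `str`, and this `get_base_url` is an illustrative call site, not Scrapy's actual implementation.

```python
from types import SimpleNamespace


class WrapperUrl:
    """Hypothetical URL class that wraps a string without subclassing str."""

    def __init__(self, url):
        self._url = url

    def __str__(self):
        return self._url


def requires_str(url):
    """Stand-in for a library function that accepts only str."""
    if not isinstance(url, str):
        raise TypeError(f"expected str, got {type(url).__name__}")
    return url


def get_base_url(response):
    """Hypothetical call site: cast response.url explicitly before use."""
    return requires_str(str(response.url))


# A minimal response-like object carrying a non-str URL.
response = SimpleNamespace(url=WrapperUrl("https://example.com/page"))
base_url = get_base_url(response)
print(base_url)
```

The cast works, but it has to be repeated at every call site that reads `response.url`.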