```python
from scrapy.linkextractors import LinkExtractor

link_extractor = LinkExtractor()
link_extractor.extract_links(response)
```
where `response` is a `web_poet.page_inputs.http.HttpResponse` instance and not a `scrapy.http.Response`.
The full stacktrace would be:
```
  File "/usr/local/lib/python3.10/site-packages/scrapy/linkextractors/lxmlhtml.py", line 239, in extract_links
    base_url = get_base_url(response)
  File "/usr/local/lib/python3.10/site-packages/scrapy/utils/response.py", line 27, in get_base_url
    _baseurl_cache[response] = html.get_base_url(
  File "/usr/local/lib/python3.10/site-packages/w3lib/html.py", line 323, in get_base_url
    return safe_url_string(baseurl)
  File "/usr/local/lib/python3.10/site-packages/w3lib/url.py", line 141, in safe_url_string
    decoded = to_unicode(url, encoding=encoding, errors="percentencode")
  File "/usr/local/lib/python3.10/site-packages/w3lib/util.py", line 47, in to_unicode
    raise TypeError(
TypeError: to_unicode must receive bytes or str, got ResponseUrl
```
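The failing check is easy to reproduce in isolation. In this sketch, `to_unicode` is a paraphrase of the linked w3lib code (not the library itself), and `ResponseUrl` is a hypothetical stand-in for web_poet's URL wrapper, which is string-like but not a `str` subclass:

```python
def to_unicode(text, encoding=None, errors="strict"):
    # Paraphrase of the type check in w3lib.util.to_unicode (see linked source):
    # anything that is neither str nor bytes is rejected outright.
    if isinstance(text, str):
        return text
    if not isinstance(text, bytes):
        raise TypeError(
            f"to_unicode must receive bytes or str, got {type(text).__name__}"
        )
    return text.decode(encoding or "utf-8", errors)


class ResponseUrl:
    """Hypothetical stand-in for web_poet's URL wrapper: str-like, not a str."""

    def __init__(self, url):
        self._url = url

    def __str__(self):
        return self._url


try:
    to_unicode(ResponseUrl("https://example.com/"))
except TypeError as exc:
    print(exc)  # to_unicode must receive bytes or str, got ResponseUrl
```

Because `safe_url_string` passes the base URL through unchanged, any URL-like wrapper object reaching it triggers this `TypeError`.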
Other alternatives include adjusting the Scrapy code instead to cast with `str(response.url)` at every use. There was a previous discussion about this in one of the PRs.
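A sketch of what that cast buys: once the wrapper is converted with `str()`, a bytes-or-str check is satisfied. The `ResponseUrl` class here is a hypothetical minimal stand-in for web_poet's, not the real one:

```python
class ResponseUrl:
    # Hypothetical minimal URL wrapper, like web_poet's: str-like but not a str.
    def __init__(self, url):
        self._url = url

    def __str__(self):
        return self._url


url = ResponseUrl("https://example.com/page")

# The wrapper itself fails an isinstance(..., (bytes, str)) check ...
assert not isinstance(url, (bytes, str))

# ... but the explicit cast passes it, so downstream w3lib calls would work.
assert isinstance(str(url), str)
print(str(url))  # https://example.com/page
```

The downside of this approach is that every Scrapy call site touching `response.url` would need the cast, rather than fixing the acceptance check once in w3lib.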
I'm re-opening this for tracking since this part of `w3lib.util.to_unicode` breaks (https://github.com/scrapy/w3lib/blob/master/w3lib/util.py#L46-L49) when doing something like the snippet above.