Open BurnzZ opened 4 years ago
web-poet works independently of Scrapy, but you can use scrapy-poet to integrate web-poet into Scrapy projects.
scrapy-poet offers an injector middleware that builds and injects dependencies. It also ships with some "Scrapy-provided" classes that are injected automatically, without the need for a custom provider.
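To make the injection idea concrete, here is a minimal stdlib-only sketch of annotation-based dependency injection. The class names and the resolver below are hypothetical simplifications for illustration, not scrapy-poet's actual implementation:

```python
import inspect


# Hypothetical stand-ins: scrapy-poet's real injector resolves dependencies
# through providers, but the core mechanism is annotation-based lookup.
class FakeResponse:
    def __init__(self, url: str):
        self.url = url


class SomePage:
    def __init__(self, response: FakeResponse):
        self.response = response


def build_injected(cls, available):
    """Instantiate cls, resolving each __init__ annotation from `available`."""
    kwargs = {}
    for name, param in inspect.signature(cls.__init__).parameters.items():
        if name == 'self':
            continue
        # Look up the dependency instance by its annotated type.
        kwargs[name] = available[param.annotation]
    return cls(**kwargs)


page = build_injected(SomePage, {FakeResponse: FakeResponse('http://example.com')})
print(page.response.url)  # -> http://example.com
```

The real middleware does the equivalent resolution for page object inputs before calling your spider callback.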
You should then be able to create a Page Object that looks like this:
from scrapy.http import Response
from scrapy.linkextractors import LinkExtractor
from web_poet.pages import ItemWebPage


class SomePage(ItemWebPage):
    scrapy_response = Response

    def to_item(self):
        return {
            'links': LinkExtractor(
                allow=r'some-website.com/product/tt\d+/$',
                process_value=some_processor,
                restrict_xpaths='//div[@id="products"]//span',
            ).extract_links(self.scrapy_response)  # expects a Scrapy Response instance
        }
Note that currently, this approach will make it mandatory for the request to be made through Scrapy downloader middleware. So for example, if you're using AutoExtract providers that ignore Scrapy requests, you'll end up making an additional request in order to build this Scrapy response. The solution in this case would be to mock a Scrapy response from the AutoExtract response's HTML.
Hi @victor-torres , thanks for taking the time on explaining this feature!
This actually opens up a lot of opportunities for us, like having a neat way to get the response.meta from Scrapy.
However, I've run into a peculiar bug with this approach. I've created a new issue in #11 to discuss it further.
As I've said in #11, the example I gave you is wrong in two ways:
First, it should be scrapy_response: Response instead of scrapy_response = Response (this is why you're getting the Response class itself instead of a value). Second, you should create another type (ScrapyResponse, for example) that's initialized with a Response, and then create a provider that uses Response to build this ScrapyResponse class. Does that make sense? Confusing, right?
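The first fix (bare annotation vs. assignment) can be demonstrated in plain Python, with no Scrapy involved; the class names below are made up for illustration:

```python
class WithAssignment:
    attr = int  # assignment: the attribute's value is the int class itself


class WithAnnotation:
    attr: int   # bare annotation: recorded in __annotations__ only,
                # no class attribute is actually created


print(WithAssignment.attr)                        # <class 'int'>
print('attr' in WithAnnotation.__annotations__)   # True
print(hasattr(WithAnnotation, 'attr'))            # False
```

Libraries like attrs (and scrapy-poet's injector) read `__annotations__` to discover declared dependencies, which is why the annotation form is the one that works.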
It should be something like this:
from scrapy.http import Response
from scrapy_poet.page_input_providers import PageObjectInputProvider, provides


class ScrapyResponse:
    def __init__(self, response):
        self.response = response


@provides(ScrapyResponse)
class ScrapyResponseProvider(PageObjectInputProvider):
    def __init__(self, response: Response):
        self.response = response

    def __call__(self):
        return ScrapyResponse(self.response)
class QuotesListingPage(ItemWebPage):
    scrapy_response: ScrapyResponse

    def to_item(self):
        return {
            'site_name': self.css('h1 a::text').get(),
            'author_links': LinkExtractor(
                restrict_css='.author + a'
            ).extract_links(self.scrapy_response.response),
        }
Here come the problems:
I've just tested this code and it's not working 😞 I'm not sure what's happening because there's a test case specifically designed to make sure Scrapy dependencies are injected into Providers.
In any case, I'd like to leave the question here for @kmike, @ivanprado, @ejulio and others: I know we've discussed this before, but maybe we should rethink making Scrapy dependencies available to Page Objects as well. What do you think?
We've discussed this issue today and @ivanprado was able to spot the problem with my snippet: it's missing an __init__ method or an attr.s decoration, like ItemWebPage itself has. The example below should work:
import attr

from scrapy.http import Response
from scrapy.linkextractors import LinkExtractor
from scrapy_poet.page_input_providers import PageObjectInputProvider, provides
from web_poet.pages import ItemWebPage


class ScrapyResponse:
    def __init__(self, response):
        self.response = response


@provides(ScrapyResponse)
class ScrapyResponseProvider(PageObjectInputProvider):
    def __init__(self, response: Response):
        self.response = response

    def __call__(self):
        return ScrapyResponse(self.response)


@attr.s(auto_attribs=True)
class QuotesListingPage(ItemWebPage):
    scrapy_response: ScrapyResponse

    def to_item(self):
        return {
            'site_name': self.css('h1 a::text').get(),
            'author_links': LinkExtractor(
                restrict_css='.author + a'
            ).extract_links(self.scrapy_response.response),
        }
Although it's possible to receive Scrapy dependencies when initializing providers, to check some runtime conditions or settings, it doesn't look right to create a provider specifically for Scrapy responses.
It looks like we should contribute a pull request to Scrapy to make LinkExtractor work with generic HTML data. @ivanprado is currently using one of its private methods to achieve this behavior; we should aim at providing the same feature through a public interface. This way, you could just pass ResponseData to LinkExtractor and it would work as expected, @BurnzZ.
One point on Page Objects is that they should be portable. That is, they shouldn't depend on ScrapyResponse but on ResponseData instead.
There is a workaround: the link extractor can be used in the following way:

from scrapy.linkextractors import LinkExtractor
from w3lib.html import get_base_url
from web_poet.pages import ItemWebPage


class QuotesListingPage(ItemWebPage):
    def to_item(self):
        link_extractor = LinkExtractor()
        base_url = get_base_url(self.html, self.url)
        author_links = [
            link
            for sel in self.css('.author + a')
            for link in link_extractor._extract_links(sel, self.url, 'utf-8', base_url)
        ]
        return {
            'site_name': self.css('h1 a::text').get(),
            'author_links': author_links,
        }
The problem is that it uses the private method _extract_links, which is not nice. You also need to iterate manually over the selectors.
Ideas about what can be done to improve the situation:

1. Modify the LinkExtractor in Scrapy to be friendlier to cases where you don't have a Scrapy Response, only the required components (html, url, and a selector or a list of selectors). Something like making _extract_links public, but probably also changing its signature a bit to make it friendlier.
2. Create a scrapy-poet version of the LinkExtractor that accepts ResponseData. It could be something like:
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.utils.python import unique as unique_list
from w3lib.html import get_base_url
from web_poet.pages import ResponseShortcutsMixin


class LinkExtractor(LxmlLinkExtractor):
    def extract_links(self, response: ResponseShortcutsMixin):
        base_url = get_base_url(response.html, response.url)
        if self.restrict_xpaths:
            docs = [
                subdoc
                for x in self.restrict_xpaths
                for subdoc in response.xpath(x)
            ]
        else:
            docs = [response.selector]
        all_links = []
        for doc in docs:
            links = self._extract_links(doc, response.url, 'utf-8', base_url)
            all_links.extend(self._process_links(links))
        return unique_list(all_links)
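The first idea, a link-extraction signature that needs only raw HTML and a base URL rather than a Scrapy Response, can be sketched with the standard library alone. Everything below is illustrative (the function name and behavior are assumptions, not Scrapy's actual API):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _AnchorCollector(HTMLParser):
    """Collect href values from <a> tags while parsing."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.hrefs.append(value)


def extract_links(html: str, base_url: str) -> list:
    """Return absolute, de-duplicated link URLs found in raw HTML."""
    parser = _AnchorCollector()
    parser.feed(html)
    seen, links = set(), []
    for href in parser.hrefs:
        url = urljoin(base_url, href)
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links


html = '<a href="/author/einstein">A</a><a href="/author/einstein">B</a>'
print(extract_links(html, 'http://quotes.toscrape.com/'))
# -> ['http://quotes.toscrape.com/author/einstein']
```

A real implementation would keep LinkExtractor's filtering options (allow, deny, restrict_xpaths, etc.), but the point is that nothing in the core task requires a Response object.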
Thoughts?
I think the way to go is to refactor Scrapy's LinkExtractor, and likely move it into a separate package. In the meantime, any workaround is ok.
Thanks for all the help everyone!
I was able to make it work using the providers. 🎉 I attached the full reproducible code below for posterity.
I then tried to take it a step further by creating a MetaFromScrapyResponse type of dependency. However, when attempting to access response.meta, I encountered this error:
# full lines omitted
    self.meta = response.meta
  File "/Users/path/python3.8/site-packages/scrapy/http/response/__init__.py", line 46, in meta
    raise AttributeError(
AttributeError: Response.meta not available, this response is not tied to any request
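The error comes from the fact that Response.meta delegates to the request the response is tied to. A minimal pure-Python stand-in (not Scrapy's actual classes, just the same shape) reproduces the failure:

```python
class Request:
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta if meta is not None else {}


class Response:
    def __init__(self, url, request=None):
        self.url = url
        self.request = request  # Scrapy ties this via the downloader

    @property
    def meta(self):
        # Mirrors Scrapy's behavior: meta is just request.meta,
        # so it cannot exist without a tied request.
        if self.request is None:
            raise AttributeError(
                'Response.meta not available, this response is not tied to any request'
            )
        return self.request.meta


tied = Response('http://example.com',
                request=Request('http://example.com', meta={'item': {}}))
print(tied.meta)  # {'item': {}}

untied = Response('http://example.com')
try:
    untied.meta
except AttributeError as exc:
    print(exc)
```

So a response built outside the normal download cycle (as in the provider above) has no request, and therefore no meta.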
Creating a meta dependency is useful for us in some cases, especially when it contains key data from previously crawled pages. It essentially allows us to pass information between page objects that might be useful in parsing.
To simplify my example, I've added a commented line in the snippet below that reproduces this problem.
I have to note that accessing response.meta is doable via the ScrapyResponse, as demonstrated below. Still, when aiming for a cleaner approach, something like a dedicated MetaFromScrapyResponse would help.
import attr

from scrapy import Spider
from scrapy.http import Response, Request
from scrapy.linkextractors import LinkExtractor
from web_poet.pages import ItemWebPage
from scrapy_poet.page_input_providers import PageObjectInputProvider, provides


class ScrapyResponse:
    def __init__(self, response):
        self.response = response
        # self.meta = response.meta  # uncomment this and it'll error out.


@provides(ScrapyResponse)
class ScrapyResponseProvider(PageObjectInputProvider):
    def __init__(self, response: Response):
        self.response = response

    def __call__(self):
        return ScrapyResponse(self.response)


@attr.s(auto_attribs=True)
class QuotesListingPage(ItemWebPage):
    scrapy_response: ScrapyResponse

    def to_item(self):
        return {
            'site_name': self.css('h1 a::text').get(),
            'author_links': LinkExtractor(
                restrict_css='.author + a'
            ).extract_links(self.scrapy_response.response),
        }


@attr.s(auto_attribs=True)
class QuotesAuthorPage(ItemWebPage):
    scrapy_response: ScrapyResponse

    def to_item(self):
        assert self.scrapy_response.response.meta['some_other_meta_field']
        base_item = self.scrapy_response.response.meta['item'].copy()
        base_item.update({
            'author': self.css('.author-title ::text').get('').strip(),
            'born': self.css('.author-born-date::text').get('').strip(),
        })
        return base_item


class QuotesBaseSpider(Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response, page: QuotesListingPage):
        meta = {
            'item': {'field': 'value'},
            'some_other_meta_field': True,
        }
        data = page.to_item()
        for link in data['author_links']:
            yield Request(link.url, self.parse_author, meta=meta)

    def parse_author(self, response, page: QuotesAuthorPage):
        return page.to_item()
Cheers!
So, response.meta is basically a shortcut for response.request.meta.
If I understand this correctly, you cannot access meta because there is no request object set on the response your ScrapyResponse class receives. The reason for this might be the way InjectionMiddleware is implemented.
However, you can access meta if you take the request object as a parameter in ScrapyResponseProvider and pass request.meta on to ScrapyResponse, like this:
from scrapy.http import Request, Response
from scrapy_poet.page_input_providers import PageObjectInputProvider, provides


class ScrapyResponse:
    def __init__(self, response, meta):
        self.response = response
        self.meta = meta


@provides(ScrapyResponse)
class ScrapyResponseProvider(PageObjectInputProvider):
    def __init__(self, response: Response, request: Request):
        self.response = response
        self.meta = request.meta

    def __call__(self):
        return ScrapyResponse(self.response, self.meta)
Hope this helps.
I think the way to go is to refactor Scrapy's LinkExtractor, and likely move it into a separate package.
With the current API, I guess we could add extract_links (get_links? links? urls?) to SelectableMixin. As for the base implementation to be shared by Scrapy and web-poet, where should it live? w3lib? web-poet?
As for the base implementation to be shared by Scrapy and web-poet
More options: parsel, separate library :)
One neat feature inside Scrapy is its LinkExtractor functionality. We usually try to use it whenever we want links to be extracted from a given page.
Inside web-poet, we can attempt to use it as in the snippet at the top of this thread.
The problem lies in the extract_links() method, since it actually expects a Scrapy Response instance, while in the current scope we only have access to web-poet's ResponseData.
At the moment, we could simply rework the logic to avoid using LinkExtractors altogether. However, there might be cases where they are a much better option. With this in mind, this issue attempts to be a starting point to open up these discussion points:
- With LinkExtractors and web-poet being decoupled away from Scrapy itself, is it worth supporting LinkExtractors?
- Should we adapt LinkExtractors itself so it would be compatible with web-poet?