scrapinghub / web-poet

Web scraping Page Objects core library
https://web-poet.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Discussion for supporting Scrapy's LinkExtractor #10

Open BurnzZ opened 4 years ago

BurnzZ commented 4 years ago

One neat feature inside Scrapy is its LinkExtractor functionality. We usually reach for it whenever we want to extract links from a given page.

Inside web-poet, we can attempt to use it as:

from scrapy.linkextractors import LinkExtractor
from web_poet.pages import ItemWebPage

class SomePage(ItemWebPage):

    def to_item(self):
        return {
            'links': LinkExtractor(
                allow=r'some-website.com/product/tt\d+/$',
                process_value=some_processor,  # placeholder callback
                restrict_xpaths='//div[@id="products"]//span',
            ).extract_links(self.response)  # expects a Scrapy Response instance
        }

The problem lies in the extract_links() method, since it expects a Scrapy Response instance. In the current scope, we only have access to web-poet's ResponseData instead.

At the moment, we could simply rework the logic to avoid using LinkExtractor altogether. However, there may be cases where it's a much better option.

With this in mind, this issue attempts to be a starting point for opening up the discussion.

victor-torres commented 4 years ago

web-poet works independently of Scrapy, but you can use scrapy-poet to integrate web-poet into Scrapy projects.

scrapy-poet offers an injector middleware that builds and injects dependencies. It has some "Scrapy-provided" classes that are automatically injected without the need for a provider.
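For context, the injector is enabled as a regular downloader middleware in your project settings; this snippet follows the scrapy-poet documentation (treat the exact priority value as an assumption if your version differs):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy_poet.InjectionMiddleware': 543,
}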

You should then be able to create a Page Object that looks like this:

from scrapy.http import Response
from scrapy.linkextractors import LinkExtractor
from web_poet.pages import ItemWebPage

class SomePage(ItemWebPage):

    scrapy_response = Response

    def to_item(self):
        return {
            'links': LinkExtractor(
                allow=r'some-website.com/product/tt\d+/$',
                process_value=some_processor,  # placeholder callback
                restrict_xpaths='//div[@id="products"]//span',
            ).extract_links(self.scrapy_response)  # expects a Scrapy Response instance
        }

Note that currently, this approach makes it mandatory for the request to go through Scrapy's downloader middlewares. For example, if you're using AutoExtract providers that skip Scrapy requests, you'll end up making an additional request just to build this Scrapy response. The solution in that case would be to mock a Scrapy response from the AutoExtract response's HTML, as sketched below.
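A minimal sketch of that mocking idea, assuming web-poet's ResponseData exposes url and html attributes; the helper name is made up:

from scrapy.http import HtmlResponse
from web_poet.page_inputs import ResponseData

def fake_scrapy_response(data: ResponseData) -> HtmlResponse:
    # Build an HtmlResponse from already-downloaded HTML; no network
    # request is made. TextResponse subclasses accept a str body when
    # an explicit encoding is given.
    return HtmlResponse(url=data.url, body=data.html, encoding='utf-8')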

BurnzZ commented 3 years ago

Hi @victor-torres, thanks for taking the time to explain this feature!

This actually opens up a lot of opportunities for us, like having a neat way to get the response.meta from Scrapy.

However, I've run into a peculiar bug with this approach. I've created #11 to discuss it further.

victor-torres commented 3 years ago

As I've said in #11, the example I gave you is wrong in two ways.

That means you should create another type (ScrapyResponse, for example) that's initialized with a Response, and then create a provider that uses Response to build this ScrapyResponse class. Does that make sense? Confusing, right?

It should be something like this:

from scrapy.http import Response
from scrapy.linkextractors import LinkExtractor
from web_poet.pages import ItemWebPage

from scrapy_poet.page_input_providers import PageObjectInputProvider, provides

class ScrapyResponse:

    def __init__(self, response):
        self.response = response

@provides(ScrapyResponse)
class ScrapyResponseProvider(PageObjectInputProvider):

    def __init__(self, response: Response):
        self.response = response

    def __call__(self):
        return ScrapyResponse(self.response)

class QuotesListingPage(ItemWebPage):

    scrapy_response: ScrapyResponse

    def to_item(self):
        return {
            'site_name': self.css('h1 a::text').get(),
            'author_links': LinkExtractor(
                restrict_css='.author + a'
            ).extract_links(self.scrapy_response.response),
        }

Here come the problems:

I've just tested this code and it's not working 😞 I'm not sure what's happening, because there's a test case specifically designed to make sure Scrapy dependencies are injected into providers.

In any case, I'd like to leave the question here for @kmike, @ivanprado, @ejulio, and others: I know we've previously discussed this, but maybe we could rethink making Scrapy dependencies available to Page Objects as well. What do you think?

victor-torres commented 3 years ago

We've discussed this issue today and @ivanprado was able to spot the problem with my snippet: QuotesListingPage is missing an __init__ method or an attr.s decoration, like ItemWebPage itself has. The example below should work:

import attr

from scrapy.http import Response
from scrapy.linkextractors import LinkExtractor
from web_poet.pages import ItemWebPage

from scrapy_poet.page_input_providers import PageObjectInputProvider, provides

class ScrapyResponse:

    def __init__(self, response):
        self.response = response

@provides(ScrapyResponse)
class ScrapyResponseProvider(PageObjectInputProvider):

    def __init__(self, response: Response):
        self.response = response

    def __call__(self):
        return ScrapyResponse(self.response)

@attr.s(auto_attribs=True)
class QuotesListingPage(ItemWebPage):

    scrapy_response: ScrapyResponse

    def to_item(self):
        return {
            'site_name': self.css('h1 a::text').get(),
            'author_links': LinkExtractor(
                restrict_css='.author + a'
            ).extract_links(self.scrapy_response.response),
        }

Although it's possible to receive Scrapy dependencies when initializing providers to check some runtime conditions or settings, it doesn't look correct to create a provider specifically for Scrapy Responses.

It looks like we should contribute a pull request to Scrapy making LinkExtractor work with generic HTML data. @ivanprado is currently using one of its private methods to achieve this behavior; we should aim at providing the same feature through a public interface. This way, you could just pass ResponseData to LinkExtractor and it would work as expected, @BurnzZ.

ivanprado commented 3 years ago

One point about Page Objects is that they should be portable. That is, they shouldn't depend on ScrapyResponse but on ResponseData instead.

There is a workaround to use the link extractor in the following way:

from scrapy.linkextractors import LinkExtractor
from w3lib.html import get_base_url
from web_poet.pages import ItemWebPage

class QuotesListingPage(ItemWebPage):

    def to_item(self):
        link_extractor = LinkExtractor()
        base_url = get_base_url(self.html, self.url)
        # _extract_links() is private Scrapy API; see the caveat below.
        author_links = [link for sel in self.css('.author + a')
                        for link in link_extractor._extract_links(sel, self.url, "utf-8", base_url)]
        return {
            'site_name': self.css('h1 a::text').get(),
            'author_links': author_links,
        }

The problem is that it uses the private method _extract_links, which is not nice, and you also need to iterate manually over the selectors (a helper could at least hide that boilerplate; see the sketch below).
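For illustration, a hedged sketch of such a helper; it still relies on the private _extract_links(), so it may break with future Scrapy releases, and the function name is made up:

from scrapy.linkextractors import LinkExtractor
from w3lib.html import get_base_url

def extract_page_links(page, css, link_extractor=None):
    # Restrict extraction to elements matched by the CSS selector and
    # resolve relative URLs against the page's base URL.
    link_extractor = link_extractor or LinkExtractor()
    base_url = get_base_url(page.html, page.url)
    return [
        link
        for sel in page.css(css)
        for link in link_extractor._extract_links(sel, page.url, "utf-8", base_url)
    ]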

There are a few ideas about what could be done to improve the situation.

Thoughts?

kmike commented 3 years ago

I think the way to go is to refactor Scrapy's LinkExtractor, and likely move it into a separate package. In the meantime, any workaround is ok.

BurnzZ commented 3 years ago

Thanks for all the help everyone!

I was able to make it work using the providers. 🎉 I attached the full reproducible code below for posterity.

I then tried to take it a step further by creating a MetaFromScrapyResponse type of dependency. However, when attempting to access response.meta, I encounter this error:

# full lines omitted
    self.meta = response.meta
  File "/Users/path/python3.8/site-packages/scrapy/http/response/__init__.py", line 46, in meta
    raise AttributeError(
AttributeError: Response.meta not available, this response is not tied to any request

Creating a meta dependency is useful for us in some ways, especially if it contains key data from previously crawled pages. It essentially allows us to pass information between page objects that might be useful when parsing.

To keep my example simple, I've added a commented-out line in the snippet below that reproduces this problem.

I should note that accessing response.meta is doable via ScrapyResponse, as demonstrated below. However, for a cleaner approach, something like a dedicated MetaFromScrapyResponse would help.

import attr
from scrapy import Spider
from scrapy.http import Response, Request
from scrapy.linkextractors import LinkExtractor

from web_poet.pages import ItemWebPage

from scrapy_poet.page_input_providers import PageObjectInputProvider, provides

class ScrapyResponse:
    def __init__(self, response):
        self.response = response
        # self.meta = response.meta  # uncomment this and it'll error out.

@provides(ScrapyResponse)
class ScrapyResponseProvider(PageObjectInputProvider):

    def __init__(self, response: Response):
        self.response = response

    def __call__(self):
        return ScrapyResponse(self.response)

@attr.s(auto_attribs=True)
class QuotesListingPage(ItemWebPage):
    scrapy_response: ScrapyResponse

    def to_item(self):
        return {
            'site_name': self.css('h1 a::text').get(),
            'author_links': LinkExtractor(
                restrict_css='.author + a').extract_links(self.scrapy_response.response),
        }

@attr.s(auto_attribs=True)
class QuotesAuthorPage(ItemWebPage):
    scrapy_response: ScrapyResponse

    def to_item(self):
        assert self.scrapy_response.response.meta['some_other_meta_field']
        base_item = self.scrapy_response.response.meta['item'].copy()
        base_item.update({
            'author': self.css('.author-title ::text').get('').strip(),
            'born': self.css('.author-born-date::text').get('').strip(),
        })
        return base_item

class QuotesBaseSpider(Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response, page: QuotesListingPage):
        meta = {
            'item': {'field': 'value'},
            'some_other_meta_field': True,
        }

        data = page.to_item()

        for link in data['author_links']:
            yield Request(link.url, self.parse_author, meta=meta)

    def parse_author(self, response, page: QuotesAuthorPage):
        return page.to_item()

Cheers!

patkle commented 3 years ago

So, response.meta is basically a shortcut for response.request.meta. If I understand this correctly, you cannot access meta because there is no request object set for the response your ScrapyResponse class receives. The reason for this might be the way InjectionMiddleware is implemented.

However, you can access meta if you provide the request object as a parameter to ScrapyResponseProvider and pass request.meta from there to ScrapyResponse, like this:

from scrapy.http import Request, Response

from scrapy_poet.page_input_providers import PageObjectInputProvider, provides

class ScrapyResponse:
    def __init__(self, response, meta):
        self.response = response
        self.meta = meta

@provides(ScrapyResponse)
class ScrapyResponseProvider(PageObjectInputProvider):
    def __init__(self, response: Response, request: Request):
        self.response = response
        self.meta = request.meta

    def __call__(self):
        return ScrapyResponse(self.response, self.meta)
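Combining this provider with the earlier page objects, the author page can then read meta through the injected attribute. A sketch based on the snippets above:

import attr

from web_poet.pages import ItemWebPage

@attr.s(auto_attribs=True)
class QuotesAuthorPage(ItemWebPage):
    scrapy_response: ScrapyResponse

    def to_item(self):
        # meta now comes from the request via the provider, so this
        # no longer raises AttributeError.
        base_item = self.scrapy_response.meta['item'].copy()
        base_item.update({
            'author': self.css('.author-title ::text').get('').strip(),
        })
        return base_item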

Hope this helps.

Gallaecio commented 2 years ago

I think the way to go is to refactor Scrapy's LinkExtractor, and likely move it into a separate package.

With the current API, I guess we could add extract_links (get_links? links? urls?) to SelectableMixin. As for the base implementation to be shared by Scrapy and web-poet, where should it live? w3lib? web-poet?
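Purely as a strawman for such an interface, the shared base implementation could be as small as this (the name, signature, and behavior are all hypothetical):

from typing import List
from urllib.parse import urljoin

def extract_links(page, restrict_css: str = 'a') -> List[str]:
    # 'page' is anything exposing .css() and .url, which both Scrapy
    # responses and web-poet pages already do.
    hrefs = page.css(restrict_css + '::attr(href)').getall()
    return [urljoin(page.url, href) for href in hrefs]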

kmike commented 2 years ago

As for the base implementation to be shared by Scrapy and web-poet

More options: parsel, separate library :)