=====================================
Scrapy & Autoextract API integration
=====================================

.. image:: https://img.shields.io/pypi/v/scrapy-autoextract.svg
   :target: https://pypi.org/project/scrapy-autoextract/
   :alt: PyPI Version

.. image:: https://img.shields.io/pypi/pyversions/scrapy-autoextract.svg
   :target: https://pypi.org/project/scrapy-autoextract/
   :alt: Supported Python Versions

.. image:: https://github.com/scrapinghub/scrapy-autoextract/workflows/tox/badge.svg
   :target: https://github.com/scrapinghub/scrapy-autoextract/actions
   :alt: Build Status

.. image:: https://codecov.io/gh/scrapinghub/scrapy-autoextract/branch/master/graph/badge.svg?token=D6DQUSkios
   :target: https://codecov.io/gh/scrapinghub/scrapy-autoextract
   :alt: Coverage report

This library integrates Zyte's AI Enabled Automatic Data Extraction into a Scrapy spider by two different means:

* with a downloader middleware that injects the AutoExtract results into response.meta['autoextract'] for consumption by the spider
* with a scrapy-poet_ provider that injects the results as page objects into spider callbacks

Installation
============

::

    pip install scrapy-autoextract

scrapy-autoextract requires Python 3.7+ for both the download middleware and the scrapy-poet provider.

Usage
=====

There are two different ways to consume the AutoExtract API with this library:

* using our Scrapy middleware
* using our Page Object providers

The middleware
--------------

The middleware is opt-in and can be explicitly enabled per request, with the {'autoextract': {'enabled': True}} request meta. All the options below can be set either in the project settings file, or just for specific spiders, in the custom_settings dict.
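
For example (a minimal sketch, where the spider name and URL are placeholders rather than anything the library prescribes), a single request can be opted in from start_requests::

    import scrapy

    class SampleMiddlewareSpider(scrapy.Spider):
        name = "sample_middleware"  # hypothetical name, for illustration only

        def start_requests(self):
            # Opt this request in to AutoExtract through the request meta
            yield scrapy.Request(
                "http://example.com/article",  # placeholder URL
                meta={'autoextract': {'enabled': True}},
            )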

Within the spider, consuming the AutoExtract result is as easy as::

    def parse(self, response):
        yield response.meta['autoextract']

Configuration
^^^^^^^^^^^^^

Add the AutoExtract downloader middleware in the settings file::

    DOWNLOADER_MIDDLEWARES = {
        'scrapy_autoextract.AutoExtractMiddleware': 543,
    }

Note that this should be the last downloader middleware to be executed.
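
In Scrapy, downloader middlewares with higher order values have their process_request called later, so keeping AutoExtractMiddleware at the largest value among your project middlewares keeps it last. A rough sketch (the other middleware path is hypothetical)::

    DOWNLOADER_MIDDLEWARES = {
        # hypothetical project middleware, shown only to illustrate ordering
        'myproject.middlewares.CustomHeadersMiddleware': 500,
        # the AutoExtract middleware keeps the highest value, so it runs last
        'scrapy_autoextract.AutoExtractMiddleware': 543,
    }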

The providers
-------------

Another way of consuming the AutoExtract API is using the Page Objects pattern proposed by the web-poet_ library and implemented by scrapy-poet_.

Items returned by Page Objects are defined in the autoextract-poet_ library.

Within the spider, consuming the AutoExtract result is as easy as::

    import scrapy
    from autoextract_poet.pages import AutoExtractArticlePage

    class SampleSpider(scrapy.Spider):
        name = "sample"

        def parse(self, response, article_page: AutoExtractArticlePage):
            # We're making two requests here:
            # - one through Scrapy to build the response argument
            # - the other through the providers to build the article_page argument
            yield article_page.to_item()

Note that in the example above, we're going to perform two requests:

* one request goes through Scrapy to build the response argument
* the other goes through the providers (the AutoExtract API) to build the article_page argument

If you don't need the additional request going through Scrapy, you can annotate the response argument of your callback with DummyResponse. This tells scrapy-poet to skip the Scrapy download, so only the AutoExtract API request is made.

For example::

    import scrapy
    from autoextract_poet.pages import AutoExtractArticlePage
    from scrapy_poet import DummyResponse

    class SampleSpider(scrapy.Spider):
        name = "sample"

        def parse(self, response: DummyResponse, article_page: AutoExtractArticlePage):
            # We're making a single request here to build the article argument
            yield article_page.to_item()

The examples above extract an article from the page, but you may want to extract a different type of item, like a product or a job posting. It is as easy as using the correct type annotation in the callback. This is how the callback looks if we need to extract real estate data from the page::

    def parse(self,
              response: DummyResponse,
              real_estate_page: AutoExtractRealEstatePage):
        yield real_estate_page.to_item()

You can even use AutoExtractWebPage if what you need is the raw browser HTML to extract some additional data. Visit the full `list of supported page types <https://docs.zyte.com/automatic-extraction.html#result-fields>`_ to get a better idea of what is available.

Lastly, if you have an AutoExtract subscription with fullHtml set to True, you can access the HTML data that AutoExtract used, in case you need it. Here's an example:

.. code-block:: python

    def parse_product(self,
                      response: DummyResponse,
                      product_page: AutoExtractProductPage,
                      html_page: AutoExtractWebPage):
        product_item = product_page.to_item()

        # You can easily interact with the html_page using these selectors.
        html_page.css(...)
        html_page.xpath(...)
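
For instance (a hedged sketch; the CSS selector and dictionary keys below are made up for illustration), the callback above could combine both sources like this::

    # Hypothetical: merge the AutoExtract item with extra data from the raw HTML
    yield {
        'product': product_page.to_item(),
        'page_title': html_page.css('title::text').get(),
    }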

Configuration
^^^^^^^^^^^^^

First, you need to configure scrapy-poet as described in `scrapy-poet's documentation`_ and then enable the AutoExtract providers by adding the following code to Scrapy's settings.py file::

    # Install AutoExtract provider
    SCRAPY_POET_PROVIDERS = {"scrapy_autoextract.AutoExtractProvider": 500}

    # Enable scrapy-poet's provider injection middleware
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_poet.InjectionMiddleware': 543,
    }

    # Configure Twisted's reactor for asyncio support on Scrapy
    TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

Currently, our providers are implemented using asyncio. Scrapy introduced asyncio support in version 2.0, but as of Scrapy 2.3 you need to enable it manually by configuring Twisted's default reactor. Check `Scrapy's asyncio documentation`_ for more information.

Checklist:

* scrapy-poet is installed and its InjectionMiddleware is configured
* autoextract-poet is installed (page objects such as AutoExtractArticlePage are imported from it)
* the AutoExtract provider is enabled in SCRAPY_POET_PROVIDERS in settings.py
* Scrapy's asyncio support is enabled via TWISTED_REACTOR in settings.py

Now you should be ready to use our AutoExtract providers.

Exceptions
^^^^^^^^^^

While fetching from the AutoExtract API, providers might raise some exceptions. Those exceptions might come from the scrapy-autoextract providers themselves, from zyte-autoextract_, or from other sources (e.g. ConnectionError). For example:

* autoextract.aio.errors.RequestError: raised when a `Request-level error`_ is returned
* scrapy_autoextract.errors.QueryError: raised when a `Query-level error`_ is returned

Check `zyte-autoextract's async errors`_ for other exception definitions.

You can capture those exceptions using an error callback (errback)::

    import scrapy
    from autoextract.aio.errors import RequestError
    from autoextract_poet.pages import AutoExtractArticlePage
    from scrapy_autoextract.errors import QueryError
    from scrapy_poet import DummyResponse
    from twisted.python.failure import Failure

    class SampleSpider(scrapy.Spider):
        name = "sample"
        urls = [...]

        def start_requests(self):
            for url in self.urls:
                yield scrapy.Request(url, callback=self.parse_article,
                                     errback=self.errback_article)

        def parse_article(self, response: DummyResponse,
                          article_page: AutoExtractArticlePage):
            yield article_page.to_item()

        def errback_article(self, failure: Failure):
            if failure.check(RequestError):
                self.logger.error(f"RequestError on {failure.request.url}")

            if failure.check(QueryError):
                self.logger.error(f"QueryError: {failure.value.message}")

See the `Scrapy documentation <https://docs.scrapy.org/en/latest/topics/request-response.html#using-errbacks-to-catch-exceptions-in-request-processing>`_ for more details on how to capture exceptions using a request's errback.

Settings
========

Middleware settings
-------------------

Provider settings
-----------------

Limitations
===========

When using the AutoExtract middleware, there are some limitations.

When using the AutoExtract providers, be aware that:

.. _web-poet: https://github.com/scrapinghub/web-poet
.. _scrapy-poet: https://github.com/scrapinghub/scrapy-poet
.. _autoextract-poet: https://github.com/scrapinghub/autoextract-poet
.. _zyte-autoextract: https://github.com/zytedata/zyte-autoextract
.. _zyte-autoextract's async errors: https://github.com/zytedata/zyte-autoextract/blob/master/autoextract/aio/errors.py
.. _scrapy-poet's documentation: https://scrapy-poet.readthedocs.io/en/latest/intro/tutorial.html#configuring-the-project
.. _Scrapy's asyncio documentation: https://docs.scrapy.org/en/latest/topics/asyncio.html
.. _Request-level error: https://doc.scrapinghub.com/autoextract.html#request-level
.. _Query-level error: https://doc.scrapinghub.com/autoextract.html#query-level
.. _supported page types: https://autoextract-poet.readthedocs.io/en/stable/_autosummary/autoextract_poet.pages.html#module-autoextract_poet.pages