jiansoung / issues-list

Notes on problems encountered in day-to-day study and development. Comments and discussion are welcome :)
https://github.com/jiansoung/issues-list/issues

Quick Tutorial for Scrapy #12


Architecture overview

Scrapy's data flow is driven by its Engine: the Engine gets the initial Requests from the Spider, schedules them through the Scheduler, sends them to the Downloader, and passes the downloaded Responses back to the Spider's callbacks; items yielded by the Spider are handed to the Item Pipelines, and Downloader/Spider middlewares can hook into each of these steps. See the architecture overview in the Scrapy documentation for the full diagram.

Creating a project

$ scrapy startproject tutorial

$ tree tutorial
tutorial
├── scrapy.cfg
└── tutorial
    ├── __init__.py
    ├── __pycache__
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── __init__.py
        └── __pycache__

4 directories, 7 files
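
The generated items.py is where structured Item classes can be declared. This tutorial simply yields plain dicts, but as a rough sketch (the QuoteItem class below is hypothetical, not something startproject generates), a quote item could be declared like this:

import scrapy

# Hypothetical item for the quotes scraped later in this tutorial;
# declaring Fields gives the scraped data a fixed, named structure.
class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()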

Writing a spider

Spiders must subclass scrapy.Spider and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.

Create a new spider in tutorial/spiders/quotes_spider.py:

import scrapy

# Scrapy spiders must subclass `scrapy.Spider`
class QuotesSpider(scrapy.Spider):
    # `name`: identifies the Spider, and must be unique within a project.
    name = "quotes"

    # `start_requests`: must return an iterable of Requests (list or generator)
    # which the Spider will begin to crawl from. Subsequent requests will be
    # generated successively from these initial requests.
    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # Shortcut: instead of implementing start_requests, you can define a
    # `start_urls` class attribute; Scrapy will create the initial Requests
    # for you and use parse() as the default callback.
    # start_urls = [
    #     'http://quotes.toscrape.com/page/1/',
    #     'http://quotes.toscrape.com/page/2/',
    # ]

    # `parse`: a method that will be called to handle the response downloaded
    # for each of the requests made.
    # The `response` parameter is an instance of `TextResponse` that holds the
    # page content and has further helpful methods to handle it.
    # The parse() method usually parses the response, extracting the scraped
    # data as dicts and also finding new URLs to follow and creating new
    # requests (Request) from them.
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            # What you see here is Scrapy’s mechanism of following links: when
            # you yield a Request in a callback method, Scrapy will schedule
            # that request to be sent and register a callback method to be
            # executed when that request finishes.
            yield scrapy.Request(next_page, callback=self.parse)

        # A shortcut for creating Requests: response.follow accepts relative
        # URLs directly, so there is no need to call response.urljoin().
        # next_page = response.css('li.next a::attr(href)').extract_first()
        # if next_page is not None:
        #     yield response.follow(next_page, callback=self.parse)

        # Or follow all matching links by passing their selectors directly
        # for a in response.css('li.next a'):
        #     yield response.follow(a, callback=self.parse)
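
Putting the two commented-out shortcuts together (start_urls plus response.follow), a more compact but equivalent version of the spider might look like this sketch:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        # One dict per quote on the page.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
        # response.follow accepts relative URLs and link selectors directly.
        for a in response.css('li.next a'):
            yield response.follow(a, callback=self.parse)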

Running the spider

$ scrapy crawl quotes -o quotes.json
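
Besides the crawl command, a spider can also be run from a plain Python script using Scrapy's CrawlerProcess. A minimal sketch (the FEED_FORMAT/FEED_URI settings used here match older Scrapy releases; newer versions use the FEEDS setting instead):

from scrapy.crawler import CrawlerProcess
from tutorial.spiders.quotes_spider import QuotesSpider

# Roughly equivalent to `scrapy crawl quotes -o quotes.json`.
process = CrawlerProcess(settings={
    'FEED_FORMAT': 'json',
    'FEED_URI': 'quotes.json',
})
process.crawl(QuotesSpider)
process.start()  # blocks until the crawl finishes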

Playing with the Scrapy shell

$ scrapy shell 'http://quotes.toscrape.com/page/1/'
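
Inside the shell the response object is already populated, so selectors can be tried out interactively before copying them into the spider. For example (output abbreviated; the exact values depend on the live page):

>>> response.css('title::text').extract_first()
'Quotes to Scrape'
>>> quote = response.css('div.quote')[0]
>>> quote.css('span.text::text').extract_first()
'“The world as we have created it is a process of our thinking. ...”'
>>> quote.css('small.author::text').extract_first()
'Albert Einstein'
>>> quote.css('div.tags a.tag::text').extract()
['change', 'deep-thoughts', 'thinking', 'world']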
