keyfall / xuexibiji


Scrapy #31

keyfall opened this issue 4 years ago

keyfall commented 4 years ago

Understanding Scrapy

Scrapy is an application framework for crawling websites and extracting structured data, which can be used for a wide range of useful applications such as data mining, information processing, or historical archiving.

Creating a project

Run scrapy startproject tutorial in cmd to create a new project. This creates a tutorial directory:

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # directory where the spiders live
            __init__.py

The first spider

Save it in a file named quotes_spider.py under the tutorial/spiders directory of the project:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

name: identifies the spider; it must be unique within a project.
start_requests(): must return an iterable of Requests (a list of requests, or a generator function) from which the spider will start crawling.
parse(): handles the response downloaded for each request and parses it.

Running

scrapy crawl quotes — here quotes is the name attribute defined in quotes_spider.py.

A shortcut for the start_requests method

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)

Compared with the code above, the start_requests method is gone and the URLs are moved into the start_urls attribute; this works because parse() is Scrapy's default callback for those requests.

Extracting data

The best way to learn how to extract data with Scrapy is to use the Scrapy shell. On Windows: scrapy shell "http://quotes.toscrape.com/page/1/"

Then use CSS expressions to extract data:
Get the text inside the title tag: response.css('title::text').getall()
Get the title element itself, tag included: response.css('title').getall()
getall() returns a list because there may be multiple results; to get only the first one, use get(): response.css('title::text').get()
An alternative to the previous call: response.css('title::text')[0].get()

Using get() directly is preferred here because it avoids an IndexError and returns None when nothing matches.

Besides getall() and get(), you can also extract with regular expressions using re():

>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']

Extracting data in our spider

Use the Python yield keyword together with CSS selectors to produce a dict containing the text, author and tags of each quote on the page:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

storing the scraped data

scrapy crawl quotes -o quotes.json generates a quotes.json file containing all scraped items, serialized in JSON. If you run this command twice without removing the file before the second run, you'll end up with a broken JSON file.

Another format is JSON Lines: scrapy crawl quotes -o quotes.jl

It is stream-like, so you can easily append new records to it, and it doesn't have the same problem as JSON when you run the command twice. Since each record is a separate line, you can process big files without having to fit everything in memory.
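As a rough illustration of that streaming property (a minimal sketch, assuming the quotes.jl file produced above), each line can be decoded independently without loading the whole file:

import json

# Process quotes.jl one record at a time; memory use stays flat
# no matter how large the file grows.
with open('quotes.jl', encoding='utf-8') as f:
    for line in f:
        record = json.loads(line)          # one complete item per line
        print(record['author'], record['tags'])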

Extracting data from all pages

We can use a CSS selector to get the link to the next page, then use scrapy.Request with that link to keep the crawl going through all the pages. This creates a sort of loop, following the links to the next page until it doesn't find one, which is handy for crawling blogs, forums and other sites with pagination.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

a shortcut for creating Requests

You can use response.follow instead of scrapy.Request:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

response.follow supports relative URLs directly, so there is no need to call urljoin.

you can also pass a selector to 'response.follow' instead of a string:

for href in response.css('li.next a::attr(href)'):
    yield response.follow(href, callback=self.parse)

Since response.css() returns a list-like object of selectors, you can iterate over it directly instead of building a list of URL strings first.

For <a> elements there is an even shorter form: response.follow uses their href attribute automatically.

for a in response.css('li.next a'):
    yield response.follow(a, callback=self.parse)

Note that response.follow accepts a single selector, not a list of selectors.

keyfall commented 4 years ago

command line tool

Configuration settings

Scrapy will look for configuration parameters in scrapy.cfg files in these standard locations:
1. /etc/scrapy.cfg or c:\scrapy\scrapy.cfg (system-wide)
2. ~/.config/scrapy.cfg ($XDG_CONFIG_HOME) and ~/.scrapy.cfg ($HOME) for global (user-wide) settings
3. scrapy.cfg inside a Scrapy project's root
Settings are merged in order of preference 3 > 2 > 1: project-wide settings > user-wide settings > system-wide settings.

sharing the root directory between projects

scrapy.cfg
myproject1/
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
myproject2/
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py

This is the structure of a Scrapy project root: it can contain multiple projects, such as myproject1 and myproject2, sharing the same scrapy.cfg.

[settings]
default = myproject1.settings
project1 = myproject1.settings
project2 = myproject2.settings

You must define one or more aliases for the settings modules in the [settings] section, as shown above.

$ scrapy settings --get BOT_NAME
Project 1 Bot
$ export SCRAPY_PROJECT=project2
$ scrapy settings --get BOT_NAME
Project 2 Bot

The commands above are for a Linux shell; on Windows, replace export with set. There really are two settings files here, and SCRAPY_PROJECT selects which of the entries declared in scrapy.cfg's [settings] section is used.

using the scrapy tool

Running scrapy with no arguments prints the available commands:

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test  
  check         Check spider contracts
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

creating projects

scrapy startproject myproject [project_dir] creates a Scrapy project under the project_dir directory; if project_dir is not specified, it defaults to myproject.

available tool commands

there are two kinds of commands:

Global commands: startproject genspider settings runspider shell fetch view version

Project-only commands: crawl check list edit parse bench

<> means required, [] means optional.

startproject syntax:scrapy startproject <project_name> [project_dir]

genspider syntax:scrapy genspider [-t template] <name> <domain> usage example:

$ scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

$ scrapy genspider example example.com
Created spider 'example' using template 'basic'

$ scrapy genspider -t crawl scrapyorg scrapy.org
Created spider 'scrapyorg' using template 'crawl'

If called from inside a project, the new spider is created in the current project's spiders folder; otherwise it is created in the current folder. The name parameter sets the spider's name, and domain is used to generate the allowed_domains and start_urls attributes. allowed_domains restricts which domains the spider is allowed to crawl.
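For reference, the spider produced by scrapy genspider example example.com with the basic template looks roughly like this (a sketch; the exact template text may differ between Scrapy versions):

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        # the generated callback is an empty stub for you to fill in
        pass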

crawl scrapy crawl <spider> start crawling using a spider

$ scrapy crawl myspider
[ ... myspider starts crawling ... ]

check scrapy check [-l] <spider> run contract checks

$ scrapy check -l
first_spider
  * parse
  * parse_item
second_spider
  * parse
  * parse_item

$ scrapy check
[FAILED] first_spider:parse_item
>>> 'RetailPricex' field is missing

[FAILED] first_spider:parse
>>> Returned 92 requests, expected 0..4

list scrapy list list all available spiders in the current project.

$ scrapy list
spider1
spider2

edit scrapy edit <spider> Edit the given spider using the editor defined in the EDITOR environment variable or the EDITOR setting (the latter can be set in the project's settings module). Example: scrapy edit spider1

fetch scrapy fetch <url> downloads the given URL using the Scrapy downloader and writes the contents to standard output. Supported options:
--spider=SPIDER: bypass spider autodetection and force use of a specific spider
--headers: print the response's HTTP headers instead of the response's body
--no-redirect: do not follow HTTP 3xx redirects

$ scrapy fetch --nolog http://www.example.com/some/page.html
[ ... html content here ... ]

$ scrapy fetch --nolog --headers http://www.example.com/
{'Accept-Ranges': ['bytes'],
 'Age': ['1263   '],
 'Connection': ['close     '],
 'Content-Length': ['596'],
 'Content-Type': ['text/html; charset=UTF-8'],
 'Date': ['Wed, 18 Aug 2010 23:59:46 GMT'],
 'Etag': ['"573c1-254-48c9c87349680"'],
 'Last-Modified': ['Fri, 30 Jul 2010 15:30:18 GMT'],
 'Server': ['Apache/2.2.3 (CentOS)']}

view scrapy view <url> Sometimes spiders see pages differently from regular users, so this can be used to check what the spider “sees” and confirm it’s what you expect. Supported options: --spider=SPIDER: bypass spider autodetection and force use of specific spider --no-redirect: do not follow HTTP 3xx redirects (default is to follow them)

$ scrapy view http://www.example.com/some/page.html
[ ... browser starts ... ]

shell scrapy shell [url] Starts the scrapy shell for the given URL,'Scrapy shell'module have more info. Supported options: --spider=SPIDER: bypass spider autodetection and force use of specific spider -c code: evaluate the code in the shell, print the result and exit --no-redirect: do not follow HTTP 3xx redirects (default is to follow them); this only affects the URL you may pass as argument on the command line; once you are inside the shell, fetch(url) will still follow HTTP redirects by default.

$ scrapy shell http://www.example.com/some/page.html
[ ... scrapy shell starts ... ]

$ scrapy shell --nolog http://www.example.com/ -c '(response.status, response.url)'
(200, 'http://www.example.com/')

# shell follows HTTP redirects by default
$ scrapy shell --nolog http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F -c '(response.status, response.url)'
(200, 'http://example.com/')

# you can disable this with --no-redirect
# (only for the URL passed as command line argument)
$ scrapy shell --no-redirect --nolog http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F -c '(response.status, response.url)'
(302, 'http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F')

parse scrapy parse <url> [options] Fetches the given URL and parses it with the spider that handles it, using the method passed with the --callback option, or parse if not given. Supported options:
--spider=SPIDER: bypass spider autodetection and force use of a specific spider
-a NAME=VALUE: set spider argument (may be repeated)
--callback or -c: spider method to use as callback for parsing the response
--meta or -m: additional request meta that will be passed to the callback request. This must be a valid JSON string. Example: --meta='{"foo" : "bar"}'
--cbkwargs: additional keyword arguments that will be passed to the callback. This must be a valid JSON string. Example: --cbkwargs='{"foo" : "bar"}'
--pipelines: process items through pipelines
--rules or -r: use CrawlSpider rules to discover the callback (i.e. spider method) to use for parsing the response
--noitems: don't show scraped items
--nolinks: don't show extracted links
--nocolour: avoid using pygments to colorize the output
--depth or -d: depth level for which the requests should be followed recursively (default: 1)
--verbose or -v: display information for each depth level

$ scrapy parse http://www.example.com/ -c parse_item
[ ... scrapy log lines crawling example.com spider ... ]

>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items  ------------------------------------------------------------
[{'name': 'Example item',
 'category': 'Furniture',
 'length': '12 cm'}]

# Requests  -----------------------------------------------------------------
[]

You usually need to add --spider and/or -c <callback> after scrapy parse <url> so Scrapy knows which spider and parse method to use.

settings scrapy settings [options] get the value of a scrapy setting. If used inside a project it’ll show the project setting value, otherwise it’ll show the default Scrapy value for that setting.

Options:
--help, -h: show this help message and exit
--get=SETTING: print raw setting value
--getbool=SETTING: print setting value, interpreted as a boolean
--getint=SETTING: print setting value, interpreted as an integer
--getfloat=SETTING: print setting value, interpreted as a float
--getlist=SETTING: print setting value, interpreted as a list

$ scrapy settings --get BOT_NAME
scrapybot
$ scrapy settings --get DOWNLOAD_DELAY
0

runspider scrapy runspider <spider_file.py> Run a spider self-contained in a Python file, without having to create a project.

$ scrapy runspider myspider.py
[ ... spider starts crawling ... ]

version scrapy version [-v] Prints the Scrapy version. If used with -v it also prints Python, Twisted and Platform info, which is useful for bug reports.

bench scrapy bench Run a quick benchmark test. Benchmarking module has more info.

Custom project commands

You can also add your custom project commands by using the COMMANDS_MODULE setting. See the Scrapy commands in "https://github.com/scrapy/scrapy/tree/master/scrapy/commands" for examples on how to implement your commands.

COMMANDS_MODULE

A module to use for looking up custom Scrapy commands. This is used to add custom commands for your Scrapy project.

COMMANDS_MODULE = 'mybot.commands'
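A minimal sketch of what such a module might contain, assuming a hypothetical mybot/commands/mycommand.py (the command name comes from the module file name; the body here is made up for illustration, but ScrapyCommand, requires_project, short_desc() and run() are the real extension points):

# mybot/commands/mycommand.py  (hypothetical module inside the COMMANDS_MODULE package)
from scrapy.commands import ScrapyCommand


class Command(ScrapyCommand):
    requires_project = True

    def short_desc(self):
        return "Print the names of all spiders in the project"

    def run(self, args, opts):
        # crawler_process is set up by the scrapy command-line machinery
        for name in self.crawler_process.spider_loader.list():
            print(name)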

Register commands via setup.py entry points

You can also add Scrapy commands from an external library by adding a scrapy.commands section in the entry points of the library setup.py file.

from setuptools import setup, find_packages

setup(name='scrapy-mymodule',
  entry_points={
    'scrapy.commands': [
      'my_command=my_scrapy_module.commands:MyCommand',
    ],
  },
 )


keyfall commented 4 years ago

Spiders

Spiders are classes which define how a certain site (or a group of sites) will be scraped. The scraping cycle goes through these steps: 1. You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.

2. In the callback function, you parse the response (web page) and return either dicts with extracted data, Item objects, Request objects, or an iterable of these objects. Those Requests may also carry a callback, which Scrapy will then invoke once their responses are downloaded.

3.In callback functions, you parse the page contents, typically using Selectors (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer) and generate items with the parsed data.

4.Finally, the items returned from the spider will be typically persisted to a database (in some Item Pipeline) or written to a file using Feed exports.

scrapy.Spider

class scrapy.spiders.Spider Every spider must inherit from this class. It just provides a default start_requests() implementation which sends requests from the start_urls spider attribute and calls the spider's parse method for each of the resulting responses.

name

A string which defines the name for this spider. The spider name is how the spider is located (and instantiated) by Scrapy, so it must be unique and it is required. You can still instantiate more than one instance of the same spider. A common convention is to name the spider after the single domain it scrapes.

allowed_domains

An optional list of strings containing domains that this spider is allowed to crawl. If OffsiteMiddleware is enabled, requests for URLs not belonging to the domains in allowed_domains won't be followed.

For example, if the target URL is https://www.example.com/1.html, adding 'example.com' to the list is enough.

start_urls

A list of URLs where the spider will begin to crawl from, when no particular URLs are specified.

custom_settings

A dictionary of settings that will be overridden from the project wide configuration when running this spider. For a list of available built-in settings see:Built-in settings reference.
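A small sketch of how this looks in practice (the setting values here are arbitrary examples, not defaults):

import scrapy


class PoliteSpider(scrapy.Spider):
    name = 'polite'
    # These values override the project-wide settings only for this spider.
    custom_settings = {
        'DOWNLOAD_DELAY': 2,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
    }
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass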

crawler

This attribute is set by the from_crawler() class method after initializing the class, and links to the Crawler object to which this spider instance is bound.

settings

Configuration for running this spider; this is a Settings instance.

logger

Python logger created with the Spider’s name. You can use it to send log messages through it as described on Logging from Spiders.

from_crawler(crawler,*args,**kwargs)

This is the class method used by Scrapy to create your spiders; you probably won't need to override it. This method sets the crawler and settings attributes in the new instance so they can be accessed later inside the spider's code. Parameters: crawler (Crawler instance) – crawler to which the spider will be bound; args (list) – arguments passed to the __init__() method; kwargs (dict) – keyword arguments passed to the __init__() method.

start_requests()

This method must return an iterable with the first Requests to crawl for this spider. Scrapy calls it only once, so it is safe to implement start_requests() as a generator. The default implementation generates Request(url, dont_filter=True) for each url in start_urls.

parse

This is the default callback used by Scrapy to process downloaded responses.

log(message [,level,component])

Wrapper that sends a log message through the Spider’s logger, kept for backward compatibility. For more information see Logging from Spiders

closed(reason)

Called when the spider closes.
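A minimal sketch tying the logger and closed() hooks together (the spider name and URL are just placeholders):

import scrapy


class LoggingSpider(scrapy.Spider):
    name = 'logging_example'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # self.logger is a standard Python logger named after the spider
        self.logger.info('Parsed %s (%d bytes)', response.url, len(response.body))

    def closed(self, reason):
        # reason is e.g. 'finished', 'cancelled' or 'shutdown'
        self.logger.info('Spider closed: %s', reason)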

Spider arguments

Spiders can receive arguments that modify their behaviour. Keep in mind that spider arguments are only strings; the spider will not do any parsing on them on its own. Arguments can be passed in several ways (see the sketch after this list): 1. command line: scrapy crawl myspider -a category=electronics

2. The __init__ method:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.example.com/categories/%s' % category]
        # ...

3.Spider arguments can also be passed through the Scrapyd schedule.json API. See Scrapyd documentation
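As promised above, a small sketch of the other common pattern: if you don't override __init__, the default implementation copies any -a arguments onto the spider as attributes, so they can be read with getattr (the category name here is just an example):

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        # 'category' was passed on the command line with -a category=electronics
        # and set as an attribute by the default __init__
        category = getattr(self, 'category', 'default')
        yield scrapy.Request('http://www.example.com/categories/%s' % category)

    def parse(self, response):
        pass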

Generic Spiders

crawlspider

class scrapy.spiders.CrawlSpider This is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules. It adds a new attribute, rules: a list of one (or more) Rule objects, each of which defines a certain behaviour for crawling the site. It also adds an overrideable method, parse_start_url(response): this method is called for the start_urls responses. It allows parsing the initial responses and must return either an Item object, a Request object, or an iterable containing any of them.

crawling rules

scrapy.spiders.Rule(link_extractor=None, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)

link_extractor is a Link Extractor object which defines how links will be extracted from each crawled page.

callback is a callable or a string (in which case a method from the spider object with that name will be used) to be called for each link extracted with the specified link extractor. warning:When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.

cb_kwargs is a dict containing the keyword arguments to be passed to the callback function.

follow is a boolean which specifies if links should be followed from each response extracted with this rule. If callback is None follow defaults to True, otherwise it defaults to False.

process_links is a callable, or a string (in which case a method from the spider object with that name will be used) which will be called for each list of links extracted from each response using the specified link_extractor. This is mainly used for filtering purposes.

example:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        # allow: URLs matching the regular expression(s) in parentheses are extracted; if empty, everything matches.
        # deny: URLs matching this regular expression (or list of regular expressions) are never extracted.
        # allow_domains: domains whose links will be extracted.
        # deny_domains: domains whose links will never be extracted.
        # restrict_xpaths: XPath expressions used together with allow to filter links.
        Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').get()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').get()
        item['link_text'] = response.meta['link_text']
        return item

XMLFeedSpider

class scrapy.spiders.XMLFeedSpider XMLFeedSpider is designed for parsing XML feeds by iterating through them by a certain node name. You can use 'iternodes', 'xml' and 'html' as the iterator; 'iternodes' is recommended.

To set the iterator and the tag name, you must define the following class attributes: iterator: A string which defines the iterator to use. It can be either:

'iternodes' - a fast iterator based on regular expressions
'html' - an iterator which uses Selector. Keep in mind this uses DOM parsing and must load all the DOM in memory, which could be a problem for big feeds
'xml' - an iterator which uses Selector. Keep in mind this uses DOM parsing and must load all the DOM in memory, which could be a problem for big feeds
It defaults to 'iternodes'.

itertag: A string with the name of the node (or element) to iterate in.

namespaces: A list of (prefix, uri) tuples which define the namespaces available in that document that will be processed with this spider. The prefix and uri will be used to automatically register namespaces using the register_namespace() method. You can then specify nodes with namespaces in the itertag attribute.

class YourSpider(XMLFeedSpider):

    namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
    itertag = 'n:url'
    # ...

overrideable methods: adapt_response(response): A method that receives the response as soon as it arrives from the spider middleware, before the spider starts parsing it. It means you can modify the response and then return it before parsing starts.

parse_node(response, selector): This method is called for the nodes matching the provided tag name (itertag). This method must be overridden.

process_results(response, results): This method is called for each result (item or request) returned by the spider. It is intended to perform any last-minute processing required before returning the results to the framework core.

example:

from scrapy.spiders import XMLFeedSpider
from myproject.items import TestItem

class MySpider(XMLFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.xml']
    iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.getall()))

        item = TestItem()
        item['id'] = node.xpath('@id').get()
        item['name'] = node.xpath('name').get()
        item['description'] = node.xpath('description').get()
        return item

SitemapSpider

class scrapy.spiders.SitemapSpider SitemapSpider allows you to crawl a site by discovering the URLs using Sitemaps. sitemap_urls: A list of urls pointing to the sitemaps whose urls you want to crawl.

sitemap_rules: A list of tuples (regex, callback) where:

    regex is a regular expression to match urls extracted from sitemaps. regex can be     either a str or a compiled regex object.
    callback is the callback to use for processing the urls that match the regular expression. callback can be a string (indicating the name of a spider method) or a callable.

sitemap_rules = [('/product/', 'parse_product')]

sitemap_follow: A list of regexes of sitemap that should be followed. This is only for sites that use Sitemap index files that point to other sitemap files. By default, all sitemaps are followed.

sitemap_alternate_links: Specifies if alternate links for one url should be followed. These are links for the same website in another language passed within the same url block.

    <url>
    <loc>http://example.com/</loc>
    <xhtml:link rel="alternate" hreflang="de" href="http://example.com/de"/>
    </url>
With sitemap_alternate_links set, this would retrieve both URLs.
With sitemap_alternate_links disabled, only http://example.com/ would be retrieved.

Default is sitemap_alternate_links disabled.

sitemap_filter(entries): This is a filter function that could be overridden to select sitemap entries based on their attributes.

example:
    <url>
    <loc>http://example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    </url>

filter entries by date

from datetime import datetime
from scrapy.spiders import SitemapSpider

class FilteredSitemapSpider(SitemapSpider):
    name = 'filtered_sitemap_spider'
    allowed_domains = ['example.com']
    sitemap_urls = ['http://example.com/sitemap.xml']

    def sitemap_filter(self, entries):
        for entry in entries:
            date_time = datetime.strptime(entry['lastmod'], '%Y-%m-%d')
            if date_time.year >= 2005:
                yield entry

This would retrieve only entries modified in 2005 and the following years. Entries are dict objects extracted from the sitemap document. Usually, the key is the tag name and the value is the text inside it.

It’s important to notice that:

as the loc attribute is required, entries without this tag are discarded
alternate links are stored in a list with the key alternate (see     sitemap_alternate_links)
namespaces are removed, so lxml tags named as {namespace}tagname become only tagname

SitemapSpider examples: process all urls discovered through sitemaps using the parse callback

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']

    def parse(self, response):
        pass # ... scrape item here ...

Process some urls with certain callback and other urls with a different callback:

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    sitemap_rules = [
        ('/product/', 'parse_product'),
        ('/category/', 'parse_category'),
    ]

    def parse_product(self, response):
        pass # ... scrape product ...

    def parse_category(self, response):
        pass # ... scrape category ...

Follow sitemaps defined in the robots.txt file and only follow sitemaps whose url contains /sitemap_shop:

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/robots.txt']
    sitemap_rules = [
        ('/shop/', 'parse_shop'),
    ]
    sitemap_follow = ['/sitemap_shops']

    def parse_shop(self, response):
        pass # ... scrape shop here ...

Combine SitemapSpider with other sources of urls:

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/robots.txt']
    sitemap_rules = [
        ('/shop/', 'parse_shop'),
    ]

    other_urls = ['http://www.example.com/about']

    def start_requests(self):
        requests = list(super(MySpider, self).start_requests())
        requests += [scrapy.Request(x, self.parse_other) for x in self.other_urls]
        return requests

    def parse_shop(self, response):
        pass # ... scrape shop here ...

    def parse_other(self, response):
        pass # ... scrape other here ...

CSVFeedSpider

class scrapy.spiders.CSVFeedSpider This spider is very similar to the XMLFeedSpider, except that it iterates over rows, instead of nodes. The method that gets called in each iteration is parse_row(). delimiter: A string with the separator character for each field in the CSV file Defaults to ',' .

quotechar: A string with the enclosure character for each field in the CSV file Defaults to '"'

headers: A list of the column names

parse_row(response, row): Receives a response and a dict (representing each row) with a key for each provided (or detected) header of the CSV file. This spider also gives the opportunity to override adapt_response and process_results methods for pre- and post-processing purposes.

example:


from scrapy.spiders import CSVFeedSpider
from myproject.items import TestItem

class MySpider(CSVFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.csv']
    delimiter = ';'
    quotechar = "'"
    headers = ['id', 'name', 'description']

    def parse_row(self, response, row):
        self.logger.info('Hi, this is a row!: %r', row)

        item = TestItem()
        item['id'] = row['id']
        item['name'] = row['name']
        item['description'] = row['description']
        return item
keyfall commented 4 years ago

Xpath

<?xml version="1.0" encoding="UTF-8"?>

<bookstore>

<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>

<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>

</bookstore>

[Images omitted: tables of common XPath path expressions and predicates.] Reference: XPath, XQuery, and XSLT Function Reference.
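Since those tables are lost, here is a small sketch of the kinds of expressions they covered, run against the bookstore document above with Scrapy's Selector:

from scrapy.selector import Selector

xml = '''
<bookstore>
  <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book><title lang="eng">Learning XML</title><price>39.95</price></book>
</bookstore>'''

sel = Selector(text=xml, type='xml')
sel.xpath('/bookstore/book/title/text()').getall()            # ['Harry Potter', 'Learning XML']
sel.xpath('//title/@lang').getall()                           # ['eng', 'eng']
sel.xpath('/bookstore/book[1]/title/text()').get()            # 'Harry Potter'
sel.xpath('/bookstore/book[price>35]/title/text()').getall()  # ['Learning XML']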

keyfall commented 4 years ago

Selectors

Scrapy comes with its own mechanism for extracting data. They’re called selectors because they “select” certain parts of the HTML document specified either by XPath or CSS expressions.

Using selectors

Constructing selectors

Querying responses using XPath and CSS is so common that responses include two more shortcuts: response.xpath() and response.css():

>>> response.xpath('//span/text()').get()
'good'
>>> response.css('span::text').get()
'good'

'Selector' example:

>>> from scrapy.selector import Selector
>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').get()
'good'

Constructing from response - HtmlResponse is one of TextResponse subclasses:

>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
>>> response = HtmlResponse(url='http://example.com', body=body, encoding='utf-8')
>>> Selector(response=response).xpath('//span/text()').get()
'good'

Selector automatically chooses the best parsing rules (XML vs HTML) based on input type.

Working with selectors

scrapy shell https://docs.scrapy.org/en/latest/_static/selectors-sample1.html Then, inside the shell:

>>> response.xpath('//title/text()').getall()
['Example website']
>>> response.xpath('//title/text()').get()
'Example website'
>>> response.css('title::text').get()
'Example website'

get() returns a single result: if there are several matches it returns the first one, and if there is no match it returns nothing. getall() returns a list with all results.

The .xpath() and .css() methods return a SelectorList instance, which is a list of new selectors. This API can be used to quickly select nested data:

>>> response.css('img').xpath('@src').getall()
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

A default return value can be provided as an argument, to be used instead of None:

>>> response.xpath('//div[@id="not-exists"]/text()').get(default='not-found')
'not-found'

You can also query using the selector's attrib property:

>>> [img.attrib['src'] for img in response.css('img')]
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

Extensions to CSS selectors

To select text nodes, use ::text; to select attribute values, use ::attr(name).

title::text selects the child text nodes of a descendant <title> element:
>>> response.css('title::text').get()
'Example website'

*::text selects all descendant text nodes of the current selector context:
>>> response.css('#images *::text').getall()
['\n   ',
 'Name: My image 1 ',
 '\n   ',
 'Name: My image 2 ',
 '\n   ',
 'Name: My image 3 ',
 '\n   ',
 'Name: My image 4 ',
 '\n   ',
 'Name: My image 5 ',
 '\n  ']

foo::text returns no results if the foo element exists but contains no text (i.e. the text is empty):
>>> response.css('img::text').getall()
[]

Use default='' to get an empty string instead:
>>> response.css('img::text').get()
>>> response.css('img::text').get(default='')
''

a::attr(href) selects the href attribute value of descendant links:
>>> response.css('a::attr(href)').getall()
['image1.html',
 'image2.html',
 'image3.html',
 'image4.html',
 'image5.html']

Nested selectors

The selection methods (.xpath() or .css()) return a list of selectors of the same type, so you can call the selection methods on those selectors too.

>>> links = response.xpath('//a[contains(@href, "image")]')
>>> links.getall()
['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
 '<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
 '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
 '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
 '<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']

>>> for index, link in enumerate(links):
...     args = (index, link.xpath('@href').get(), link.xpath('img/@src').get())
...     print('Link number %d points to url %r and image %r' % args)

Link number 0 points to url 'image1.html' and image 'image1_thumb.jpg'
Link number 1 points to url 'image2.html' and image 'image2_thumb.jpg'
Link number 2 points to url 'image3.html' and image 'image3_thumb.jpg'
Link number 3 points to url 'image4.html' and image 'image4_thumb.jpg'
Link number 4 points to url 'image5.html' and image 'image5_thumb.jpg'

Selecting element attributes

The attrib property of a selector gives you the attributes of the underlying element, so you can look up attributes in code:

>>> [a.attrib['href'] for a in response.css('a')]
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

>>> response.css('base').attrib
{'href': 'http://example.com/'}
>>> response.css('base').attrib['href']
'http://example.com/'

Using selectors with regular expressions

>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
['My image 1',
 'My image 2',
 'My image 3',
 'My image 4',
 'My image 5']

.re_first() extracts only the first matching result:
>>> response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)')
'My image 1'

Working with XPath

An XPath that starts with / is absolute to the document rather than relative to the selector you call it from. If you match classes with contains(@class, 'someclass'), you may get more elements than you want; the more precise form is *[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]. Alternatively, locate with CSS first and then continue with XPath:

>>> from scrapy import Selector
>>> sel = Selector(text='<div class="hero shout"><time datetime="2014-07-23 19:00">Special date</time></div>')
>>> sel.css('.shout').xpath('./time/@datetime').getall()
['2014-07-23 19:00']

//node[1] selects all the nodes occurring first under their respective parents; (//node)[1] selects all the nodes in the document and then gets only the first of them.

When you need to use text content as an argument to an XPath string function, avoid using .//text() and use . instead. This is because the expression .//text() yields a collection of text elements, a node-set, and when a node-set is converted to a string (as happens when it is passed to a string function like contains() or starts-with()), only the text of the first element is produced.

>>> from scrapy import Selector
>>> sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')
>>> sel.xpath('//a//text()').getall() # take a peek at the node-set
['Click here to go to the ', 'Next Page']
>>> sel.xpath("string(//a[1]//text())").getall() # convert it to string
['Click here to go to the ']
>>> sel.xpath("//a[1]").getall() # select the first node
['<a href="#">Click here to go to the <strong>Next Page</strong></a>']
>>> sel.xpath("string(//a[1])").getall() # convert it to string
['Click here to go to the Next Page']
>>> sel.xpath("//a[contains(.//text(), 'Next Page')]").getall()
[]
>>> sel.xpath("//a[contains(., 'Next Page')]").getall()
['<a href="#">Click here to go to the <strong>Next Page</strong></a>']

Variables in XPath expressions

XPath lets you reference variables in your expressions using the $somevariable syntax. All variable references must have a bound value.

>>> # `$val` used in the expression, a `val` argument needs to be passed
>>> response.xpath('//div[@id=$val]/a/text()', val='images').get()
'Name: My image 1 '
>>> response.xpath('//div[count(a)=$cnt]/@id', cnt=5).get()
'images'

Removing namespaces

Given this document:


<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet ...
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/"
      xmlns:blogger="http://schemas.google.com/blogger/2008"
      xmlns:georss="http://www.georss.org/georss"
      xmlns:gd="http://schemas.google.com/g/2005"
      xmlns:thr="http://purl.org/syndication/thread/1.0"
      xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
  ...

Because of the xmlns declarations, trying to select all <link> objects returns nothing (the Atom XML namespace is obfuscating those nodes):

>>> response.xpath("//link")
[]

Use the Selector.remove_namespaces() method:

>>> response.selector.remove_namespaces()
>>> response.xpath("//link")
[<Selector xpath='//link' data='<link rel="alternate" type="text/html" h'>,
 <Selector xpath='//link' data='<link rel="next" type="application/atom+'>,
 ...

Namespace removal is not called by default because: 1. it is costly, since removing namespaces requires iterating over and modifying all nodes in the document; 2. sometimes namespaces are actually needed, in case some element names clash between namespaces.

Built-in Selectors reference

class scrapy.selector.Selector(response=None, text=None, type=None, root=None, **kwargs)

text is a unicode string or UTF-8 encoded text, used when response is not available. type defines the selector type; it can be "html", "xml" or None (default). If type is None and a response is passed, the selector type is inferred from the response type as follows:

"html" for HtmlResponse type, "xml" for XmlResponse type, "html" for anything else. Otherwise, if type is set, the selector type will be forced and no detection will occur.

xpath(query, namespaces=None, **kwargs) Find nodes matching the xpath query and return the result as a SelectorList instance with all elements flattened.

css(query) Apply the given CSS selector and return a SelectorList instance.

attrib Return the attributes dictionary for underlying element.

re(regex, replace_entities=True) By default, character entity references are replaced by their corresponding character (except for & and <). Passing replace_entities as False switches off these replacements.

register_namespace(prefix, uri) Register the given namespace to be used in this Selector. Without registering namespaces you can’t select or extract data from non-standard namespaces

class scrapy.selector.SelectorList The SelectorList class is a subclass of the builtin list class, which provides a few additional methods. like selector class.

Selector examples on XML response

sel = Selector(xml_response)
sel.register_namespace("g", "http://base.google.com/ns/1.0")
sel.xpath("//g:price").getall()

I think a namespace works like key-value storage: the value is the original URI, and the prefix is a convenient shorthand for long, repeated addresses.

keyfall commented 4 years ago

Items

Declaring Items

import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    tags = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)

Working with Items

>>> product = Product(name='Desktop PC', price=1000)
>>> print(product)
Product(name='Desktop PC', price=1000)

>>> product['name']
Desktop PC
>>> product.get('name')
Desktop PC

>>> product['price']
1000

>>> product['last_updated']
Traceback (most recent call last):
    ...
KeyError: 'last_updated'

>>> product.get('last_updated', 'not set')
not set

>>> product['lala'] # getting unknown field
Traceback (most recent call last):
    ...
KeyError: 'lala'

>>> product.get('lala', 'unknown field')
'unknown field'

>>> 'name' in product  # is name field populated?
True

>>> 'last_updated' in product  # is last_updated populated?
False

>>> 'last_updated' in product.fields  # is last_updated a declared field?
True

>>> 'lala' in product.fields  # is lala a declared field?
False

# setting field values
>>> product['last_updated'] = 'today'
>>> product['last_updated']
today

# all populated values

>>> product.keys()
['price', 'name']

>>> product.items()
[('price', 1000), ('name', 'Desktop PC')]

Copying items: shallow copy: product2 = product.copy()

Deep copy: product2 = product.deepcopy()

A shallow copy copies references (addresses), while a deep copy copies the content: it allocates a new piece of memory, gets a new address, and stores the content there.
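A small sketch of the practical difference, using the mutable tags field of the Product item declared above:

product = Product(name='Desktop PC', tags=['cheap'])

shallow = product.copy()       # shares the inner list object
deep = product.deepcopy()      # gets its own copy of the inner list

product['tags'].append('on sale')

shallow['tags']   # ['cheap', 'on sale']  -> affected by the change
deep['tags']      # ['cheap']             -> independent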

Extending Items You can extend Items (to add more fields or to change some metadata for some fields) by declaring a subclass of your original Item.

#add
class DiscountedProduct(Product):
    discount_percent = scrapy.Field(serializer=str)
    discount_expiration_date = scrapy.Field()

#change   
class SpecificProduct(Product):
    name = scrapy.Field(Product.fields['name'], serializer=my_serializer)

scrapy1.8

Field objects

class scrapy.item.Field([arg]) The Field class is just an alias to the built-in dict class and doesn’t provide any extra functionality or attributes.
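A minimal sketch of what that means in practice: any keyword passed to Field() is simply stored as metadata, which you can read back through the item class (using the Product item declared earlier):

# Field keys carry arbitrary metadata; Scrapy itself does not interpret them.
Product.fields['last_updated']          # {'serializer': <class 'str'>}
Product.fields['name']                  # {}  -> no metadata was declared

# Components such as serializers or Item Loader processors read this metadata:
Product.fields['last_updated'].get('serializer', str)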

Other classes related to Item

class scrapy.item.BaseItem

Base class for all scraped items. In Scrapy, an object is considered an item if it is an instance of either BaseItem or dict. For example, when the output of a spider callback is evaluated, only instances of BaseItem or dict are passed to item pipelines. If you need instances of a custom class to be considered items by Scrapy, you must inherit from either BaseItem or dict. Unlike instances of dict, instances of BaseItem may be tracked to debug memory leaks.

class scrapy.item.ItemMeta Metaclass of Item that handles field definitions.

scrapy 1.7 [documentation screenshot omitted]

keyfall commented 4 years ago

Item Loaders

Item Loaders are designed to provide a flexible, efficient and easy mechanism for extending and overriding different field parsing rules, either by spider, or by source format (HTML, XML, etc) without becoming a nightmare to maintain.

Using Item Loaders to populate items

from scrapy.loader import ItemLoader
from myproject.items import Product

def parse(self, response):
    l = ItemLoader(item=Product(), response=response)
    l.add_xpath('name', '//div[@class="product_name"]')
    l.add_xpath('name', '//div[@class="product_title"]')
    l.add_xpath('price', '//p[@id="price"]')
    l.add_css('stock', 'p#stock')
    l.add_value('last_updated', 'today') # you can also use literal values
    return l.load_item()

Input and Output processors

l = ItemLoader(Product(), some_selector)
l.add_xpath('name', xpath1) # (1) data from xpath1 is extracted and passed through the input processor of name; the result is collected (not yet assigned)
l.add_xpath('name', xpath2) # (2) data from xpath2 goes through the same input processor; the result is appended to (1)
l.add_css('name', css) # (3) same as (2), but the data is extracted with a CSS selector
l.add_value('name', 'test') # (4) the literal value is also passed through the input processor and appended
return l.load_item() # (5) the data collected in (1)-(4) is passed through the output processor of name and assigned to the item

Declaring Item Loaders

from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose, Join

class ProductLoader(ItemLoader):

    default_output_processor = TakeFirst()

    name_in = MapCompose(unicode.title)
    name_out = Join()

    price_in = MapCompose(unicode.strip)

    # ...

Declaring Input and Output Processors

import scrapy
from scrapy.loader.processors import Join, MapCompose, TakeFirst
from w3lib.html import remove_tags

def filter_price(value):
    if value.isdigit():
        return value

class Product(scrapy.Item):
    name = scrapy.Field(
        input_processor=MapCompose(remove_tags),
        output_processor=Join(),
    )
    price = scrapy.Field(
        input_processor=MapCompose(remove_tags, filter_price),
        output_processor=TakeFirst(),
    )
>>> from scrapy.loader import ItemLoader
>>> il = ItemLoader(item=Product())
>>> il.add_value('name', [u'Welcome to my', u'<strong>website</strong>'])
>>> il.add_value('price', [u'&euro;', u'<span>1000</span>'])
>>> il.load_item()
{'name': u'Welcome to my website', 'price': u'1000'}

The u prefix (as in u'sdf') means the string that follows is stored as unicode. The precedence of input and output processors is as follows:

1. Item Loader field-specific attributes: field_in and field_out (highest precedence)
2. Field metadata (the input_processor and output_processor keys)
3. Item Loader defaults: ItemLoader.default_input_processor() and ItemLoader.default_output_processor() (lowest precedence)

Item Loader Context

The Item Loader Context is a dict of arbitrary key/values which is shared among all input and output processors in the Item Loader.They are used to modify the behaviour of the input/output processors.

def parse_length(text, loader_context):
    unit = loader_context.get('unit', 'm')
    # ... length parsing code goes here ...
    return parsed_length

There are several ways to modify the Item Loader context values:

# By modifying the currently active Item Loader context (the context attribute):
loader = ItemLoader(product)
loader.context['unit'] = 'cm'

# On Item Loader instantiation (keyword arguments of the constructor are stored in the context):
loader = ItemLoader(product, unit='cm')

# In the Item Loader declaration, for input/output processors that support being instantiated with a context. MapCompose is one of them:
class ProductLoader(ItemLoader):
    length_out = MapCompose(parse_length, unit='cm')

ItemLoader objects

class scrapy.loader.ItemLoader([item, selector, response, ]**kwargs) Return a new Item Loader for populating the given Item. If no item is given, one is instantiated automatically using the class in default_item_class.

parameters: item-The item instance to populate using subsequent calls to add_xpath(), add_css(), or add_value(). Selector-The selector to extract data from, when using the add_xpath() (resp. add_css()) or replace_xpath() (resp. replace_css()) method. response-The response used to construct the selector using the default_selector_class, unless the selector argument is given, in which case this argument is ignored.

methods: get_value(value, *processors, **kwargs) Process the given value by the given processors and keyword arguments. Available keyword arguments: Parameters: re (str or compiled regex) – a regular expression to use for extracting data from the given value using extract_regex() method, applied before processors

>>> from scrapy.loader.processors import TakeFirst
>>> loader.get_value(u'name: foo', TakeFirst(), unicode.upper, re='name: (.+)')
'FOO'

add_value(field_name, value, *processors, **kwargs) Process and then add the given value for the given field. The value is first passed through get_value() with the given processors and kwargs, then through the field input processor, and its result is appended to the data collected for that field. If the field already contains collected data, the new data is added.

The given field_name can be None, in which case values for multiple fields may be added; the processed value should then be a dict with field names mapped to values.

loader.add_value('name', u'Color TV')
loader.add_value('colours', [u'white', u'blue'])
loader.add_value('length', u'100')
loader.add_value('name', u'name: foo', TakeFirst(), re='name: (.+)')
loader.add_value(None, {'name': u'foo', 'sex': u'male'})

replace_value(field_name, value, *processors, **kwargs) Replaces the collected value with the new value instead of adding it.

get_xpath(xpath, *processors, **kwargs) is used to extract a list of unicode strings from the selector associated with this ItemLoader. Parameters: xpath (str) – the XPath to extract data from re (str or compiled regex) – a regular expression to use for extracting data from the selected XPath region

# HTML snippet: <p class="product-name">Color TV</p>
loader.get_xpath('//p[@class="product-name"]')
# HTML snippet: <p id="price">the price is $1200</p>
loader.get_xpath('//p[@id="price"]', TakeFirst(), re='the price is (.*)')

add_xpath(field_name, xpath, *processors, **kwargs) used to extract a list of unicode strings from the selector associated with this ItemLoader.

# HTML snippet: <p class="product-name">Color TV</p>
loader.add_xpath('name', '//p[@class="product-name"]')
# HTML snippet: <p id="price">the price is $1200</p>
loader.add_xpath('price', '//p[@id="price"]', re='the price is (.*)')

get_css(css, *processors, **kwargs)

# HTML snippet: <p class="product-name">Color TV</p>
loader.get_css('p.product-name')
# HTML snippet: <p id="price">the price is $1200</p>
loader.get_css('p#price', TakeFirst(), re='the price is (.*)')

add_css(field_name, css, *processors, **kwargs)

# HTML snippet: <p class="product-name">Color TV</p>
loader.add_css('name', 'p.product-name')
# HTML snippet: <p id="price">the price is $1200</p>
loader.add_css('price', 'p#price', re='the price is (.*)')

replace_css(field_name, css, *processors, **kwargs)

load_item() Populate the item with the data collected so far, and return it.

nested_xpath(xpath) Create a nested loader with an xpath selector.

get_collected_values(field_name) Return the collected values for the given field.

get_output_value(field_name) Return the collected values parsed using the output processor, for the given field.

get_input_processor(field_name) Return the input processor for the given field.

get_output_processor(field_name) Return the output processor for the given field.

attributes: item:The Item object being parsed by this Item Loader.

context The currently active Context of this Item Loader.

default_item_class An Item class (or factory) used to instantiate items when none is given in the constructor.

default_input_processor The default input processor used for fields which don't specify one.

default_output_processor The default output processor used for fields which don't specify one.

default_selector_class The class used to construct the selector of this ItemLoader, if only a response is given in the constructor. If a selector is given in the constructor this attribute is ignored. This attribute is sometimes overridden in subclasses.

selector The Selector object from which data is extracted. It is either the selector given in the constructor or one created from the response using default_selector_class. This attribute is read-only.

Nested Loaders

When parsing related values from a subsection of a document, it can be useful to create nested loaders. example:

<footer>
    <a class="social" href="https://facebook.com/whatever">Like Us</a>
    <a class="social" href="https://twitter.com/whatever">Follow Us</a>
    <a class="email" href="mailto:whatever@example.com">Email Us</a>
</footer>

no nested loaders

loader = ItemLoader(item=Item())
# load stuff not in the footer
loader.add_xpath('social', '//footer/a[@class = "social"]/@href')
loader.add_xpath('email', '//footer/a[@class = "email"]/@href')
loader.load_item()

nested loaders

loader = ItemLoader(item=Item())
# load stuff not in the footer
footer_loader = loader.nested_xpath('//footer')
footer_loader.add_xpath('social', 'a[@class = "social"]/@href')
footer_loader.add_xpath('email', 'a[@class = "email"]/@href')
# no need to call footer_loader.load_item()
loader.load_item()

Reusing and extending Item Loaders

For example, suppose product names come wrapped in leading and trailing dashes that you want to strip; you can remove those dashes by reusing and extending the default Product Item Loader (ProductLoader):

from scrapy.loader.processors import MapCompose
from myproject.ItemLoaders import ProductLoader

def strip_dashes(x):
    return x.strip('-')

class SiteSpecificLoader(ProductLoader):
    name_in = MapCompose(strip_dashes, ProductLoader.name_in)

Available built-in processors

class scrapy.loader.processors.Identity The simplest processor: it does nothing and returns the original values unchanged. It doesn't receive any constructor arguments, nor does it accept Loader contexts.

>>> from scrapy.loader.processors import Identity
>>> proc = Identity()
>>> proc(['one', 'two', 'three'])
['one', 'two', 'three']

class scrapy.loader.processors.TakeFirst Returns the first non-null/non-empty value from the values received, so it's typically used as an output processor for single-valued fields. It doesn't receive any constructor arguments, nor does it accept Loader contexts.

>>> from scrapy.loader.processors import TakeFirst
>>> proc = TakeFirst()
>>> proc(['', 'one', 'two', 'three'])
'one'

class scrapy.loader.processors.Join(separator=u' ') Returns the values joined with the separator given in the constructor, which defaults to u' '. It doesn't accept Loader contexts. When using the default separator, this processor is equivalent to the function u' '.join.

>>> from scrapy.loader.processors import Join
>>> proc = Join()
>>> proc(['one', 'two', 'three'])
'one two three'
>>> proc = Join('<br>')
>>> proc(['one', 'two', 'three'])
'one<br>two<br>three'

class scrapy.loader.processors.Compose(*functions, **default_loader_context) A processor constructed from the composition of the given functions. Each input value of this processor is passed to the first function, the result of that function is passed to the second function, and so on, until the last function returns the output value of this processor. By default, processing stops on a None value; this behaviour can be changed by passing the keyword argument stop_on_none=False.

>>> from scrapy.loader.processors import Compose
>>> proc = Compose(lambda v: v[0], str.upper)
>>> proc(['hello', 'world'])
'HELLO'

class scrapy.loader.processors.MapCompose(*functions, **default_loader_context) The input value of this processor is iterated and the first function is applied to each element. The results of these function calls (one per element) are concatenated to construct a new iterable, which is then used to apply the second function, and so on, until the last function has been applied to each value of the list of values collected so far. The output values of the last function are concatenated to produce the output of this processor.

Compose receives the whole list at once, while MapCompose takes the elements out of the list and passes them in one by one.
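A tiny sketch of that difference (the functions here are arbitrary):

from scrapy.loader.processors import Compose, MapCompose

Compose(lambda values: values[0])(['hello', 'world'])   # 'hello'  -> the function sees the whole list
MapCompose(str.upper)(['hello', 'world'])               # ['HELLO', 'WORLD']  -> applied element by element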

class scrapy.loader.processors.SelectJmes(json_path) Queries the value using the JSON path provided to the constructor and returns the output.

>>> from scrapy.loader.processors import SelectJmes, Compose, MapCompose
>>> proc = SelectJmes("foo") #for direct use on lists and dictionaries
>>> proc({'foo': 'bar'})
'bar'
>>> proc({'foo': {'bar': 'baz'}})
{'bar': 'baz'}

>>> import json
>>> proc_json_list = Compose(json.loads, MapCompose(SelectJmes('foo')))
>>> proc_json_list('[{"foo":"bar"}, {"baz":"tar"}]')
['bar']
keyfall commented 4 years ago

Scrapy shell

Configuring the shell

The official docs recommend installing IPython as a replacement for the plain Python shell: pip install ipython

You can set the SCRAPY_PYTHON_SHELL environment variable, or configure it in scrapy.cfg:

[settings]
shell = bpython

Launch the shell

scrapy shell <url> The shell also works with local files:

# UNIX-style
scrapy shell ./path/to/file.html
scrapy shell ../other/path/to/file.html
scrapy shell /absolute/path/to/file.html

# File URI
scrapy shell file:///absolute/path/to/file.html

using the shell

Shortcuts:
shelp() - print a help message with the list of available objects and shortcuts
fetch(url[, redirect=True]) - fetch a new response from the given URL and update all related objects accordingly; you can optionally ask for HTTP 3xx redirects not to be followed by passing redirect=False
fetch(request) - fetch a new response from the given request and update all related objects accordingly
view(response) - open the given response in your local web browser for inspection. This adds a <base> tag to the response body so that external links (such as images and style sheets) display correctly. Note, however, that this creates a temporary file on your computer which will not be removed automatically.

Scrapy objects:
crawler - the current Crawler object
spider - the Spider which is known to handle the URL, or a Spider object if there is no spider found for the current URL
request - a Request object of the last fetched page. You can modify this request using replace() or fetch a new request (without leaving the shell) using the fetch shortcut
response - a Response object containing the last fetched page
settings - the current Scrapy settings
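A short hedged example of a session using these shortcuts, assuming the quotes.toscrape.com page used earlier in these notes (log output omitted):

>>> fetch('http://quotes.toscrape.com/page/1/')
>>> response.status
200
>>> response.css('title::text').get()
'Quotes to Scrape'
>>> view(response)
[ ... browser opens the downloaded page ... ]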

keyfall commented 4 years ago

Item Pipeline

After an item has been scraped by a spider, it is sent to the Item Pipeline; in other words, pipelines are where the scraped data gets processed.

Writing your own item pipeline

Each item pipeline component is a Python class that must implement the following method: process_item(self, item, spider). This method is called for every item pipeline component. process_item() must either: return a dict with data, return an Item (or any descendant class) object, return a Twisted Deferred, or raise a DropItem exception. Dropped items are no longer processed by further pipeline components. Parameters: item (Item object or a dict) - the item scraped; spider (Spider object) - the spider which scraped the item. The following methods can also be implemented: open_spider(self, spider): called when the spider is opened.

close_spider(self, spider): called when the spider is closed.

from_crawler(cls, crawler): if present, this classmethod is called to create a pipeline instance from a Crawler. It must return a new instance of the pipeline. The Crawler object provides access to all Scrapy core components, such as settings and signals; it is a way for the pipeline to access them and hook its functionality into Scrapy.

Item pipeline examples

Price validation and dropping items with no prices:

from scrapy.exceptions import DropItem

class PricePipeline(object):

    vat_factor = 1.15

    def process_item(self, item, spider):
        if item.get('price'):
            if item.get('price_excludes_vat'):
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            raise DropItem("Missing price in %s" % item)

Write items to a JSON file:

import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

Write items to MongoDB:

import pymongo

class MongoPipeline(object):

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item

Duplicates filter: a filter that looks for duplicate items and drops those that were already processed. Suppose our items have a unique id, but our spider returns multiple items with the same id:

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item

Activating an Item Pipeline component

To activate an Item Pipeline component, you must add its class to the ITEM_PIPELINES setting, as in the following example:

ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}

The integer values assigned to the classes in this setting determine the order in which they run: items go through the pipelines from lower-valued to higher-valued classes. It is customary to define these numbers in the 0-1000 range.

mugpeng commented 4 years ago

Hi, a quick question: after I wrote the dict to scrape the information from the specified pages and saved it as a JSON file, I found that the generated JSON file was empty. I deleted it and retried several times, also used your code, and tried .jl and .csv as well, with the same result. Have you ever run into this? Thanks.

keyfall commented 4 years ago

Hi, a quick question: after I wrote the dict to scrape the information from the specified pages and saved it as a JSON file, I found that the generated JSON file was empty. I deleted it and retried several times, also used your code, and tried .jl and .csv as well, with the same result. Have you ever run into this? Thanks.

Hi, I'm not sure which part you mean. Could you post a screenshot of the code, or paste the code block directly into a comment, so I can take a look? I just tried it [image]: that piece of code was written incorrectly; there is no ::text there, so deleting it fixes the problem.