Open keyfall opened 4 years ago
command line tool
Scrapy will look for configuration parameters in scrapy.cfg files in standard locations:
1. /etc/scrapy.cfg or c:\scrapy\scrapy.cfg (system-wide)
2. ~/.config/scrapy.cfg ($XDG_CONFIG_HOME) and ~/.scrapy.cfg ($HOME) for global (user-wide) settings
3. scrapy.cfg inside a Scrapy project's root
Settings are merged in order of preference 3 > 2 > 1: project-wide settings override user-wide settings, which override system-wide settings.
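The merge order can be sketched with plain dicts (illustrative setting values only; Scrapy's real loader reads the .cfg files):

```python
# System-wide settings are applied first, then user-wide, then project-wide,
# so later layers win. This mirrors the 3 > 2 > 1 precedence described above.
system_wide = {'BOT_NAME': 'sysbot', 'DOWNLOAD_DELAY': 2}
user_wide = {'BOT_NAME': 'userbot'}
project = {'BOT_NAME': 'projectbot'}

merged = {}
for layer in (system_wide, user_wide, project):  # lowest to highest precedence
    merged.update(layer)

print(merged['BOT_NAME'])        # projectbot
print(merged['DOWNLOAD_DELAY'])  # 2 (only set system-wide)
```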
scrapy.cfg
myproject1/
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
myproject2/
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
This is the structure of a Scrapy project directory; a single directory can contain multiple projects, such as myproject1 and myproject2 above.
[settings]
default = myproject1.settings
project1 = myproject1.settings
project2 = myproject2.settings
In the [settings] section of scrapy.cfg you must define one or more aliases, each pointing to a project's settings module; default is used when no project is selected.
$ scrapy settings --get BOT_NAME
Project 1 Bot
$ export SCRAPY_PROJECT=project2
$ scrapy settings --get BOT_NAME
Project 2 Bot
The commands above are for a Linux shell; on Windows, replace export with set. The point is that there are two settings modules, declared under [settings] in scrapy.cfg, and the SCRAPY_PROJECT environment variable selects between them.
Running scrapy with no arguments prints the available commands:
Usage:
scrapy <command> [options] [args]
Available commands:
bench Run quick benchmark test
check Check spider contracts
crawl Run a spider
edit Edit spider
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
list List available spiders
parse Parse URL (using its spider) and print the results
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy
scrapy startproject myproject [project_dir]
That will create a Scrapy project under the project_dir directory. If project_dir isn't specified, it defaults to myproject.
there are two kinds of commands:
Global commands: startproject genspider settings runspider shell fetch view version
Project-only commands: crawl check list edit parse bench
<> denotes a required argument; [] denotes an optional one.
startproject
syntax:scrapy startproject <project_name> [project_dir]
genspider
syntax:scrapy genspider [-t template] <name> <domain>
usage example:
$ scrapy genspider -l
Available templates:
basic
crawl
csvfeed
xmlfeed
$ scrapy genspider example example.com
Created spider 'example' using template 'basic'
$ scrapy genspider -t crawl scrapyorg scrapy.org
Created spider 'scrapyorg' using template 'crawl'
If called from inside a project, the new spider is created in the project's spiders folder; otherwise it is created in the current folder. The name parameter is set as the spider's name, and domain is used to generate the allowed_domains and start_urls attributes. allowed_domains lists the domains the spider is allowed to crawl.
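As a rough sketch of how genspider maps its arguments onto the generated spider's attributes (spider_attrs is an illustrative helper, not a Scrapy function):

```python
def spider_attrs(name, domain):
    # The basic template fills in roughly these attributes
    # from the <name> and <domain> arguments.
    return {
        'name': name,
        'allowed_domains': [domain],
        'start_urls': ['http://%s/' % domain],
    }

print(spider_attrs('example', 'example.com'))
```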
crawl
scrapy crawl <spider>
start crawling using a spider
$ scrapy crawl myspider
[ ... myspider starts crawling ... ]
check
scrapy check [-l] <spider>
run contract checks
$ scrapy check -l
first_spider
* parse
* parse_item
second_spider
* parse
* parse_item
$ scrapy check
[FAILED] first_spider:parse_item
>>> 'RetailPricex' field is missing
[FAILED] first_spider:parse
>>> Returned 92 requests, expected 0..4
list
scrapy list
List all available spiders in the current project.
$ scrapy list
spider1
spider2
edit
scrapy edit <spider>
Edit the given spider using the editor defined in the EDITOR environment variable or the EDITOR setting.
scrapy edit spider1
The editor can also be set through the EDITOR setting in the settings module.
fetch
scrapy fetch <url>
downloads the given URL using the Scrapy downloader and writes the contents to standard output.
supported options:
--spider=SPIDER: bypass spider autodetection and force use of a specific spider
--headers: print the response's HTTP headers instead of the response's body
--no-redirect: do not follow HTTP 3xx redirects
$ scrapy fetch --nolog http://www.example.com/some/page.html
[ ... html content here ... ]
$ scrapy fetch --nolog --headers http://www.example.com/
{'Accept-Ranges': ['bytes'],
'Age': ['1263 '],
'Connection': ['close '],
'Content-Length': ['596'],
'Content-Type': ['text/html; charset=UTF-8'],
'Date': ['Wed, 18 Aug 2010 23:59:46 GMT'],
'Etag': ['"573c1-254-48c9c87349680"'],
'Last-Modified': ['Fri, 30 Jul 2010 15:30:18 GMT'],
'Server': ['Apache/2.2.3 (CentOS)']}
view
scrapy view <url>
Sometimes spiders see pages differently from regular users, so this can be used to check what the spider “sees” and confirm it’s what you expect.
Supported options:
--spider=SPIDER: bypass spider autodetection and force use of specific spider
--no-redirect: do not follow HTTP 3xx redirects (default is to follow them)
$ scrapy view http://www.example.com/some/page.html
[ ... browser starts ... ]
shell
scrapy shell [url]
Starts the Scrapy shell for the given URL (or empty, if no URL is given); the 'Scrapy shell' section has more info.
Supported options:
--spider=SPIDER: bypass spider autodetection and force use of specific spider
-c code: evaluate the code in the shell, print the result and exit
--no-redirect: do not follow HTTP 3xx redirects (default is to follow them); this only affects the URL you may pass as argument on the command line; once you are inside the shell, fetch(url) will still follow HTTP redirects by default.
$ scrapy shell http://www.example.com/some/page.html
[ ... scrapy shell starts ... ]
$ scrapy shell --nolog http://www.example.com/ -c '(response.status, response.url)'
(200, 'http://www.example.com/')
# shell follows HTTP redirects by default
$ scrapy shell --nolog http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F -c '(response.status, response.url)'
(200, 'http://example.com/')
# you can disable this with --no-redirect
# (only for the URL passed as command line argument)
$ scrapy shell --no-redirect --nolog http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F -c '(response.status, response.url)'
(302, 'http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F')
parse
scrapy parse <url> [options]
Fetches the given URL and parses it with the spider that handles it, using the method passed with the --callback option, or parse if not given.
Supported options:
--spider=SPIDER: bypass spider autodetection and force use of specific spider
-a NAME=VALUE: set spider argument (may be repeated)
--callback or -c: spider method to use as callback for parsing the response
--meta or -m: additional request meta that will be passed to the callback request. This must be a valid JSON string. Example: --meta='{"foo" : "bar"}'
--cbkwargs: additional keyword arguments that will be passed to the callback. This must be a valid JSON string. Example: --cbkwargs='{"foo" : "bar"}'
--pipelines: process items through pipelines
--rules or -r: use CrawlSpider rules to discover the callback (i.e. spider method) to use for parsing the response
--noitems: don’t show scraped items
--nolinks: don’t show extracted links
--nocolour: avoid using pygments to colorize the output
--depth or -d: depth level for which the requests should be followed recursively (default: 1)
--verbose or -v: display information for each depth level
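Since --meta and --cbkwargs must be valid JSON strings, it can help to check that the string parses before passing it on the command line (plain stdlib json):

```python
import json

# The value given to --meta='{"foo" : "bar"}' must survive json.loads:
meta = json.loads('{"foo" : "bar"}')
print(meta)  # {'foo': 'bar'}
```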
$ scrapy parse http://www.example.com/ -c parse_item
[ ... scrapy log lines crawling example.com spider ... ]
>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items ------------------------------------------------------------
[{'name': 'Example item',
'category': 'Furniture',
'length': '12 cm'}]
# Requests -----------------------------------------------------------------
[]
Note: after scrapy parse <url> you typically need --spider to pick the spider and/or -c to pick the callback method used for parsing.
settings
scrapy settings [options]
get the value of a scrapy setting.
If used inside a project it’ll show the project setting value, otherwise it’ll show the default Scrapy value for that setting.
Options:
--help, -h: show this help message and exit
--get=SETTING: print raw setting value
--getbool=SETTING: print setting value, interpreted as a boolean
--getint=SETTING: print setting value, interpreted as an integer
--getfloat=SETTING: print setting value, interpreted as a float
--getlist=SETTING: print setting value, interpreted as a list
$ scrapy settings --get BOT_NAME
scrapybot
$ scrapy settings --get DOWNLOAD_DELAY
0
runspider
scrapy runspider <spider_file.py>
Run a spider self-contained in a Python file, without having to create a project.
$ scrapy runspider myspider.py
[ ... spider starts crawling ... ]
version
scrapy version [-v]
Prints the Scrapy version. If used with -v it also prints Python, Twisted and Platform info, which is useful for bug reports.
bench
scrapy bench
Run a quick benchmark test.
The Benchmarking section has more info.
You can also add custom project commands by using the COMMANDS_MODULE setting. See the Scrapy commands at "https://github.com/scrapy/scrapy/tree/master/scrapy/commands" for examples of how to implement your commands.
A module to use for looking up custom Scrapy commands. This is used to add custom commands for your Scrapy project.
COMMANDS_MODULE = 'mybot.commands'
You can also add Scrapy commands from an external library by adding a scrapy.commands section in the entry points of the library setup.py file.
from setuptools import setup, find_packages

setup(name='scrapy-mymodule',
      entry_points={
          'scrapy.commands': [
              'my_command=my_scrapy_module.commands:MyCommand',
          ],
      },
      )
Spiders are classes which define how a certain site (or a group of sites) will be scraped. The scraping cycle goes through: 1. You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.
2. In the callback function, you parse the response (web page) and return either dicts with extracted data, Item objects, Request objects, or an iterable of these objects. Returned Requests may also carry a callback, which Scrapy will then invoke with their responses.
3.In callback functions, you parse the page contents, typically using Selectors (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer) and generate items with the parsed data.
4.Finally, the items returned from the spider will be typically persisted to a database (in some Item Pipeline) or written to a file using Feed exports.
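The four steps above can be sketched as a toy request/callback loop (this is an illustration of the flow only, not Scrapy's engine; the URL strings and tuple "requests" are stand-ins):

```python
# A callback parses a "response" and yields items (dicts) and follow-up
# requests, each carrying its own callback - mirroring steps 1-3 above.
def parse(response):
    yield {'url': response}                # an extracted item
    if response == 'page1':
        yield ('request', 'page2', parse)  # a follow-up request with a callback

def crawl(start):
    queue, items = [(start, parse)], []
    while queue:
        url, callback = queue.pop(0)
        for result in callback(url):
            if isinstance(result, dict):
                items.append(result)           # step 4: collect/persist items
            else:
                _, next_url, cb = result
                queue.append((next_url, cb))   # schedule the new request
    return items

print(crawl('page1'))  # [{'url': 'page1'}, {'url': 'page2'}]
```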
class scrapy.spiders.Spider — every spider must inherit from this class. It provides no special functionality, just a default start_requests() implementation which sends requests from the start_urls spider attribute and calls the spider's parse method for each of the resulting responses.
name: a string which defines the name for this spider. The spider name is how the spider is located (and instantiated) by Scrapy, so it is required and must be unique; nonetheless, you can instantiate multiple instances of the same spider. A common convention is to name the spider after the single domain it scrapes.
allowed_domains: an optional list of strings containing domains that this spider is allowed to crawl. If OffsiteMiddleware is enabled, requests for URLs not belonging to allowed_domains won't be followed.
For example, to crawl 'https://www.example.com/1.html', add 'example.com' to the list.
start_urls: a list of URLs where the spider will begin to crawl from, when no particular URLs are specified.
custom_settings: a dictionary of settings that will be overridden from the project-wide configuration when running this spider. For a list of available built-in settings see: Built-in settings reference.
crawler: this attribute is set by the from_crawler() class method after initializing the class, and links to the Crawler object to which this spider instance is bound.
settings: configuration for running this spider; a Settings instance.
logger: Python logger created with the spider's name. You can use it to send log messages as described in Logging from Spiders.
from_crawler(crawler, *args, **kwargs): the class method used by Scrapy to create your spiders. You probably won't need to override it; this method sets the crawler and settings attributes in the new instance so they can be accessed later inside the spider's code. Parameters: crawler (Crawler instance) - crawler to which the spider will be bound; args (list) - arguments passed to the __init__() method; kwargs (dict) - keyword arguments passed to the __init__() method.
start_requests(): this method must return an iterable with the first Requests to crawl for this spider. Scrapy calls it only once, so it is safe to implement it as a generator. The default implementation generates Request(url, dont_filter=True) for each url in start_urls.
parse(response): the default callback used to process downloaded responses.
log(message): wrapper that sends a log message through the spider's logger, kept for backward compatibility. For more information see Logging from Spiders.
closed(reason): called when the spider closes.
Spiders can receive arguments that modify their behaviour.
Keep in mind that spider arguments are only strings. The spider will not do any parsing on its own.
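Because arguments always arrive as strings, convert them yourself (typically in __init__); a minimal sketch of such conversion logic (normalize_args is an illustrative helper, not a Scrapy API):

```python
def normalize_args(page='1', debug='false'):
    # -a page=3 -a debug=true arrive as the strings '3' and 'true';
    # convert them explicitly to the types the spider needs.
    return int(page), debug.lower() in ('1', 'true', 'yes')

print(normalize_args('3', 'True'))  # (3, True)
```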
Spider arguments can be passed in several ways:
1.command line:
scrapy crawl myspider -a category=electronics
2.__init__ method:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.example.com/categories/%s' % category]
        # ...
3.Spider arguments can also be passed through the Scrapyd schedule.json API. See Scrapyd documentation
class scrapy.spiders.CrawlSpider — the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules. It adds a new attribute, rules: a list of one (or more) Rule objects, each defining a certain behaviour for crawling the site. It also adds an overrideable method, parse_start_url(response): called for the start_urls responses, it allows parsing the initial responses and must return either an Item object, a Request object, or an iterable containing any of them.
scrapy.spiders.Rule(link_extractor=None, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)
link_extractor is a Link Extractor object which defines how links will be extracted from each crawled page.
callback is a callable or a string (in which case a method from the spider object with that name will be used) to be called for each link extracted with the specified link extractor. warning:When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
cb_kwargs is a dict containing the keyword arguments to be passed to the callback function.
follow is a boolean which specifies if links should be followed from each response extracted with this rule. If callback is None follow defaults to True, otherwise it defaults to False.
process_links is a callable, or a string (in which case a method from the spider object with that name will be used) which will be called for each list of links extracted from each response using the specified link_extractor. This is mainly used for filtering purposes.
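The documented default for follow can be captured in one line (follow_default is an illustrative helper mirroring the rule, not Scrapy code):

```python
def follow_default(callback, follow=None):
    # follow defaults to True when there is no callback, otherwise False
    return follow if follow is not None else callback is None

print(follow_default(None))          # True  - no callback, so keep following
print(follow_default('parse_item'))  # False - callback given, stop following
```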
example:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        # LinkExtractor parameters:
        #   allow: URLs matching the regex(es) are extracted; if empty, everything matches.
        #   deny: URLs matching the regex(es) are never extracted.
        #   allow_domains: domains whose links will be extracted.
        #   deny_domains: domains whose links will never be extracted.
        #   restrict_xpaths: XPath expressions that filter links together with allow.
        Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').get()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').get()
        item['link_text'] = response.meta['link_text']
        return item
class scrapy.spiders.XMLFeedSpider — XMLFeedSpider is designed for parsing XML feeds by iterating through them by a certain node name. The iterator can be 'iternodes', 'xml', or 'html'; 'iternodes' is recommended for performance reasons.
To set the iterator and the tag name, you must define the following class attributes: iterator: A string which defines the iterator to use. It can be either:
'iternodes' - a fast iterator based on regular expressions
'html' - an iterator which uses Selector. Keep in mind this uses DOM parsing and must load all DOM in memory which could be a problem for big feeds
'xml' - an iterator which uses Selector. Keep in mind this uses DOM parsing and must load all DOM in memory which could be a problem for big feeds
It defaults to: 'iternodes'.
itertag: A string with the name of the node (or element) to iterate in.
namespaces: A list of (prefix, uri) tuples which define the namespaces available in that document that will be processed with this spider. The prefix and uri will be used to automatically register namespaces using the register_namespace() method. You can then specify nodes with namespaces in the itertag attribute.
class YourSpider(XMLFeedSpider):
    namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
    itertag = 'n:url'
    # ...
Overrideable methods: adapt_response(response): a method that receives the response as soon as it arrives from the spider middleware, before the spider starts parsing it. It can be used to modify the response body before parsing; it must return a response (the same one or a modified one).
parse_node(response, selector): this method is called for the nodes matching the provided tag name (itertag). It must be overridden.
process_results(response,results): This method is called for each result (item or request) returned by the spider. it’s intended to perform any last time processing required before returning the results to the framework core
example:
from scrapy.spiders import XMLFeedSpider
from myproject.items import TestItem

class MySpider(XMLFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.xml']
    iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.getall()))
        item = TestItem()
        item['id'] = node.xpath('@id').get()
        item['name'] = node.xpath('name').get()
        item['description'] = node.xpath('description').get()
        return item
class scrapy.spiders.SitemapSpider SitemapSpider allows you to crawl a site by discovering the URLs using Sitemaps. sitemap_urls: A list of urls pointing to the sitemaps whose urls you want to crawl.
sitemap_rules: A list of tuples (regex, callback) where:
regex is a regular expression to match urls extracted from sitemaps. regex can be either a str or a compiled regex object.
callback is the callback to use for processing the urls that match the regular expression. callback can be a string (indicating the name of a spider method) or a callable.
sitemap_rules = [('/product/', 'parse_product')]
sitemap_follow: A list of regexes of sitemap that should be followed. This is only for sites that use Sitemap index files that point to other sitemap files. By default, all sitemaps are followed.
sitemap_alternate_links: Specifies if alternate links for one url should be followed. These are links for the same website in another language passed within the same url block.
<url>
    <loc>http://example.com/</loc>
    <xhtml:link rel="alternate" hreflang="de" href="http://example.com/de"/>
</url>
With sitemap_alternate_links set, this would retrieve both URLs.
With sitemap_alternate_links disabled, only http://example.com/ would be retrieved.
Default is sitemap_alternate_links disabled.
sitemap_filter(entries): This is a filter function that could be overridden to select sitemap entries based on their attributes.
example:
<url>
    <loc>http://example.com/</loc>
    <lastmod>2005-01-01</lastmod>
</url>
filter entries by date
from datetime import datetime
from scrapy.spiders import SitemapSpider

class FilteredSitemapSpider(SitemapSpider):
    name = 'filtered_sitemap_spider'
    allowed_domains = ['example.com']
    sitemap_urls = ['http://example.com/sitemap.xml']

    def sitemap_filter(self, entries):
        for entry in entries:
            date_time = datetime.strptime(entry['lastmod'], '%Y-%m-%d')
            if date_time.year >= 2005:
                yield entry
This would retrieve only entries modified in 2005 and the following years. Entries are dict objects extracted from the sitemap document. Usually, the key is the tag name and the value is the text inside it.
It’s important to notice that:
as the loc attribute is required, entries without this tag are discarded
alternate links are stored in a list with the key alternate (see sitemap_alternate_links)
namespaces are removed, so lxml tags named as {namespace}tagname become only tagname
SitemapSpider examples: process all urls discovered through sitemaps using the parse callback
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']

    def parse(self, response):
        pass  # ... scrape item here ...
Process some urls with certain callback and other urls with a different callback:
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    sitemap_rules = [
        ('/product/', 'parse_product'),
        ('/category/', 'parse_category'),
    ]

    def parse_product(self, response):
        pass  # ... scrape product ...

    def parse_category(self, response):
        pass  # ... scrape category ...
Follow sitemaps defined in the robots.txt file and only follow sitemaps whose url contains /sitemap_shop:
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/robots.txt']
    sitemap_rules = [
        ('/shop/', 'parse_shop'),
    ]
    sitemap_follow = ['/sitemap_shops']

    def parse_shop(self, response):
        pass  # ... scrape shop here ...
Combine SitemapSpider with other sources of urls:
import scrapy
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/robots.txt']
    sitemap_rules = [
        ('/shop/', 'parse_shop'),
    ]
    other_urls = ['http://www.example.com/about']

    def start_requests(self):
        requests = list(super(MySpider, self).start_requests())
        requests += [scrapy.Request(x, self.parse_other) for x in self.other_urls]
        return requests

    def parse_shop(self, response):
        pass  # ... scrape shop here ...

    def parse_other(self, response):
        pass  # ... scrape other here ...
class scrapy.spiders.CSVFeedSpider This spider is very similar to XMLFeedSpider, except that it iterates over rows instead of nodes. The method that gets called on each iteration is parse_row(). delimiter: a string with the separator character for each field in the CSV file. Defaults to ','.
quotechar: a string with the enclosure character for each field in the CSV file. Defaults to '"'.
headers: a list of the column names in the CSV file.
parse_row(response, row): Receives a response and a dict (representing each row) with a key for each provided (or detected) header of the CSV file. This spider also gives the opportunity to override adapt_response and process_results methods for pre- and post-processing purposes.
example:
from scrapy.spiders import CSVFeedSpider
from myproject.items import TestItem

class MySpider(CSVFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.csv']
    delimiter = ';'
    quotechar = "'"
    headers = ['id', 'name', 'description']

    def parse_row(self, response, row):
        self.logger.info('Hi, this is a row!: %r', row)
        item = TestItem()
        item['id'] = row['id']
        item['name'] = row['name']
        item['description'] = row['description']
        return item
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
    <book>
        <title lang="eng">Harry Potter</title>
        <price>29.99</price>
    </book>
    <book>
        <title lang="eng">Learning XML</title>
        <price>39.95</price>
    </book>
</bookstore>
Scrapy comes with its own mechanism for extracting data. They’re called selectors because they “select” certain parts of the HTML document specified either by XPath or CSS expressions.
Querying responses using XPath and CSS is so common that responses include two more shortcuts: response.xpath() and response.css():
>>> response.xpath('//span/text()').get()
'good'
>>> response.css('span::text').get()
'good'
'Selector' example:
>>> from scrapy.selector import Selector
>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').get()
'good'
Constructing from response - HtmlResponse is one of TextResponse subclasses:
>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
>>> response = HtmlResponse(url='http://example.com', body=body)
>>> Selector(response=response).xpath('//span/text()').get()
'good'
Selector automatically chooses the best parsing rules (XML vs HTML) based on input type.
scrapy shell https://docs.scrapy.org/en/latest/_static/selectors-sample1.html
Type the following in the shell:
>>> response.xpath('//title/text()').getall()
['Example website']
>>> response.xpath('//title/text()').get()
'Example website'
>>> response.css('title::text').get()
'Example website'
get() returns a single result: if there are multiple matches it returns the first one, and if there is no match it returns None. getall() returns a list with all results.
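The get()/getall() contract can be emulated with a plain list standing in for a SelectorList (get/getall here are sketches of the semantics, not parsel's implementation):

```python
def get(results, default=None):
    # first match, or the default (None) when nothing matched
    return results[0] if results else default

def getall(results):
    # all matches, always as a list
    return list(results)

print(get(['good', 'better']))       # 'good'
print(get([]))                       # None
print(get([], default='not-found'))  # 'not-found'
print(getall(['a', 'b']))            # ['a', 'b']
```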
The .xpath() and .css() methods return a SelectorList instance, which is a list of new selectors. This API can be used to quickly select nested data:
>>> response.css('img').xpath('@src').getall()
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']
A default return value can be supplied as an argument, to be returned instead of None:
>>> response.xpath('//div[@id="not-exists"]/text()').get(default='not-found')
'not-found'
You can also query attributes via the selector's attrib property:
>>> [img.attrib['src'] for img in response.css('img')]
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']
To select text nodes, use ::text; to select attribute values, use ::attr(name).
title::text selects the child text node of a descendant <title> element:
>>> response.css('title::text').get()
'Example website'
*::text selects all descendant text nodes of the current selector context:
>>> response.css('#images *::text').getall()
['\n ',
'Name: My image 1 ',
'\n ',
'Name: My image 2 ',
'\n ',
'Name: My image 3 ',
'\n ',
'Name: My image 4 ',
'\n ',
'Name: My image 5 ',
'\n ']
foo::text returns no results if the foo element exists but contains no text (i.e. the text is empty):
>>> response.css('img::text').getall()
[]
Use default='' to get an empty string instead of None:
>>> response.css('img::text').get()
>>> response.css('img::text').get(default='')
''
a::attr(href) selects the href attribute value of descendant links:
>>> response.css('a::attr(href)').getall()
['image1.html',
'image2.html',
'image3.html',
'image4.html',
'image5.html']
The selection methods (.xpath() or .css()) return a list of selectors of the same type, so you can call the selection methods on those selectors as well.
>>> links = response.xpath('//a[contains(@href, "image")]')
>>> links.getall()
['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
'<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
'<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
'<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
'<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']
>>> for index, link in enumerate(links):
... args = (index, link.xpath('@href').get(), link.xpath('img/@src').get())
... print('Link number %d points to url %r and image %r' % args)
Link number 0 points to url 'image1.html' and image 'image1_thumb.jpg'
Link number 1 points to url 'image2.html' and image 'image2_thumb.jpg'
Link number 2 points to url 'image3.html' and image 'image3_thumb.jpg'
Link number 3 points to url 'image4.html' and image 'image4_thumb.jpg'
Link number 4 points to url 'image5.html' and image 'image5_thumb.jpg'
The .attrib property exposes the selector's attributes, for looking up attributes in code:
>>> [a.attrib['href'] for a in response.css('a')]
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
>>> response.css('base').attrib
{'href': 'http://example.com/'}
>>> response.css('base').attrib['href']
'http://example.com/'
>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
['My image 1',
'My image 2',
'My image 3',
'My image 4',
'My image 5']
.re_first() gets only the first match:
>>> response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)')
'My image 1'
An XPath starting with / is absolute to the document, not relative to the Selector you call it from.
Selecting by class with contains(@class, 'someclass') compensates for elements that carry several classes, but it can also match more elements than you want (any class name containing that string matches); the robust XPath form is:
*[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]
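The verbose expression above is mechanical enough to generate with a small helper (has_class_xpath is a hypothetical name, not a Scrapy or parsel API):

```python
def has_class_xpath(cls):
    # Build the robust XPath class test: pad @class with spaces so that
    # 'someclass' does not accidentally match 'someclass2'.
    return ("*[contains(concat(' ', normalize-space(@class), ' '), "
            "' %s ')]" % cls)

print(has_class_xpath('someclass'))
```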
Alternatively, you can locate elements with CSS first, then chain an XPath query:
>>> from scrapy import Selector
>>> sel = Selector(text='<div class="hero shout"><time datetime="2014-07-23 19:00">Special date</time></div>')
>>> sel.css('.shout').xpath('./time/@datetime').getall()
['2014-07-23 19:00']
//node[1] selects all nodes that occur first under their respective parents. (//node)[1] selects all nodes in the document, then takes only the first of them.
When you need to use text content as an argument to an XPath string function, avoid .//text() and use . instead. This is because the expression .//text() yields a collection of text elements — a node-set — and when a node-set is converted to a string, as happens when it is passed as an argument to a string function like contains() or starts-with(), only the text of the first element is produced.
>>> from scrapy import Selector
>>> sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')
>>> sel.xpath('//a//text()').getall() # take a peek at the node-set
['Click here to go to the ', 'Next Page']
>>> sel.xpath("string(//a[1]//text())").getall() # convert it to string
['Click here to go to the ']
>>> sel.xpath("//a[1]").getall() # select the first node
['<a href="#">Click here to go to the <strong>Next Page</strong></a>']
>>> sel.xpath("string(//a[1])").getall() # convert it to string
['Click here to go to the Next Page']
>>> sel.xpath("//a[contains(.//text(), 'Next Page')]").getall()
[]
>>> sel.xpath("//a[contains(., 'Next Page')]").getall()
['<a href="#">Click here to go to the <strong>Next Page</strong></a>']
XPath allows you to reference variables in your expressions, using the $somevariable syntax. All variable references must have a binding value.
>>> # `$val` used in the expression, a `val` argument needs to be passed
>>> response.xpath('//div[@id=$val]/a/text()', val='images').get()
'Name: My image 1 '
>>> response.xpath('//div[count(a)=$cnt]/@id', cnt=5).get()
'images'
Given this document:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet ...
<feed xmlns="http://www.w3.org/2005/Atom"
xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/"
xmlns:blogger="http://schemas.google.com/blogger/2008"
xmlns:georss="http://www.georss.org/georss"
xmlns:gd="http://schemas.google.com/g/2005"
xmlns:thr="http://purl.org/syndication/thread/1.0"
xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
...
Because of the xmlns declarations, trying to select all <link> objects returns empty (the Atom XML namespace is obfuscating those nodes):
>>> response.xpath("//link")
[]
Use the Selector.remove_namespaces() method:
>>> response.selector.remove_namespaces()
>>> response.xpath("//link")
[<Selector xpath='//link' data='<link rel="alternate" type="text/html" h'>,
<Selector xpath='//link' data='<link rel="next" type="application/atom+'>,
...
Namespace removal is not performed by default for two reasons: 1. it is expensive — removing namespaces requires iterating over and modifying all nodes in the document; 2. namespaces are sometimes actually needed, in case some element names clash between namespaces.
class scrapy.selector.Selector(response=None, text=None, type=None, root=None, **kwargs)
text is a Unicode string or UTF-8 encoded text, for cases when a response isn't available. type defines the selector type; it can be "html", "xml" or None (default). If type is None and a response is passed, the selector type is inferred from the response type as follows:
"html" for HtmlResponse type
"xml" for XmlResponse type
"html" for anything else
Otherwise, if type is set, the selector type will be forced and no detection will occur.
xpath(query, namespaces=None, **kwargs) Find nodes matching the xpath query and return the result as a SelectorList instance with all elements flattened.
css(query) Apply the given CSS selector and return a SelectorList instance.
attrib Return the attributes dictionary for underlying element.
re(regex, replace_entities=True) Apply the given regex and return a list of strings with the matches. By default, character entity references are replaced by their corresponding character (except for & and <). Passing replace_entities as False switches off these replacements.
register_namespace(prefix, uri) Register the given namespace to be used in this Selector. Without registering namespaces you can’t select or extract data from non-standard namespaces
class scrapy.selector.SelectorList
The SelectorList class is a subclass of the builtin list class, which provides a few additional methods.
It provides largely the same methods as the Selector class.
sel = Selector(xml_response)
sel.register_namespace("g", "http://base.google.com/ns/1.0")
sel.xpath("//g:price").getall()
I think of a namespace as a kind of key-value storage: the value is the original URI, and the prefix is a convenient shorthand for a long, repeated address.
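That key-value intuition can be checked with the standard library's xml.etree.ElementTree, which accepts exactly such a prefix-to-URI dict (a sketch with a made-up feed, using the stdlib rather than Scrapy selectors):

```python
import xml.etree.ElementTree as ET

# A toy feed reusing the "g" namespace from the example above.
XML = (
    '<feed xmlns:g="http://base.google.com/ns/1.0">'
    '<g:price>10</g:price><g:price>20</g:price>'
    '</feed>'
)
root = ET.fromstring(XML)

# The prefix -> URI dict plays the role of register_namespace():
# the short key "g" stands in for the long, repeated URI value.
NAMESPACES = {'g': 'http://base.google.com/ns/1.0'}
prices = [el.text for el in root.findall('g:price', NAMESPACES)]
```

Here prices collects the text of every g:price element, just like sel.xpath("//g:price").getall() in the Scrapy example.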
import scrapy
class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    tags = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)
>>> product = Product(name='Desktop PC', price=1000)
>>> print(product)
Product(name='Desktop PC', price=1000)
>>> product['name']
'Desktop PC'
>>> product.get('name')
'Desktop PC'
>>> product['price']
1000
>>> product['last_updated']
Traceback (most recent call last):
...
KeyError: 'last_updated'
>>> product.get('last_updated', 'not set')
'not set'
>>> product['lala'] # getting unknown field
Traceback (most recent call last):
...
KeyError: 'lala'
>>> product.get('lala', 'unknown field')
'unknown field'
>>> 'name' in product # is name field populated?
True
>>> 'last_updated' in product # is last_updated populated?
False
>>> 'last_updated' in product.fields # is last_updated a declared field?
True
>>> 'lala' in product.fields # is lala a declared field?
False
# setting field values
>>> product['last_updated'] = 'today'
>>> product['last_updated']
'today'
# all populated values
>>> product.keys()
['price', 'name']
>>> product.items()
[('price', 1000), ('name', 'Desktop PC')]
copying items
shallow copy
product2 = product.copy()
deep copy
product2 = product.deepcopy()
A shallow copy copies references, so the copy and the original share nested values; a deep copy copies content: it allocates a new place in memory, gets a new address, and stores an independent copy of the values there.
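With plain dicts standing in for items (copy.copy and copy.deepcopy behave analogously to item.copy() and item.deepcopy() for the field values), the difference looks like this:

```python
import copy

# A plain dict stands in for a Scrapy item in this sketch.
product = {'name': 'Desktop PC', 'tags': ['pc', 'desktop']}
shallow = copy.copy(product)
deep = copy.deepcopy(product)

# Mutate the nested list through the original...
product['tags'].append('cheap')

# ...the shallow copy sees the change (shared reference),
# while the deep copy keeps its own independent content.
```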
Extending Items You can extend Items (to add more fields or to change some metadata for some fields) by declaring a subclass of your original Item.
#add
class DiscountedProduct(Product):
    discount_percent = scrapy.Field(serializer=str)
    discount_expiration_date = scrapy.Field()
#change
class SpecificProduct(Product):
    name = scrapy.Field(Product.fields['name'], serializer=my_serializer)
Scrapy 1.8
class scrapy.item.Field([arg]) The Field class is just an alias to the built-in dict class and doesn’t provide any extra functionality or attributes.
class scrapy.item.BaseItem
Base class for all scraped items. In Scrapy, an object is considered an item if it is an instance of either BaseItem or dict. For example, when the output of a spider callback is evaluated, only instances of BaseItem or dict are passed to item pipelines. If you need instances of a custom class to be considered items by Scrapy, you must inherit from either BaseItem or dict. Unlike instances of dict, instances of BaseItem may be tracked to debug memory leaks.
class scrapy.item.ItemMeta Metaclass of Item that handles field definitions.
Scrapy 1.7
Item Loaders are designed to provide a flexible, efficient and easy mechanism for extending and overriding different field parsing rules, either by spider, or by source format (HTML, XML, etc) without becoming a nightmare to maintain.
from scrapy.loader import ItemLoader
from myproject.items import Product
def parse(self, response):
    l = ItemLoader(item=Product(), response=response)
    l.add_xpath('name', '//div[@class="product_name"]')
    l.add_xpath('name', '//div[@class="product_title"]')
    l.add_xpath('price', '//p[@id="price"]')
    l.add_css('stock', 'p#stock')
    l.add_value('last_updated', 'today')  # you can also use literal values
    return l.load_item()
l = ItemLoader(Product(), some_selector)
l.add_xpath('name', xpath1) # (1)
l.add_xpath('name', xpath2) # (2)
l.add_css('name', css) # (3)
l.add_value('name', 'test') # (4)
return l.load_item() # (5)
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose, Join
class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()
    name_in = MapCompose(unicode.title)
    name_out = Join()
    price_in = MapCompose(unicode.strip)
    # ...
import scrapy
from scrapy.loader.processors import Join, MapCompose, TakeFirst
from w3lib.html import remove_tags
def filter_price(value):
    if value.isdigit():
        return value

class Product(scrapy.Item):
    name = scrapy.Field(
        input_processor=MapCompose(remove_tags),
        output_processor=Join(),
    )
    price = scrapy.Field(
        input_processor=MapCompose(remove_tags, filter_price),
        output_processor=TakeFirst(),
    )
>>> from scrapy.loader import ItemLoader
>>> il = ItemLoader(item=Product())
>>> il.add_value('name', [u'Welcome to my', u'<strong>website</strong>'])
>>> il.add_value('price', [u'€', u'<span>1000</span>'])
>>> il.load_item()
{'name': u'Welcome to my website', 'price': u'1000'}
The u prefix (as in u'sdf') means the string that follows is stored in Unicode format. The priority order of input and output processors is as follows:
1. Item Loader field-specific attributes: field_in and field_out (highest priority). 2. Field metadata (the input_processor and output_processor keys). 3. Item Loader defaults: ItemLoader.default_input_processor() and ItemLoader.default_output_processor() (lowest priority).
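The lookup order can be sketched as a small standalone function (resolve_input_processor is a made-up name for illustration; Scrapy's real resolution lives inside ItemLoader):

```python
def resolve_input_processor(field_name, loader_attrs, field_meta, default):
    # 1. Item Loader field-specific attribute: <field>_in (highest priority)
    if field_name + '_in' in loader_attrs:
        return loader_attrs[field_name + '_in']
    # 2. Field metadata: the input_processor key
    if 'input_processor' in field_meta:
        return field_meta['input_processor']
    # 3. Item Loader default (lowest priority)
    return default
```

The same three-step fallback applies symmetrically for output processors (field_out, output_processor, default_output_processor).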
The Item Loader Context is a dict of arbitrary key/values which is shared among all input and output processors in the Item Loader. It is used to modify the behaviour of the input/output processors.
def parse_length(text, loader_context):
    unit = loader_context.get('unit', 'm')
    # one possible parsing implementation: '100 cm' -> 100.0
    parsed_length = float(text.replace(unit, '').strip())
    return parsed_length
Several ways to modify the Item Loader context values:
# By modifying the currently active Item Loader context (the context attribute):
loader = ItemLoader(product)
loader.context['unit'] = 'cm'
# On Item Loader instantiation (the keyword arguments of the Item Loader constructor are stored in the Item Loader context):
loader = ItemLoader(product, unit='cm')
# In the Item Loader declaration, for those input/output processors that support being instantiated with an Item Loader context. MapCompose is one of them:
class ProductLoader(ItemLoader):
    length_out = MapCompose(parse_length, unit='cm')
class scrapy.loader.ItemLoader([item, selector, response, ]**kwargs) Return a new Item Loader for populating the given Item. If no item is given, one is instantiated automatically using the class in default_item_class.
parameters: item-The item instance to populate using subsequent calls to add_xpath(), add_css(), or add_value(). Selector-The selector to extract data from, when using the add_xpath() (resp. add_css()) or replace_xpath() (resp. replace_css()) method. response-The response used to construct the selector using the default_selector_class, unless the selector argument is given, in which case this argument is ignored.
methods: get_value(value, *processors, **kwargs) Process the given value by the given processors and keyword arguments. Available keyword arguments: Parameters: re (str or compiled regex) – a regular expression to use for extracting data from the given value using extract_regex() method, applied before processors
>>> from scrapy.loader.processors import TakeFirst
>>> loader.get_value(u'name: foo', TakeFirst(), unicode.upper, re='name: (.+)')
'FOO'
add_value(field_name, value, *processors, **kwargs) Process and then add the given value for the given field. The value is first passed through get_value() with the given processors and kwargs, then through the field input processor, and the result is appended to the data collected for that field. If the field already contains collected data, the new data is added.
The given field_name can be None, in which case values for multiple fields may be added. In that case, the processed value should be a dict with field names mapped to values.
loader.add_value('name', u'Color TV')
loader.add_value('colours', [u'white', u'blue'])
loader.add_value('length', u'100')
loader.add_value('name', u'name: foo', TakeFirst(), re='name: (.+)')
loader.add_value(None, {'name': u'foo', 'sex': u'male'})
replace_value(field_name, value, *processors, **kwargs) Similar to add_value() but replaces the collected data with the new value instead of adding it.
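A toy model (not Scrapy internals; collected, add_value and replace_value here are standalone stand-ins) makes the append-vs-replace contrast concrete:

```python
# Per-field collected data, as a plain dict of lists.
collected = {}

def add_value(field, value):
    # appends to whatever is already collected for the field
    values = value if isinstance(value, list) else [value]
    collected.setdefault(field, []).extend(values)

def replace_value(field, value):
    # discards previously collected data for the field
    collected[field] = value if isinstance(value, list) else [value]

add_value('name', 'Color TV')
add_value('name', 'Plasma TV')     # added alongside the first value
add_value('colours', ['white'])
add_value('colours', ['blue'])     # lists are extended, not nested
replace_value('name', 'OLED TV')   # previous names are discarded
```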
get_xpath(xpath, *processors, **kwargs) is used to extract a list of unicode strings from the selector associated with this ItemLoader. Parameters: xpath (str) – the XPath to extract data from re (str or compiled regex) – a regular expression to use for extracting data from the selected XPath region
# HTML snippet: <p class="product-name">Color TV</p>
loader.get_xpath('//p[@class="product-name"]')
# HTML snippet: <p id="price">the price is $1200</p>
loader.get_xpath('//p[@id="price"]', TakeFirst(), re='the price is (.*)')
add_xpath(field_name, xpath, *processors, **kwargs) Similar to add_value() but receives an XPath instead of a value, which is used to extract a list of unicode strings from the selector associated with this ItemLoader.
# HTML snippet: <p class="product-name">Color TV</p>
loader.add_xpath('name', '//p[@class="product-name"]')
# HTML snippet: <p id="price">the price is $1200</p>
loader.add_xpath('price', '//p[@id="price"]', re='the price is (.*)')
get_css(css, *processors, **kwargs)
# HTML snippet: <p class="product-name">Color TV</p>
loader.get_css('p.product-name')
# HTML snippet: <p id="price">the price is $1200</p>
loader.get_css('p#price', TakeFirst(), re='the price is (.*)')
add_css(field_name, css, *processors, **kwargs)
# HTML snippet: <p class="product-name">Color TV</p>
loader.add_css('name', 'p.product-name')
# HTML snippet: <p id="price">the price is $1200</p>
loader.add_css('price', 'p#price', re='the price is (.*)')
replace_css(field_name, css, *processors, **kwargs)
load_item() Populate the item with the data collected so far, and return it.
nested_xpath(xpath) Create a nested loader with an xpath selector.
get_collected_values(field_name) Return the collected values for the given field.
get_output_value(field_name) Return the collected values parsed using the output processor, for the given field.
get_input_processor(field_name) Return the input processor for the given field.
get_output_processor(field_name) Return the output processor for the given field.
attributes: item:The Item object being parsed by this Item Loader.
context The currently active Context of this Item Loader.
default_item_class An Item class (or factory), used to instantiate items when not given in the constructor.
default_input_processor The default input processor to use for those fields which don't specify one.
default_output_processor The default output processor to use for those fields which don't specify one.
default_selector_class The class used to construct the selector of this ItemLoader, if only a response is given in the constructor. If a selector is given in the constructor, this attribute is ignored. This attribute is sometimes overridden in subclasses.
selector The Selector object from which to extract data. It is either the selector given in the constructor, or one created from the response given in the constructor using the default_selector_class. This attribute is read-only.
When parsing related values from a subsection of a document, it can be useful to create nested loaders. example:
<footer>
<a class="social" href="https://facebook.com/whatever">Like Us</a>
<a class="social" href="https://twitter.com/whatever">Follow Us</a>
<a class="email" href="mailto:whatever@example.com">Email Us</a>
</footer>
no nested loaders
loader = ItemLoader(item=Item())
# load stuff not in the footer
loader.add_xpath('social', '//footer/a[@class = "social"]/@href')
loader.add_xpath('email', '//footer/a[@class = "email"]/@href')
loader.load_item()
nested loaders
loader = ItemLoader(item=Item())
# load stuff not in the footer
footer_loader = loader.nested_xpath('//footer')
footer_loader.add_xpath('social', 'a[@class = "social"]/@href')
footer_loader.add_xpath('email', 'a[@class = "email"]/@href')
# no need to call footer_loader.load_item()
loader.load_item()
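The footer example above can be mimicked with the standard library's xml.etree.ElementTree (a sketch, not Scrapy's selector machinery): select the base node once, then run short relative queries against it, which is exactly what nested_xpath lets you do.

```python
import xml.etree.ElementTree as ET

# The footer snippet from above, wrapped in a body for context.
HTML = (
    '<body><p>other content</p>'
    '<footer>'
    '<a class="social" href="https://facebook.com/whatever">Like Us</a>'
    '<a class="social" href="https://twitter.com/whatever">Follow Us</a>'
    '<a class="email" href="mailto:whatever@example.com">Email Us</a>'
    '</footer></body>'
)
root = ET.fromstring(HTML)

# Like loader.nested_xpath('//footer'): select the base node once...
footer = root.find('.//footer')
# ...then run short, relative queries against it.
social = [a.get('href') for a in footer.findall("a[@class='social']")]
email = [a.get('href') for a in footer.findall("a[@class='email']")]
```

The benefit is the same as with nested loaders: the long //footer prefix is written once instead of being repeated in every query.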
Suppose a site wraps its field values in dashes (e.g. ---Plasma TV---); you can remove those dashes by reusing and extending the default Product Item Loader (ProductLoader):
from scrapy.loader.processors import MapCompose
from myproject.ItemLoaders import ProductLoader
def strip_dashes(x):
    return x.strip('-')

class SiteSpecificLoader(ProductLoader):
    name_in = MapCompose(strip_dashes, ProductLoader.name_in)
class scrapy.loader.processors.Identity The simplest processor, which doesn't do anything: it returns the original values unchanged. It doesn't receive any constructor arguments, nor does it accept Loader contexts.
>>> from scrapy.loader.processors import Identity
>>> proc = Identity()
>>> proc(['one', 'two', 'three'])
['one', 'two', 'three']
class scrapy.loader.processors.TakeFirst Returns the first non-null/non-empty value from the values received, so it's typically used as an output processor for single-valued fields. It doesn't receive any constructor arguments, nor does it accept Loader contexts.
>>> from scrapy.loader.processors import TakeFirst
>>> proc = TakeFirst()
>>> proc(['', 'one', 'two', 'three'])
'one'
class scrapy.loader.processors.Join(separator=u' ') Returns the values joined with the separator given in the constructor, which defaults to u' '. It doesn't accept Loader contexts. When using the default separator, this processor is equivalent to the function u' '.join.
>>> from scrapy.loader.processors import Join
>>> proc = Join()
>>> proc(['one', 'two', 'three'])
'one two three'
>>> proc = Join('<br>')
>>> proc(['one', 'two', 'three'])
'one<br>two<br>three'
class scrapy.loader.processors.Compose(*functions, **default_loader_context) A processor composed of the given functions. Each input value of this processor is passed to the first function, the result of that function is passed to the second function, and so on, until the last function returns the output value of this processor. By default, processing stops on a None value. This behaviour can be changed by passing the keyword argument stop_on_none=False.
>>> from scrapy.loader.processors import Compose
>>> proc = Compose(lambda v: v[0], str.upper)
>>> proc(['hello', 'world'])
'HELLO'
class scrapy.loader.processors.MapCompose(*functions, **default_loader_context) The input value of this processor is iterated, and the first function is applied to each element. The results of these function calls (one per element) are concatenated to construct a new iterable, which is then used to apply the second function, and so on, until the last function has been applied to each value of the list of values collected so far. The output values of the last function are concatenated together to produce the output of this processor.
Compose receives the whole list at once, while MapCompose takes the elements out of the list and passes them in one by one.
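That difference can be sketched with simplified re-implementations (assumptions for illustration only, not Scrapy's actual code):

```python
def compose(*functions):
    # The whole value (e.g. the whole list) flows through each function in turn.
    def processor(values):
        for fn in functions:
            if values is None:  # mirrors the default stop-on-None behaviour
                break
            values = fn(values)
        return values
    return processor

def map_compose(*functions):
    # Each function is applied element by element; None results are dropped.
    def processor(values):
        for fn in functions:
            values = [r for r in (fn(v) for v in values) if r is not None]
        return values
    return processor

whole = compose(lambda v: v[0], str.upper)  # sees ['hello', 'world'] as one value
each = map_compose(str.upper)               # sees 'hello' then 'world'
```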
class scrapy.loader.processors.SelectJmes(json_path) Queries the value using the JSON path provided to the constructor and returns the output. It requires the jmespath library to run.
>>> from scrapy.loader.processors import SelectJmes, Compose, MapCompose
>>> proc = SelectJmes("foo") #for direct use on lists and dictionaries
>>> proc({'foo': 'bar'})
'bar'
>>> proc({'foo': {'bar': 'baz'}})
{'bar': 'baz'}
>>> proc_json_list = Compose(json.loads, MapCompose(SelectJmes('foo')))
>>> proc_json_list('[{"foo":"bar"}, {"baz":"tar"}]')
['bar']
The official docs recommend installing IPython to use in place of the plain Python shell:
pip install ipython
You can set the SCRAPY_PYTHON_SHELL environment variable, or configure the shell in scrapy.cfg:
[settings]
shell = bpython
scrapy shell <url>
local file
# UNIX-style
scrapy shell ./path/to/file.html
scrapy shell ../other/path/to/file.html
scrapy shell /absolute/path/to/file.html
# File URI
scrapy shell file:///absolute/path/to/file.html
shortcuts:
shelp() - print a help with the list of available objects and shortcuts
fetch(url[, redirect=True]) - fetch a new response from the given URL and update all related objects accordingly. You can optionally ask for HTTP 3xx redirects not to be followed by passing redirect=False.
fetch(request) - fetch a new response from the given request and update all related objects accordingly.
view(response) - open the given response in your local web browser, for inspection. This will add a <base> tag to the response body so that external links (such as images and style sheets) display properly.
Scrapy objects: crawler - the current Crawler object. spider - the Spider which is known to handle the URL, or a Spider object if there is no spider found for the current URL request - a Request object of the last fetched page. You can modify this request using replace() or fetch a new request (without leaving the shell) using the fetch shortcut. response - a Response object containing the last fetched page settings - the current Scrapy settings
After an item has been scraped by a spider, it is sent to the Item Pipeline. In other words, this is where the scraped data gets processed.
Each item pipeline component is a Python class that must implement the following method: process_item(self, item, spider). This method is called for every item pipeline component. process_item() must either: return a dict with data, return an Item (or any descendant class) object, return a Twisted Deferred, or raise a DropItem exception. Dropped items are no longer processed by further pipeline components. Parameters: item (Item object or a dict) – the item scraped; spider (Spider object) – the spider which scraped the item. Components may additionally implement the following methods: open_spider(self, spider): called when the spider is opened.
close_spider(self, spider): called when the spider is closed.
from_crawler(cls, crawler): if present, this classmethod is called to create a pipeline instance from a Crawler. It must return a new instance of the pipeline. The Crawler object provides access to all Scrapy core components, like settings and signals; it is a way for the pipeline to access them and hook its functionality into Scrapy.
Price validation and dropping items with no prices:
from scrapy.exceptions import DropItem
class PricePipeline(object):

    vat_factor = 1.15

    def process_item(self, item, spider):
        if item.get('price'):
            if item.get('price_excludes_vat'):
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            raise DropItem("Missing price in %s" % item)
Write items to a JSON file:
import json
class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
Write items to MongoDB:
import pymongo
class MongoPipeline(object):

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item
Duplicates filter: a filter that looks for duplicate items and drops items that have already been processed. Let's say our items have a unique id, but our spider returns multiple items with the same id:
from scrapy.exceptions import DropItem
class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
To activate an Item Pipeline component, you must add its class to the ITEM_PIPELINES setting, as in the following example:
ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}
The integer values you assign to classes in this setting determine the order they run in: items go through pipelines from lower-valued to higher-valued classes. It's customary to define these numbers in the 0-1000 range.
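The ordering rule amounts to a simple sort by priority value, which can be sketched in plain Python (the pipeline class paths below are the illustrative ones from the example):

```python
# Deliberately listed out of order to show that only the integers matter.
ITEM_PIPELINES = {
    'myproject.pipelines.JsonWriterPipeline': 800,
    'myproject.pipelines.PricePipeline': 300,
}

# Lower values run first, so PricePipeline precedes JsonWriterPipeline.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
```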
Hello, a quick question. After editing the dict contents to scrape the specified page and save it as a JSON file, I found the generated JSON file was empty. I deleted it and retried several times, used your code as well, and also switched to .jl and .csv, with the same result. Have you ever run into this? Thanks.
Hi, I'm not sure which part you mean. Could you post a screenshot of the code, or paste the code block directly in the comments? I'll take a look; I just tried it myself.
That code is wrong: the ::text part shouldn't be there. Remove it and it works.
Understanding Scrapy
Scrapy is an application framework for crawling websites and extracting structured data, which can be used for a wide range of useful applications like data mining, information processing, or historical archival.
Creating a project
Run in cmd:
scrapy startproject tutorial
, which creates a new project in a tutorial directory:
tutorial/
scrapy.cfg (deploy configuration file)
tutorial/ (the project's Python module; you import your code from here)
__init__.py
items.py (project items definition file)
middlewares.py (project middlewares file)
pipelines.py (project pipelines file)
settings.py (project settings file)
spiders/ (directory where the spiders go)
__init__.py
The first spider
Save it in a file named quotes_spider.py under the tutorial/spiders directory of the project.
name: identifies the spider; it must be unique within a project. start_requests(): must return an iterable of Requests (a list of requests, or a generator function) which the spider will begin to crawl from. parse(): handles the response downloaded for each of the requests made, parsing the response.
Run:
scrapy crawl quotes
quotes is the value of the name attribute in quotes_spider.py; the crawl command uses it as a shortcut to find the spider and launch its requests.
Compared with the code above, this version drops the start_requests method and pulls the urls out into a start_urls attribute; that works because parse is Scrapy's default parsing callback.
Extracting data
The best way to learn how to extract data with Scrapy is to try selectors in the Scrapy shell. On Windows:
scrapy shell "http://quotes.toscrape.com/page/1/"
Then use CSS expressions to extract data. Get the text content of the title tag:
response.css('title::text').getall()
Get the full title tag, markup included:
response.css('title').getall()
getall() returns all results, since there may be more than one. To get just the first result, use:
response.css('title::text').get()
An alternative to the previous method:
response.css('title::text')[0].get()
However, using get() directly is preferred, because it avoids an IndexError and returns None when there is no match.
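The difference is easy to see with a plain list standing in for an empty SelectorList (get_first is a made-up helper that mirrors the described .get() behaviour):

```python
# An empty list stands in for a SelectorList with no matches.
results = []

def get_first(seq, default=None):
    # mirrors .get(): returns None on no match instead of raising
    return seq[0] if seq else default

first = get_first(results)   # None, no exception

try:
    _ = results[0]           # the [0].get() style raises when nothing matched
    raised = False
except IndexError:
    raised = True
```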
Besides getall() and get(), you can also extract with regular expressions using re().
Extracting data in our spider
Use the Python yield keyword with CSS selectors to yield a dict containing the text, author, and tags of each quote on the page.
storing the scraped data
scrapy crawl quotes -o quotes.json
That will generate a quotes.json file containing all scraped items, serialized in JSON. If you run this command twice without removing the file before the second run, you'll end up with a broken JSON file. Another format is JSON Lines:
scrapy crawl quotes -o quotes.jl
It's stream-like, so you can easily append new records to it. It doesn't have the same problem as JSON when you run the command twice, and since each record is a separate line, you can process big files without having to fit everything in memory.
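The append-friendliness of JSON Lines is easy to demonstrate with the standard library alone (an in-memory buffer stands in for quotes.jl on disk):

```python
import io
import json

# First "run": write two records, one JSON object per line.
buf = io.StringIO()
for item in [{'text': 'quote one'}, {'text': 'quote two'}]:
    buf.write(json.dumps(item) + '\n')

# Second "run" can simply append; each record is its own line,
# so the file never becomes broken the way a single JSON array would.
buf.write(json.dumps({'text': 'quote three'}) + '\n')

# Reading back: parse line by line, no need to load everything at once.
records = [json.loads(line) for line in buf.getvalue().splitlines()]
```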
Extracting data from all pages
We can use a CSS selector to get the next page, then use scrapy.Request with that next-page URL to keep the crawl going through all the pages. This creates a sort of loop, following all the links to the next page until none is found - handy for crawling blogs, forums, and other sites with pagination.
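A plain-Python simulation of that loop (the page paths and the next-link mapping are made up, standing in for what the CSS selector would extract from each response):

```python
# Toy site: each page maps to its next-page link, None on the last page.
next_links = {'/page/1/': '/page/2/', '/page/2/': '/page/3/', '/page/3/': None}

crawled = []
page = '/page/1/'
while page is not None:      # the "sort of loop": follow until no next link
    crawled.append(page)     # a real spider would yield scrapy.Request here
    page = next_links[page]
```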
a shortcut for creating Requests
You can use response.follow in place of scrapy.Request.
response.follow supports relative URLs directly, which means there is no need to call urljoin.
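What response.follow does with a relative URL can be reproduced with the standard library's urljoin (the URLs below are just the tutorial's example site):

```python
from urllib.parse import urljoin

base = 'http://quotes.toscrape.com/page/1/'
# response.follow('/page/2/') performs this join internally,
# which is why relative URLs work without calling urljoin yourself.
absolute = urljoin(base, '/page/2/')
```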
you can also pass a selector to 'response.follow' instead of a string:
For me this makes working with lists of URLs easier.
For <a> elements you can make it even shorter: response.follow uses their href attribute automatically.
Note that you can pass a single selector to response.follow, but not a list of selectors.