
Getting Scrapy Going #31

Open microacup opened 10 years ago

microacup commented 10 years ago

Getting Scrapy Going (August 30-31, 2014)

Started: Beijing, raining, 2014-8-30 22:45:00
Finished: 2014-8-31 01:46:00

0. Environment

python 2.7.8

windows 8.1 x64

If any of the pip installs (or direct installs) below fails with:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-5: 
ordinal not in range(128)

then create a file named sitecustomize.py under Python27\Lib\site-packages; Python runs this file automatically at startup. The cause: pip reads the user directory while installing a package, and when the user directory name contains Chinese characters, the default ascii codec cannot encode it.
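
A common version of that file is just two lines; a sketch, where 'gbk' is an assumption matching a Chinese-locale Windows codepage, so substitute whatever codec your system actually uses:

# sitecustomize.py -- the site module runs this automatically at interpreter startup.
import sys
# Widen the default codec from 'ascii' so non-ASCII (e.g. Chinese) user paths encode.
# 'gbk' is assumed here for a Chinese-locale Windows; adjust to your system.
sys.setdefaultencoding('gbk')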

1. Install pip

a. setuptools:

You must start PowerShell with administrative privileges.
On Windows 8 or later, setuptools can be installed with one simple PowerShell command. Start PowerShell and paste this command:

(Invoke-WebRequest https://bootstrap.pypa.io/ez_setup.py).Content | python -
https://pip.pypa.io/en/latest/installing.html#install-pip

b. python setup.py install:

Download the pip source, unpack it, and install.

Add E:\Python27\Scripts to the PATH environment variable.
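
For the current cmd session, for example:

set PATH=%PATH%;E:\Python27\Scripts

(To make the change permanent, edit PATH in the system environment-variable settings instead.)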

2. Install Twisted (required)

Grab the installer from the official site; it is a point-and-click install.

3. Install w3lib

Homepage: http://pypi.python.org/pypi/w3lib

GitHub: https://github.com/scrapy/w3lib

4. Install zope.interface (required)

Official page: https://pypi.python.org/pypi/zope.interface/4.1.1#downloads

5. Install pyOpenSSL (did not succeed for me, but Scrapy installed anyway)

Official page: https://pypi.python.org/pypi/pyOpenSSL

You may hit error: Setup script exited with error: Unable to find vcvarsall.bat, which means Visual Studio needs to be installed.

If you have Visual Studio 2010 or 2012 installed, the following workaround helps. When running setup.py, Python 2.7 looks for an installed Visual Studio 2008; you can trick it into using a newer Visual Studio by pointing the VS90COMNTOOLS environment variable at the right path before calling setup.py: SET VS90COMNTOOLS=%VS100COMNTOOLS% for Visual Studio 2010, or SET VS90COMNTOOLS=%VS110COMNTOOLS% for Visual Studio 2012.

Some people report that the VS2010 route does not work.

The safer recommendation is still Visual Studio 2008 C++; if you are on 64-bit, be sure to install a Pro edition, because the Express edition has no 64-bit compiler. Alternatively (I tried the following myself and it did not work):

First install MinGW (download: http://sourceforge.net/projects/mingw/files/). In MinGW's installation directory, open the bin folder, find mingw32-make.exe, and make a copy of it named make.exe.

Add the MinGW path to the PATH environment variable; for example, with MinGW installed in D:\MinGW\, add D:\MinGW\bin to PATH. Then open a command window and cd to the directory of the package to be installed;

run the following command to install: python setup.py build --compiler=mingw32 install
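
The same compiler choice can be made permanent, so --compiler does not have to be passed on every install, via a distutils config file. A sketch; the path assumes Python lives in E:\Python27 as above:

# E:\Python27\Lib\distutils\distutils.cfg
[build]
compiler = mingw32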

6. Install lxml (I was not sure whether it was required; it is: Scrapy's selectors are built on lxml's HTML/XML parsing)

Official index: https://pypi.python.org/simple/lxml/

Download: lxml-3.3.6.win-amd64-py2.7.exe

7. Install service_identity

Download: https://pypi.python.org/pypi/service_identity#downloads

8. Install Scrapy

Download and install from source, or simply:

pip install scrapy

9. Create a project

Follow the official tutorial: http://doc.scrapy.org/en/latest/intro/tutorial.html

scrapy startproject tutorial
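
The crawl below uses the dmoz spider built in that tutorial. For context, this is roughly the spider the 0.24-era tutorial has you save as tutorial/spiders/dmoz_spider.py (the comment is mine):

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # Save each page under its second-to-last URL segment, e.g. "Books".
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)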

With the tutorial's dmoz spider in place, running scrapy crawl dmoz gives:

F:\code\python\scrapy-tutorial\tutorial>scrapy crawl dmoz
:0: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'.  Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied.  Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification.  Many valid certificate/hostname mappings may be rejected.
2014-08-31 01:38:12+0800 [scrapy] INFO: Scrapy 0.24.4 started (bot: tutorial)
2014-08-31 01:38:12+0800 [scrapy] INFO: Optional features available: ssl, http11
2014-08-31 01:38:12+0800 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'BOT_NAME': 'tutorial'}
2014-08-31 01:38:13+0800 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-08-31 01:38:14+0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-08-31 01:38:14+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-08-31 01:38:14+0800 [scrapy] INFO: Enabled item pipelines:
2014-08-31 01:38:14+0800 [dmoz] INFO: Spider opened
2014-08-31 01:38:14+0800 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-08-31 01:38:14+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-08-31 01:38:14+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-08-31 01:38:15+0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2014-08-31 01:38:15+0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2014-08-31 01:38:15+0800 [dmoz] INFO: Closing spider (finished)
2014-08-31 01:38:15+0800 [dmoz] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 516,
         'downloader/request_count': 2,
         'downloader/request_method_count/GET': 2,
         'downloader/response_bytes': 16515,
         'downloader/response_count': 2,
         'downloader/response_status_count/200': 2,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2014, 8, 30, 17, 38, 15, 497000),
         'log_count/DEBUG': 4,
         'log_count/INFO': 7,
         'response_received_count': 2,
         'scheduler/dequeued': 2,
         'scheduler/dequeued/memory': 2,
         'scheduler/enqueued': 2,
         'scheduler/enqueued/memory': 2,
         'start_time': datetime.datetime(2014, 8, 30, 17, 38, 14, 102000)}
2014-08-31 01:38:15+0800 [dmoz] INFO: Spider closed (finished)

F:\code\python\scrapy-tutorial\tutorial>

Two files are also generated in the tutorial directory: Books and Resources (written by parse() in the spider above).

10. Deploy to Scrapyd

Docs: http://scrapyd.readthedocs.org/en/latest/overview.html

Download Scrapyd from GitHub and install it (details omitted).

Note: Scrapyd is best installed and used from a bash environment; otherwise deploying from cmd fails with:

'scrapyd' is not recognized as an internal or external command, operable program or batch file.

Usage:

Start.

At a bash prompt, run the command scrapyd:

scrapyd

If you skip this step and deploy right away, you get the following error:

Packing version 1409469352
Deploying to project "tutorial" in http://localhost:6800/addversion.json
Deploy failed: <urlopen error [Errno 10061] >

Deploy.

In cmd, change to the project directory and run the following command to deploy the project defined there:

scrapy deploy

You can also pass arguments to choose the deploy target and project; these targets must be defined beforehand in the scrapy.cfg file, for example:

scrapy.cfg:

[deploy:scrapyd2]
url = http://scrapyd.mydomain.com/api/scrapyd/
username = john
password = secret

cmd command:

scrapy deploy scrapyd2 -p project

After a successful deploy you will see:

F:\code\python\scrapy-tutorial\tutorial>scrapy deploy
Packing version 1409481980
Deploying to project "tutorial" in http://localhost:6800/addversion.json
Server response (200):
{"status": "ok", "project": "tutorial", "version": "1409481980", "spiders": 1, "node_name": "Power"}

The current status can be monitored in a browser at:

http://localhost:6800/

Schedule a spider.

curl http://localhost:6800/schedule.json -d project=default -d spider=somespider

For example:

F:\code\python\scrapy-tutorial\tutorial>curl http://localhost:6800/schedule.json -d project=default -d spider=dmoz
{"status": "ok", "jobid": "d0c2844030ff11e49f53206a8a4b80ec", "node_name": "Power"}
F:\code\python\scrapy-tutorial\tutorial>

The status of current crawl jobs is visible at http://localhost:6800/jobs.
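
The same endpoint can be driven from Python 2 instead of curl. A minimal sketch, assuming Scrapyd is running on its default port 6800 and the tutorial project has been deployed:

# schedule_dmoz.py -- POST to Scrapyd's schedule.json, mirroring the curl call above.
import json
import urllib
import urllib2

data = urllib.urlencode({'project': 'tutorial', 'spider': 'dmoz'})
response = urllib2.urlopen('http://localhost:6800/schedule.json', data)
print json.load(response)  # e.g. {u'status': u'ok', u'jobid': u'...', ...}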

For more information about the API, see the Scrapyd documentation.

11. Common problems

Problem:

Unknown command: crawl

Use "scrapy" to see available commands

Solution:

The command must be run from inside the project directory: cd tutorial

Problem:

Running scrapy crawl dmoz fails with Handler': No module named win32api

or with Handler': DLL load failed: %1 is not a valid Win32 application

Solution:

Probably a version mismatch. For the No module named win32api error, install the pywin32 module, available from http://sourceforge.net/projects/pywin32/; pick the build matching the current Python (2.7), e.g. pywin32-219.win-amd64-py2.7.exe (the version must match your Python build).
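
A quick sanity check after installing (run in cmd; assumes python is on PATH):

python -c "import win32api"

If this prints nothing, the module imports cleanly; both errors above mean the import itself fails.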

Problem:

error: Unable to find vcvarsall.bat

Solution:

As in step 5: use Visual Studio 2008 C++, and on 64-bit be sure to install a Pro edition, because the Express edition has no 64-bit compiler.

Problem:

How do I crawl on a schedule?

Solution:

On Windows, use Task Scheduler to run a batch file periodically; on Linux, use the system scheduler (e.g. crond) to run the crawl. For example, see the sketches below.
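
Both of the following are sketches; the paths, project names, and times are placeholders to adapt.

# crontab entry (Linux): run the dmoz crawl at minute 0 of every hour
0 * * * * cd /path/to/scrapy-tutorial/tutorial && scrapy crawl dmoz

:: crawl.bat (Windows), registered hourly with e.g.:
::   schtasks /create /tn scrapy-dmoz /sc hourly /tr F:\path\to\crawl.bat
cd /d F:\code\python\scrapy-tutorial\tutorial
scrapy crawl dmoz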

References:

http://doc.scrapy.org/en/latest/intro/install.html#intro-install

http://www.cnblogs.com/txw1958/archive/2012/07/12/scrapy_installation_introduce.html

http://my.oschina.net/zhangdapeng89/blog/54407

http://www.crifan.com/while_install_scrapy_error_unable_to_find_vcvarsall_bat/

http://blog.csdn.net/changdejie/article/details/18407979

http://www.kankanews.com/ICkengine/archives/94817.shtml

http://blog.chinaunix.net/uid-24567872-id-3925118.html

http://blog.csdn.net/iefreer/article/details/20677943

http://www.oschina.net/translate/build-website-crawler-based-upon-scrapy

http://blog.jobbole.com/73115/

http://www.itdiffer.com/doc-view-727.html

microacup commented 10 years ago

@zenglzh Questions about getting this going can be filed here.