ramsayleung / jd_spider

Two dumb distributed crawlers
https://ramsayleung.github.io/zh/post/2017/jd_spider/

Error #11

Closed dengerwa closed 6 years ago

dengerwa commented 6 years ago

E:\python\python.exe E:/pythowork/jd_spider-master/jd_spider-master/jd/jd/spiders/jd.py
Traceback (most recent call last):
  File "E:/pythowork/jd_spider-master/jd_spider-master/jd/jd/spiders/jd.py", line 10, in <module>
    from jd.items import ParameterItem
  File "E:\pythowork\jd_spider-master\jd_spider-master\jd\jd\spiders\jd.py", line 10, in <module>
    from jd.items import ParameterItem

What is this?

ramsayleung commented 6 years ago

Please describe your problem clearly: what exactly did you do, how did you start the crawler, and which crawler did you start? Did you install the required MongoDB, Redis, and so on? Otherwise there is no way for me to guess what you have run into.

dengerwa commented 6 years ago

I installed the packages from pip install -r requirements.txt; the only one I did not install is graphite. MongoDB is installed on a separate Ubuntu server. Then I ran jd.py:

E:\pywork\venv1\Scripts\python.exe E:/pywork/jd_spider-master/jd/jd/spiders/jd.py


dengerwa commented 6 years ago

ITEM_PIPELINES = {
    'jd.pipelines.MongoDBPipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 300
}
MONGODB_SERVER = "192.168.1.168"
MONGODB_PORT = 27017
MONGODB_DB = "jindong"

I have already changed the server address and port in settings.py as shown above, and I manually created a jindong database.
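For reference, here is a minimal sketch (not the repo's actual pipeline code) of how a pipeline such as jd.pipelines.MongoDBPipeline typically consumes the MONGODB_* settings above; the method bodies and the collection naming are illustrative assumptions:

import pymongo

class MongoDBPipeline(object):
    @classmethod
    def from_crawler(cls, crawler):
        # Read the values defined in settings.py.
        return cls(
            server=crawler.settings.get('MONGODB_SERVER'),
            port=crawler.settings.getint('MONGODB_PORT'),
            db=crawler.settings.get('MONGODB_DB'),
        )

    def __init__(self, server, port, db):
        self.client = pymongo.MongoClient(server, port)
        self.db = self.client[db]

    def process_item(self, item, spider):
        # One collection per item type; MongoDB creates it on first insert.
        self.db[type(item).__name__].insert_one(dict(item))
        return item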

ramsayleung commented 6 years ago

The stack trace you posted does not look complete; could you post the full one? As for your problem, I can guess at two causes. First, you have not installed graphite: starting the jd_spider crawler currently depends on graphite, so without it the crawler will not start. The fix is to strip out the graphite dependency yourself. Graphite is only used for monitoring, so removing it will not affect the crawler's functionality; alternatively, you can wait until I have time during the holiday to remove the dependency. The other possible cause is a problem with your MongoDB connection.
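As a rough illustration of removing that dependency (assuming the graphite stats collector is wired up via STATS_CLASS, which the log output later in this thread suggests), commenting it out in settings.py makes Scrapy fall back to its built-in in-memory stats collector:

# settings.py: disable the graphite-backed stats collector.
# With STATS_CLASS unset, Scrapy defaults to
# scrapy.statscollectors.MemoryStatsCollector.
# STATS_CLASS = 'jd.statscol.graphite.RedisGraphiteStatsCollector'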

dengerwa commented 6 years ago

OK, I will try it again.

dengerwa commented 6 years ago

Do the collections need to be created manually? It does not look like anything for the relevant keywords has been created in the database.

ramsayleung commented 6 years ago

The database I use is MongoDB, and the Python library I use to operate it is pymongo. When data is inserted, MongoDB automatically creates the database if it does not already exist:

MongoDB creates databases and collections automatically for you if they don't exist already.

If you have questions about pymongo or MongoDB, you can consult the pymongo documentation and the MongoDB documentation.

Besides, in your screenshot the jingdong database has already been created automatically.
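To make the quoted behaviour concrete, here is a small self-contained pymongo example (the collection name and document are illustrative, not from the repo):

import pymongo

# Neither the database nor the collection needs to exist beforehand;
# MongoDB creates both lazily on the first insert.
client = pymongo.MongoClient("192.168.1.168", 27017)  # server/port as in settings.py above
db = client["jindong"]
db["products"].insert_one({"sku": "12345"})           # "jindong" and "products" created here
print(client.list_database_names())                   # "jindong" now appears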

dengerwa commented 6 years ago

That database was one I created manually. Thanks, I will go look at the pymongo documentation.

dengerwa commented 6 years ago

/usr/bin/python3.5 /home/dengbo/pywork/jd_spider-master/jd/jd/spiders/jd.py
Traceback (most recent call last):
  File "/home/dengbo/pywork/jd_spider-master/jd/jd/spiders/jd.py", line 10, in <module>
    from jd.items import ParameterItem
  File "/home/dengbo/pywork/jd_spider-master/jd/jd/spiders/jd.py", line 10, in <module>
    from jd.items import ParameterItem
ImportError: No module named 'jd.items'; 'jd' is not a package

I still cannot get it working. Mongo is installed, but I cannot connect to it. If you have some spare time, could you take a look for me?

ramsayleung commented 6 years ago

This error has nothing to do with Mongo. Could you post a screenshot of the path you are running the crawler from? I think you are running it from the wrong path. Try executing scrapy crawl jindong in /home/dengbo/pywork/jd_spider-master/jd.
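For context: when spiders/jd.py is executed directly as a script, Python resolves the name jd to that jd.py file itself rather than to the jd package, which is exactly what the ImportError above reports ('jd' is not a package). Launching through the Scrapy command line from inside the project directory avoids this:

cd /home/dengbo/pywork/jd_spider-master/jd
scrapy crawl jindong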

dengerwa commented 6 years ago

(screenshot: 2018-04-06 17-57-31) Could you check whether this is how it should be run?

dengerwa commented 6 years ago

root@dengbo-ThinkPad:/home/dengbo/pywork/jd_spider-master/jd# scrapy crawl jindong
2018-04-06 18:01:28 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: jd)
2018-04-06 18:01:28 [scrapy.utils.log] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy_redis.dupefilter.RFPDupeFilter', 'SCHEDULER': 'scrapy_redis.scheduler.Scheduler', 'SPIDER_MODULES': ['jd.spiders'], 'BOT_NAME': 'jd', 'NEWSPIDER_MODULE': 'jd.spiders', 'CONCURRENT_REQUESTS_PER_DOMAIN': 16, 'STATS_CLASS': 'jd.statscol.graphite.RedisGraphiteStatsCollector', 'CONCURRENT_REQUESTS': 32}
2018-04-06 18:01:28 [py.warnings] WARNING: /home/dengbo/pywork/jd_spider-master/jd/jd/statscol/graphite.py:7: ScrapyDeprecationWarning: Module scrapy.log has been deprecated, Scrapy now relies on the builtin Python library for logging. Read the updated logging entry in the documentation to learn more.
  from scrapy import log

2018-04-06 18:01:28 [py.warnings] WARNING: /home/dengbo/pywork/jd_spider-master/jd/jd/statscol/graphite.py:8: ScrapyDeprecationWarning: Module scrapy.statscol is deprecated, use scrapy.statscollectors instead
  from scrapy.statscol import StatsCollector

2018-04-06 18:01:28 [jindong] WARNING: could not connect to graphite
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/redis/connection.py", line 439, in connect
    sock = self._connect()
  File "/usr/local/lib/python3.5/dist-packages/redis/connection.py", line 494, in _connect
    raise err
  File "/usr/local/lib/python3.5/dist-packages/redis/connection.py", line 482, in _connect
    sock.connect(socket_address)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/redis/client.py", line 572, in execute_command
    connection.send_command(*args)
  File "/usr/local/lib/python3.5/dist-packages/redis/connection.py", line 563, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/usr/local/lib/python3.5/dist-packages/redis/connection.py", line 538, in send_packed_command
    self.connect()
  File "/usr/local/lib/python3.5/dist-packages/redis/connection.py", line 442, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 111 connecting to localhost:6379. Connection refused.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/redis/connection.py", line 439, in connect
    sock = self._connect()
  File "/usr/local/lib/python3.5/dist-packages/redis/connection.py", line 494, in _connect
    raise err
  File "/usr/local/lib/python3.5/dist-packages/redis/connection.py", line 482, in _connect
    sock.connect(socket_address)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/usr/local/lib/python3.5/dist-packages/scrapy/cmdline.py", line 149, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/usr/local/lib/python3.5/dist-packages/scrapy/cmdline.py", line 89, in _run_print_help
    func(*a, **kw)
  File "/usr/local/lib/python3.5/dist-packages/scrapy/cmdline.py", line 156, in _run_command
    cmd.run(args, opts)
  File "/usr/local/lib/python3.5/dist-packages/scrapy/commands/crawl.py", line 57, in run
    self.crawler_process.crawl(spname, **opts.spargs)
  File "/usr/local/lib/python3.5/dist-packages/scrapy/crawler.py", line 167, in crawl
    crawler = self.create_crawler(crawler_or_spidercls)
  File "/usr/local/lib/python3.5/dist-packages/scrapy/crawler.py", line 195, in create_crawler
    return self._create_crawler(crawler_or_spidercls)
  File "/usr/local/lib/python3.5/dist-packages/scrapy/crawler.py", line 200, in _create_crawler
    return Crawler(spidercls, self.settings)
  File "/usr/local/lib/python3.5/dist-packages/scrapy/crawler.py", line 52, in __init__
    self.extensions = ExtensionManager.from_crawler(self)
  File "/usr/local/lib/python3.5/dist-packages/scrapy/middleware.py", line 58, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/usr/local/lib/python3.5/dist-packages/scrapy/middleware.py", line 53, in from_settings
    extra={'crawler': crawler})
  File "/usr/lib/python3.5/logging/__init__.py", line 1279, in info
    self._log(INFO, msg, args, **kwargs)
  File "/usr/lib/python3.5/logging/__init__.py", line 1415, in _log
    self.handle(record)
  File "/usr/lib/python3.5/logging/__init__.py", line 1425, in handle
    self.callHandlers(record)
  File "/usr/lib/python3.5/logging/__init__.py", line 1487, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python3.5/logging/__init__.py", line 855, in handle
    self.emit(record)
  File "/usr/local/lib/python3.5/dist-packages/scrapy/utils/log.py", line 179, in emit
    self.crawler.stats.inc_value(sname)
  File "/home/dengbo/pywork/jd_spider-master/jd/jd/statscol/graphite.py", line 242, in inc_value
    key, count, start, spider)
  File "/home/dengbo/pywork/jd_spider-master/jd/jd/statscol/graphite.py", line 153, in inc_value
    if not self.server.hexists(self.stats_key, key):
  File "/usr/local/lib/python3.5/dist-packages/redis/client.py", line 1853, in hexists
    return self.execute_command('HEXISTS', name, key)
  File "/usr/local/lib/python3.5/dist-packages/redis/client.py", line 578, in execute_command
    connection.send_command(*args)
  File "/usr/local/lib/python3.5/dist-packages/redis/connection.py", line 563, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/usr/local/lib/python3.5/dist-packages/redis/connection.py", line 538, in send_packed_command
    self.connect()
  File "/usr/local/lib/python3.5/dist-packages/redis/connection.py", line 442, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 111 connecting to localhost:6379. Connection refused.
root@dengbo-ThinkPad:/home/dengbo/pywork/jd_spider-master/jd#

ramsayleung commented 6 years ago

No. Use the command line, not PyCharm:

cd /home/dengbo/pywork/jd_spider-master/jd
scrapy crawl jindong

One more thing: pull the latest code. The original code errors out when graphite is not installed, and you also have not installed Redis.

dengerwa commented 6 years ago

raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 111 connecting to localhost:6379. Connection refused.

(screenshot: 2018-04-06 22-37-52) I just downloaded the latest version, and redis is installed.

dengerwa commented 6 years ago

This problem was caused by the Redis server not being installed. And here I was frustrated, haha. The solution:

sudo apt-get install redis-server
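For reference, once redis-server is installed and running, you can confirm that the connection that was previously refused (localhost:6379, per the traceback above) now works, for example with the redis-py client that is already installed:

import redis

# Ping the default host/port that the crawler's stats collector connects to.
r = redis.StrictRedis(host='localhost', port=6379)
print(r.ping())  # True means redis-server is up and reachable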

dengerwa commented 6 years ago

Could I ask about PROXY_LIST = 'path/to/proxy_ip.txt'? Should this path be filled in as an absolute path, or written the way it appears in your tutorial?

ramsayleung commented 6 years ago

What I wrote in the documentation is literally PROXY_LIST = 'path/to/proxy_ip.txt'. If you have proxy IPs, both a relative and an absolute path will work, but a relative path depends on the directory you run from, so an absolute path is more suitable for you.
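For example (an illustrative path only; substitute the actual location of your proxy file):

# settings.py: an absolute path does not depend on the directory
# that scrapy crawl is launched from.
PROXY_LIST = '/home/dengbo/pywork/jd_spider-master/proxy_ip.txt'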

dengerwa commented 6 years ago

One more question: can the proxy IPs be given without the http:// prefix?