gnosis23 opened 5 years ago
Scrapy is highly extensible: you can enable features as needed through the settings. For example, to add a pipeline, you write something like this:
```python
ITEM_PIPELINES = {
    'spider1.pipelines.PriceConverterPipeline': 300,
    'spider1.pipelines.MongoDBPipeline': 400,
}
```
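For reference, each class named in that dict is just an ordinary class exposing a `process_item` method. A minimal sketch of what a `PriceConverterPipeline` could look like (the exchange rate and the `price` field here are made-up assumptions, not from the actual project):

```python
class PriceConverterPipeline:
    # Hypothetical USD -> CNY rate; a real pipeline would configure this.
    EXCHANGE_RATE = 7.0

    def process_item(self, item, spider):
        # Scrapy calls process_item once per scraped item; whatever is
        # returned is handed to the next pipeline in priority order.
        item['price'] = float(item['price']) * self.EXCHANGE_RATE
        return item
```

The integer values (300, 400) only determine the order in which Scrapy runs the pipelines: lower numbers run first.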
Both of those classes are custom. So how does Python turn the strings into the actual classes?
Start with `load_object`. The key piece is `import_module`, which lets you import a module by name at runtime. The `importlib` module it comes from was added in Python 2.7.
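To see `import_module` on its own, here it is loading a stdlib module by its dotted name at runtime:

```python
from importlib import import_module

# The module name is an ordinary string, so it can come from a settings file.
mod = import_module('json')
print(mod.dumps({'a': 1}))  # -> {"a": 1}
```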
```python
from importlib import import_module

def load_object(path):
    """Load an object given its absolute object path, and return it.

    object can be a class, function, variable or an instance.
    path ie: 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware'
    """
    try:
        dot = path.rindex('.')
    except ValueError:
        raise ValueError("Error loading object '%s': not a full path" % path)
    module, name = path[:dot], path[dot+1:]
    mod = import_module(module)
    try:
        obj = getattr(mod, name)
    except AttributeError:
        raise NameError("Module '%s' doesn't define any object named '%s'" % (module, name))
    return obj
```
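A quick way to convince yourself of what `load_object` does is to point it at a stdlib path. A self-contained sketch repeating the same core logic:

```python
from importlib import import_module

def load_object(path):
    # Same core logic as Scrapy's helper: split off the attribute name,
    # import the module part, then fetch the attribute from the module.
    dot = path.rindex('.')
    module, name = path[:dot], path[dot + 1:]
    return getattr(import_module(module), name)

cls = load_object('collections.OrderedDict')
print(cls)  # the actual class object, ready to be instantiated
```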
Now look at how `load_object` is used. Note that a method decorated with `@classmethod` can instantiate the class through `cls`. It feels a lot like dependency injection...
```python
class Scheduler(object):

    def __init__(self, dupefilter, jobdir=None, dqclass=None, mqclass=None,
                 logunser=False, stats=None, pqclass=None, crawler=None):
        self.df = dupefilter
        self.dqdir = self._dqdir(jobdir)
        self.pqclass = pqclass
        self.dqclass = dqclass
        self.mqclass = mqclass
        self.logunser = logunser
        self.stats = stats
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        dupefilter_cls = load_object(settings['DUPEFILTER_CLASS'])
        dupefilter = create_instance(dupefilter_cls, settings, crawler)
        pqclass = load_object(settings['SCHEDULER_PRIORITY_QUEUE'])
        dqclass = load_object(settings['SCHEDULER_DISK_QUEUE'])
        mqclass = load_object(settings['SCHEDULER_MEMORY_QUEUE'])
        logunser = settings.getbool('SCHEDULER_DEBUG')
        return cls(dupefilter, jobdir=job_dir(settings), logunser=logunser,
                   stats=crawler.stats, pqclass=pqclass, dqclass=dqclass,
                   mqclass=mqclass, crawler=crawler)
```
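The pattern boils down to: a `@classmethod` factory reads class paths from configuration, resolves them with `load_object`, and injects the resulting objects into `cls(...)`. A stripped-down sketch (the `Downloader` class and the `QUEUE_CLASS` key are hypothetical, not Scrapy names):

```python
from importlib import import_module

def load_object(path):
    # same helper as above, minus the error handling
    dot = path.rindex('.')
    return getattr(import_module(path[:dot]), path[dot + 1:])

class Downloader:
    def __init__(self, queue_cls):
        # The collaborator class arrives ready-resolved: dependency injection.
        self.queue = queue_cls()

    @classmethod
    def from_settings(cls, settings):
        # Using cls (not Downloader) means subclasses get built correctly too.
        queue_cls = load_object(settings['QUEUE_CLASS'])
        return cls(queue_cls)

d = Downloader.from_settings({'QUEUE_CLASS': 'collections.deque'})
print(type(d.queue))
```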
Building a basic starter crawler with this package is really easy. You just need to learn the core concepts up front, plus some Python syntax...
Environment setup
Running the script
References