
Scrapy Crawler #49

Open gnosis23 opened 5 years ago

gnosis23 commented 5 years ago

Building a basic starter crawler with this package is just so easy.

To start, learn the basic concepts, plus a bit of Python syntax...

Environment Setup

Running the Scripts

# installation
pip install scrapy

# create a project, generate a spider inside it, then run the spider
scrapy startproject spider1
scrapy genspider [options] <name> <domain>
scrapy crawl spider_name
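
To get a feel for how little code a working spider needs, here is a minimal sketch (quotes.toscrape.com is a public practice site for scrapers; the spider name and selectors here are illustrative, not from the original project):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # yield one item per quote block on the page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
        # follow the pagination link, if any
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Put the file under spider1/spiders/ and run it with scrapy crawl quotes -o quotes.json to dump the items to a JSON file.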


gnosis23 commented 5 years ago

Tricks in the Source Code

Loading Classes at Runtime

Scrapy is very extensible: various features can be added as needed through the settings. For example, to add a pipeline, you write something like this:

ITEM_PIPELINES = {
  'spider1.pipelines.PriceConverterPipeline': 300,
  'spider1.pipelines.MongoDBPipeline': 400,
}
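
The integer values (300, 400) control the order in which the pipelines run: lower numbers run first. As a rough sketch of what such a pipeline looks like (the field name and exchange rate are made up for illustration, not the project's actual code), each one just implements process_item:

class PriceConverterPipeline(object):
    """Hypothetical example: convert a 'price' field from GBP to CNY."""

    exchange_rate = 8.53  # assumed fixed rate, for illustration only

    def process_item(self, item, spider):
        # item['price'] is assumed to look like '£52.29'
        price = float(item['price'][1:]) * self.exchange_rate
        item['price'] = '¥%.2f' % price
        return item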

Both of those classes are custom. So how does Python turn those strings into the actual classes?

First look at the load_object function; the key piece is import_module, which can load a module at runtime. It seems importlib was added in Python 2.7.

from importlib import import_module

def load_object(path):
    """Load an object given its absolute object path, and return it.

    object can be a class, function, variable or an instance.
    path ie: 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware'
    """

    try:
        dot = path.rindex('.')
    except ValueError:
        raise ValueError("Error loading object '%s': not a full path" % path)

    module, name = path[:dot], path[dot+1:]
    mod = import_module(module)

    try:
        obj = getattr(mod, name)
    except AttributeError:
        raise NameError("Module '%s' doesn't define any object named '%s'" % (module, name))

    return obj
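
A quick illustration of what load_object returns, using a standard-library class (any importable dotted path works, not just Scrapy ones):

# resolve a dotted path into the object it names
cls = load_object('collections.OrderedDict')
assert cls.__name__ == 'OrderedDict'
d = cls(a=1)  # instantiate it like a normally imported class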

Next, look at how load_object is used. Note that a method decorated with @classmethod can instantiate the class via cls, which feels a bit like dependency injection...

class Scheduler(object):
    def __init__(self, dupefilter, jobdir=None, dqclass=None, mqclass=None,
                 logunser=False, stats=None, pqclass=None, crawler=None):
        self.df = dupefilter
        self.dqdir = self._dqdir(jobdir)
        self.pqclass = pqclass
        self.dqclass = dqclass
        self.mqclass = mqclass
        self.logunser = logunser
        self.stats = stats
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        dupefilter_cls = load_object(settings['DUPEFILTER_CLASS'])
        dupefilter = create_instance(dupefilter_cls, settings, crawler)
        pqclass = load_object(settings['SCHEDULER_PRIORITY_QUEUE'])
        # these are also loaded from settings in the original source;
        # without them the names used below would be undefined
        dqclass = load_object(settings['SCHEDULER_DISK_QUEUE'])
        mqclass = load_object(settings['SCHEDULER_MEMORY_QUEUE'])
        logunser = settings.getbool('SCHEDULER_DEBUG')
        return cls(dupefilter, jobdir=job_dir(settings), logunser=logunser,
                   stats=crawler.stats, pqclass=pqclass, dqclass=dqclass,
                   mqclass=mqclass, crawler=crawler)
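
The same pattern is easy to reproduce outside Scrapy. Here is a minimal sketch with made-up names (Downloader and CACHE_CLASS are hypothetical; it reuses the load_object shown above):

class Downloader(object):
    def __init__(self, cache_cls):
        # the concrete class is injected; Downloader never hard-codes it
        self.cache = cache_cls()

    @classmethod
    def from_settings(cls, settings):
        # resolve the dotted path from settings, then build the instance via cls(...)
        cache_cls = load_object(settings['CACHE_CLASS'])
        return cls(cache_cls)

downloader = Downloader.from_settings({'CACHE_CLASS': 'collections.OrderedDict'})
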
gnosis23 commented 5 years ago

Resources

Test sites

Free proxies

They're all garbage.

Paid proxies

“爬虫代理哪家强?十大付费代理详细对比评测出炉!” (Which crawler proxy is best? A detailed comparison review of ten paid proxies)