scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

Customized dupefilter slows down item pipeline's speed #4689

Closed Rossil2012 closed 4 years ago

Rossil2012 commented 4 years ago

Description

When using scrapy-redis or a self-defined dupefilter, the item pipeline becomes extremely slow. It's not a Redis problem, because when I move the deduplication into a downloader middleware (check the URL fingerprint in process_request, add the URL fingerprint in process_response), the item pipeline's speed returns to normal. Here are two links with detailed information: https://github.com/rmax/scrapy-redis/issues/174 https://stackoverflow.com/questions/63026873/scarpy-redis-slows-down-item-pipelines
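
A simplified sketch of that middleware approach (not my exact code; the Redis host and key are placeholders):

import hashlib

from redis import StrictRedis
from scrapy.exceptions import IgnoreRequest
from w3lib.url import canonicalize_url

class RedisDedupeMiddleware:
    """Deduplicate in a downloader middleware instead of the dupefilter."""

    def __init__(self):
        self.redis = StrictRedis(host="xx.xx.xx.xx", port=6379, db=1)
        self.key = "seen_urls"

    def _fingerprint(self, url):
        return hashlib.sha1(canonicalize_url(url).encode("utf-8")).hexdigest()

    def process_request(self, request, spider):
        # Check the URL fingerprint before downloading.
        if self.redis.sismember(self.key, self._fingerprint(request.url)):
            raise IgnoreRequest(f"duplicate request: {request.url}")

    def process_response(self, request, response, spider):
        # Add the URL fingerprint after a successful download.
        self.redis.sadd(self.key, self._fingerprint(request.url))
        return response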

Steps to Reproduce

  1. Set DUPEFILTER_CLASS to a custom dupefilter class, e.g. "project.filename.CustomDupeFilter"

Expected behavior: Item pipeline's speed should not vary dramatically

Actual behavior: Item pipeline's speed is extremely slow

Reproduces how often: Always

Versions

Scrapy       : 2.2.0
lxml         : 4.5.2.0
libxml2      : 2.9.10
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 20.3.0
Python       : 3.7.7 (default, Mar 10 2020, 15:43:33) - [Clang 11.0.0 (clang-1100.0.33.17)]
pyOpenSSL    : 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020)
cryptography : 2.9.2
Platform     : Darwin-19.5.0-x86_64-i386-64bit

elacuesta commented 4 years ago

Please share a minimal, reproducible example; it's hard to debug an issue without any code.

The following is a simple snippet that could be used as a template. It scrapes 111 items, filtering out 1957 duplicated pages. In my tests, the elapsed time goes from ~5s to ~30s if the time.sleep line is uncommented.

import time
from scrapy import Spider
from scrapy.dupefilters import RFPDupeFilter

class Dupefilter(RFPDupeFilter):
    def request_seen(self, request):
        # time.sleep(0.01)  # blocking, uncomment to decrease performance
        return super().request_seen(request)

class Pipeline:
    def process_item(self, item, spider):
        print(item)  # just to see the items are being processed
        return item

class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/page/1/"]
    custom_settings = {
        "DUPEFILTER_CLASS": __name__ + ".Dupefilter",
        "ITEM_PIPELINES": {
             __name__ + ".Pipeline": 100,
        },
    }

    def parse(self, response):
        yield dict(url=response.url)
        yield from response.follow_all(css="a.tag")
Rossil2012 commented 4 years ago

Here is the code.

import hashlib
import os

import pymongo
import redis
from itemadapter import ItemAdapter
from redis import StrictRedis
from scrapy import Spider
from scrapy.dupefilters import RFPDupeFilter
from w3lib.url import canonicalize_url

class URLRedisFilter(RFPDupeFilter):
    def __init__(self, path=None, debug=False):
        RFPDupeFilter.__init__(self, path)
        self.dupefilter = UrlFilter()

    def request_seen(self, request):
        # Blocking SISMEMBER round-trip to Redis for every scheduled request.
        if self.dupefilter.check_url(request.url):
            return True

        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)

class UrlFilter(object):
    def __init__(self):
        redis_config = {
            "host": "xx.xx.xx.xx",  # redis ip
            "port": 6379,
            "password": "xxxx",
            "db": 1,
        }
        pool = redis.ConnectionPool(**redis_config)
        self.pool = pool
        self.redis = StrictRedis(connection_pool=pool)
        self.key = "xxxx"

    def url_sha1(self, url):
        fp = hashlib.sha1()
        fp.update(canonicalize_url(url).encode("utf-8"))
        url_sha1 = fp.hexdigest()
        return url_sha1

    def check_url(self, url):
        sha1 = self.url_sha1(url)
        isExist = self.redis.sismember(self.key, sha1)
        return isExist

    def add_url(self, url):
        sha1 = self.url_sha1(url)
        added = self.redis.sadd(self.key, sha1)
        return added

class Pipeline:
    def open_spider(self, spider):
        self.mongo_client = pymongo.MongoClient('mongodb://usr:pwd@xx.xx.xx.xx:27017/xxx')

    def close_spider(self, spider):
        self.mongo_client.close()

    def process_item(self, item, spider):
        item_dict = ItemAdapter(item).asdict()
        mongo = self.mongo_client['xxx']['xxx']
        # Blocking insert; it runs in the reactor thread until it completes.
        mongo.insert_one(item_dict)
        return item

class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/page/1/"]
    custom_settings = {
        "DUPEFILTER_CLASS": __name__ + ".Dupefilter",
        "ITEM_PIPELINES": {
             __name__ + ".Pipeline": 100,
        },
    }

    def parse(self, response):
        yield dict(url=response.url)
        yield from response.follow_all(css="a.tag")
wRAR commented 4 years ago

Slowdowns are expected when you use long blocking operations.

elacuesta commented 4 years ago

As @wRAR says, blocking operations such as mongo's insert_one are expected to slow down the process. You should consider switching to a library that supports coroutine syntax (pymongo does not, AFAIK) or returning a Deferred from the pipeline's process_item method.
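
For example, a sketch of the Deferred approach with the pymongo pipeline from your snippet (connection string and collection names are placeholders):

import pymongo
from itemadapter import ItemAdapter
from twisted.internet.threads import deferToThread

class Pipeline:
    def open_spider(self, spider):
        self.mongo_client = pymongo.MongoClient('mongodb://usr:pwd@xx.xx.xx.xx:27017/xxx')

    def close_spider(self, spider):
        self.mongo_client.close()

    def process_item(self, item, spider):
        item_dict = ItemAdapter(item).asdict()
        collection = self.mongo_client['xxx']['xxx']
        # Run the blocking insert in the reactor's thread pool and return the
        # Deferred, so the engine is not blocked while the write completes.
        d = deferToThread(collection.insert_one, item_dict)
        d.addCallback(lambda _: item)  # pass the item on to later pipelines
        return d

pymongo's MongoClient is thread-safe, so calling insert_one from the thread pool should be fine.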

Rossil2012 commented 4 years ago

Thanks for your help. But I'm still confused, because when I put check_url and add_url in the downloader middleware instead of the dupefilter, the MongoDB operations are much faster.

wRAR commented 4 years ago

How did you measure that?

Rossil2012 commented 4 years ago

I timed 10 minutes for each method: when using the dupefilter, only 50 items are scraped and stored in MongoDB, while with the other method there are more than 1000 items, which is almost the same speed as when I didn't use any dupefilter or middleware. Also, when I press CTRL-C and see from the log that all the requests are finished and only the item pipeline is still executing, the speed is 0-4 items/min, which is much slower than inserting into MongoDB manually.

Rossil2012 commented 4 years ago

Additionally, I have tried deferToThread for the process_item method in the Pipeline, but it doesn't work.

wRAR commented 4 years ago

It's unfortunately too complicated to replicate your setup and you didn't provide a minimal reproducible example so we will need to guess what happens.

I timed 10 minutes for each method: when using the dupefilter, only 50 items are scraped and stored in MongoDB, while with the other method there are more than 1000 items

Have you checked that they are stored or just that Scrapy said they were scraped?

OTOH synchronous operations don't necessarily lead to significant delays so it's plausible that the Mongo pipeline doesn't cause problems but the Redis dupefilter does.
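
One way to check is to time the blocking call in the dupefilter (a sketch, subclassing the URLRedisFilter class from the snippet above; the logging is only illustrative):

import logging
import time

logger = logging.getLogger(__name__)

class TimedURLRedisFilter(URLRedisFilter):
    def request_seen(self, request):
        # Log how long each Redis round-trip takes for every scheduled request.
        start = time.perf_counter()
        try:
            return super().request_seen(request)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            logger.info("request_seen(%s) took %.1f ms", request.url, elapsed_ms)

Timing the pipeline's insert_one call the same way would show whether the Mongo side contributes as well.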

Rossil2012 commented 4 years ago

Have you checked that they are stored or just that Scrapy said they were scraped?

Yes, I checked MongoDB directly.

It's unfortunately too complicated to replicate your setup and you didn't provide a minimal reproducible example so we will need to guess what happens.

I'm sorry that I can't provide my full code right now, because it's part of a class project and my teacher asks us to keep it private until the class is over, and I'm also using a hacked version of a parsing package, so it's far too much to post here. But the only difference between the real code and the one above is that I'm scraping Wikipedia through its API rather than "http://quotes.toscrape.com/page/1/". You may wonder why I use Scrapy to scrape a website that already provides an API: it's because I don't want to write a concurrent spider from scratch, and Scrapy can easily be extended into a distributed spider. Moreover, in the spider's parse method I'm using the requests library to make some requests whose results I need immediately, and I only yield the requests that need to be scheduled under Scrapy's control.

Gallaecio commented 4 years ago

I'm sorry that I can't provide my full code right now

A minimal reproducible example means the opposite: you provided too much code. We are asking for a stripped-down version of your code, removing anything that is not necessary to reproduce the issue.

wRAR commented 4 years ago

Moreover, in the spider's parse method I'm using the requests library to make some requests whose results I need immediately

This is synchronous too.
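
If those immediate lookups need to stay non-blocking, one option (a sketch only; the extra URL and callback names are hypothetical, not from this thread) is to let Scrapy schedule them as regular requests and carry state through cb_kwargs:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/page/1/"]

    def parse(self, response):
        # Instead of a blocking requests.get() call, yield a normal Scrapy
        # request and continue the work in its callback.
        yield scrapy.Request(
            "http://quotes.toscrape.com/random",  # hypothetical extra URL
            callback=self.parse_extra,
            cb_kwargs={"source_url": response.url},
            dont_filter=True,  # keep helper requests out of the dupefilter
        )
        yield from response.follow_all(css="a.tag")

    def parse_extra(self, response, source_url):
        # The data that previously needed an immediate, blocking call
        # arrives here asynchronously instead.
        yield {"url": source_url, "extra": response.css("span.text::text").get()}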