Closed: Rossil2012 closed this issue 4 years ago.
Please share a Minimal, Reproducible Example; it's hard to debug an issue without any code.
The following is a simple snippet that could be used as a template. It scrapes 111 items, filtering out 1957 duplicated pages. In my tests, the elapsed time goes from ~5s to ~30s if the time.sleep line is uncommented.
import time

from scrapy import Spider
from scrapy.dupefilters import RFPDupeFilter


class Dupefilter(RFPDupeFilter):
    def request_seen(self, request):
        # time.sleep(0.01)  # blocking, uncomment to decrease performance
        return super().request_seen(request)


class Pipeline:
    def process_item(self, item, spider):
        print(item)  # just to see the items are being processed
        return item


class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/page/1/"]
    custom_settings = {
        "DUPEFILTER_CLASS": __name__ + ".Dupefilter",
        "ITEM_PIPELINES": {
            __name__ + ".Pipeline": 100,
        },
    }

    def parse(self, response):
        yield dict(url=response.url)
        yield from response.follow_all(css="a.tag")
Here is the code.
import hashlib
from redis import StrictRedis
from scrapy import Spider
from scrapy.dupefilters import RFPDupeFilter
import os
import redis
from w3lib.url import canonicalize_url
from itemadapter import ItemAdapter
import pymongo


class URLRedisFilter(RFPDupeFilter):
    def __init__(self, path=None, debug=False):
        RFPDupeFilter.__init__(self, path)
        self.dupefilter = UrlFilter()

    def request_seen(self, request):
        if self.dupefilter.check_url(request.url):
            return True
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)


class UrlFilter(object):
    def __init__(self):
        redis_config = {
            "host": "xx.xx.xx.xx",  # redis ip
            "port": 6379,
            "password": "xxxx",
            "db": 1,
        }
        pool = redis.ConnectionPool(**redis_config)
        self.pool = pool
        self.redis = StrictRedis(connection_pool=pool)
        self.key = "xxxx"

    def url_sha1(self, url):
        fp = hashlib.sha1()
        fp.update(canonicalize_url(url).encode("utf-8"))
        url_sha1 = fp.hexdigest()
        return url_sha1

    def check_url(self, url):
        sha1 = self.url_sha1(url)
        isExist = self.redis.sismember(self.key, sha1)
        return isExist

    def add_url(self, url):
        sha1 = self.url_sha1(url)
        added = self.redis.sadd(self.key, sha1)
        return added


class Pipeline:
    def open_spider(self, spider):
        self.mongo_client = pymongo.MongoClient('mongodb://usr:pwd@xx.xx.xx.xx:27017/xxx')

    def close_spider(self, spider):
        self.mongo_client.close()

    def process_item(self, item, spider):
        item_dict = ItemAdapter(item).asdict()
        mongo = self.mongo_client['xxx']['xxx']
        mongo.insert_one(item_dict)
        return item


class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/page/1/"]
    custom_settings = {
        "DUPEFILTER_CLASS": __name__ + ".URLRedisFilter",
        "ITEM_PIPELINES": {
            __name__ + ".Pipeline": 100,
        },
    }

    def parse(self, response):
        yield dict(url=response.url)
        yield from response.follow_all(css="a.tag")
Slowdowns are expected when you use long blocking operations.
As @wRAR says, blocking operations such as Mongo's insert_one are expected to slow down the process. You should consider switching to a library that supports coroutine syntax (pymongo does not, AFAIK) or returning a Deferred from the pipeline's process_item method.
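For illustration only, here is a minimal sketch of the Deferred approach: it keeps pymongo but pushes the blocking insert onto Twisted's thread pool with deferToThread. The connection string and collection names are placeholders, and error handling is omitted.

import pymongo
from itemadapter import ItemAdapter
from twisted.internet.threads import deferToThread


class MongoPipeline:
    def open_spider(self, spider):
        # Placeholder connection details.
        self.mongo_client = pymongo.MongoClient("mongodb://usr:pwd@host:27017/db")
        self.collection = self.mongo_client["db"]["collection"]

    def close_spider(self, spider):
        self.mongo_client.close()

    def _insert(self, item_dict):
        # Runs in a thread pool thread, so the blocking call
        # does not stall the reactor.
        self.collection.insert_one(item_dict)

    def process_item(self, item, spider):
        item_dict = ItemAdapter(item).asdict()
        # Returning a Deferred lets Scrapy wait for the insert without
        # blocking the event loop; the item is passed on once it fires.
        d = deferToThread(self._insert, item_dict)
        d.addCallback(lambda _: item)
        return d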
Thanks for your help. But I'm still confused, because when I place check_url and add_url in a downloader middleware instead of the dupefilter, the MongoDB operations are much faster.
How did you measure that?
I timed 10 minutes for each method: when using the dupefilter, only 50 items are scraped and stored in MongoDB, while with the other method there are more than 1000 items, which is almost the same speed as when I didn't write any dupefilter or middleware at all. Also, when I press CTRL-C and see from the log that all the requests are finished and only the item pipeline is still executing, the speed is 0-4 items/min, which is much slower than inserting into MongoDB manually.
Additionally, I have tried deferToThread in the pipeline's process_item method, but it doesn't work.
It's unfortunately too complicated to replicate your setup and you didn't provide a minimal reproducible example so we will need to guess what happens.
I count down 10 minutes for each methods, when using dupefilter only 50 items are scraped and stored in MongoDB, while using the other method there are more than 1000 items
Have you checked that they are stored or just that Scrapy said they were scraped?
OTOH, synchronous operations don't necessarily lead to significant delays, so it's plausible that the Mongo pipeline doesn't cause problems but the Redis dupefilter does.
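If the Redis dupefilter is the bottleneck, one way to confirm (and mitigate) it is to avoid a Redis round trip on every single request, for example by checking the local fingerprint set first and only querying Redis for fingerprints this process hasn't seen yet. A rough sketch, reusing the UrlFilter helper posted above (its exact semantics are assumed from that code, and this version caches Redis hits locally, unlike the original):

import os

from scrapy.dupefilters import RFPDupeFilter


class CachedURLRedisFilter(RFPDupeFilter):
    """Sketch: hit Redis only for fingerprints not already seen locally."""

    def __init__(self, path=None, debug=False):
        super().__init__(path, debug)
        self.dupefilter = UrlFilter()  # the helper class shown earlier

    def request_seen(self, request):
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            # Already seen by this process: no Redis round trip needed.
            return True
        if self.dupefilter.check_url(request.url):
            # Seen by another process according to Redis; remember it
            # locally so we don't ask Redis about it again.
            self.fingerprints.add(fp)
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)
        return False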
Have you checked that they are stored or just that Scrapy said they were scraped?
Yes, I checked MongoDB directly.
It's unfortunately too complicated to replicate your setup and you didn't provide a minimal reproducible example so we will need to guess what happens.
I'm sorry that I can't provide my full code right now: it's part of a class project and my teacher asks us to keep it private until the course is over, and I also use a hacked version of a parsing package, so it's far too much to paste here. The only difference between the real code and the code above is that I'm scraping Wikipedia through its API rather than "http://quotes.toscrape.com/page/1/". You may wonder why I use Scrapy to scrape a website that already provides an API; it's because I don't want to write a concurrent spider from scratch, and Scrapy can easily be extended into a distributed spider. Moreover, in the spider's parse method I use the requests library to make some requests whose responses I need immediately, and only yield the requests that need to be scheduled under Scrapy's control.
I'm sorry that I can't provide my full code right now
A minimal reproducible example means the opposite: you provided too much code. We are asking for a stripped-down version of your code, with everything that is not necessary to reproduce the issue removed.
Moreover, in the parse method of the spider, I'm using requests library to make some requests that I need immediately
This is synchronous too.
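For what it's worth, one way to keep such "need it immediately" requests from blocking the reactor is to run them in a thread and await the result from a coroutine callback (Scrapy has supported async def callbacks since 2.0). A rough sketch only; the spider name and the extra request are made up for illustration:

import requests
from scrapy import Spider
from twisted.internet.threads import deferToThread


class BlockingCallSpider(Spider):
    # Sketch: moves a blocking requests call off the reactor thread.
    name = "blocking_call_sketch"
    start_urls = ["http://quotes.toscrape.com/page/1/"]

    async def parse(self, response):
        # Twisted Deferreds can be awaited inside coroutine callbacks, so
        # the blocking call runs in the reactor's thread pool instead of
        # stalling the event loop.
        extra = await deferToThread(requests.get, response.url, timeout=10)
        yield {"url": response.url, "extra_status": extra.status_code}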
Description
When using scrapy-redis or a self-defined dupefilter, the item pipeline becomes extremely slow. It's not Redis's problem, because when I move the deduplication into a downloader middleware (check the URL fingerprint in process_request, add it in process_response), the item pipeline's speed returns to normal. Here are two links with detailed information: https://github.com/rmax/scrapy-redis/issues/174 https://stackoverflow.com/questions/63026873/scarpy-redis-slows-down-item-pipelines
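For reference, the downloader-middleware variant looks roughly like this. It is only a sketch reusing the UrlFilter helper shown earlier in this thread; the details in my actual project differ.

from scrapy.exceptions import IgnoreRequest


class RedisDedupeMiddleware:
    # Sketch: check the URL fingerprint against Redis on the way out and
    # record it only once a response has come back.
    def __init__(self):
        self.url_filter = UrlFilter()  # helper class shown earlier

    def process_request(self, request, spider):
        if self.url_filter.check_url(request.url):
            raise IgnoreRequest(f"duplicate URL: {request.url}")
        return None

    def process_response(self, request, response, spider):
        self.url_filter.add_url(request.url)
        return response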
Steps to Reproduce
Expected behavior: The item pipeline's speed should not vary dramatically
Actual behavior: The item pipeline's speed is extremely slow
Reproduces how often: Always
Versions
Scrapy       : 2.2.0
lxml         : 4.5.2.0
libxml2      : 2.9.10
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 20.3.0
Python       : 3.7.7 (default, Mar 10 2020, 15:43:33) - [Clang 11.0.0 (clang-1100.0.33.17)]
pyOpenSSL    : 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020)
cryptography : 2.9.2
Platform     : Darwin-19.5.0-x86_64-i386-64bit