istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License

Website didn't respond for a long time, how to solve the problem #214

Closed Johnson0016 closed 5 years ago

Johnson0016 commented 5 years ago

Hi @madisonb, I have been using scrapy-cluster recently and it is really useful for large amounts of data, but I have run into a problem: the target website sometimes does not respond for a very long time. I know that if a response comes back as a 404 or 5xx several times, scrapy-cluster puts the URL back at the end of the Redis queue, but that mechanism does not seem to kick in when the site simply never responds. I did set DOWNLOAD_TIMEOUT to 30 seconds, but it does not always help. Do you have a good way to handle this? Should I use an errback function, or something else? Thanks for the help! The traceback I get when the timeout fires is below, followed by a sketch of the kind of errback I have in mind.

```
Traceback (most recent call last):
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/site-packages/twisted/python/failure.py", line 422, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/site-packages/scrapy/core/downloader/handlers/http11.py", line 351, in _cb_timeout
    raise TimeoutError("Getting %s took longer than %s seconds." % (url, timeout))
twisted.internet.error.TimeoutError: User timeout caused connection failure: Getting https://xxxxxxxxxxxxxxxxxxxxxxxxx.com/ took longer than 30.0 seconds..

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 66, in process_exception
    spider=spider)
  File "/home/kevin/project/scrapy-cluster/crawler/crawling/log_retry_middleware.py", line 93, in process_exception
    self._log_retry(request, exception, spider)
  File "/home/kevin/project/scrapy-cluster/crawler/crawling/log_retry_middleware.py", line 107, in _log_retry
    self.logger.error('Scraper Retry', extra=extras)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/site-packages/scutils/log_factory.py", line 254, in error
    extras = self.add_extras(extra, "ERROR")
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/site-packages/scutils/log_factory.py", line 329, in add_extras
    my_copy = copy.deepcopy(dict)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 247, in _deepcopy_method
    return type(x)(x.__func__, deepcopy(x.__self__, memo))
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 220, in _deepcopy_tuple
    y = [deepcopy(a, memo) for a in x]
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 220, in <listcomp>
    y = [deepcopy(a, memo) for a in x]
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 169, in deepcopy
    rv = reductor(4)
TypeError: can't pickle select.epoll objects
```

madisonb commented 5 years ago

We are at the mercy of Scrapy's internal processing to ensure that the website returns within a reasonable time frame. I agree that DOWNLOAD_TIMEOUT doesn't always work correctly, but my initial suggestion for working around this is to run a lot of spiders in your cluster.

For example, if you have 10 spiders and only 1 out of every 10 requests hits a really long download timeout, at least your other 9 spiders will keep working normally.
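On the DOWNLOAD_TIMEOUT side, these are plain Scrapy settings, so tuning them is just a few lines wherever you override the crawler's settings. The values below are illustrative only, not recommendations:

```python
# Illustrative values only; tune them for the sites you crawl. These are
# standard Scrapy settings, placed wherever the crawler's settings are
# overridden (for example a localsettings.py).
DOWNLOAD_TIMEOUT = 30      # seconds before the downloader gives up on a response
RETRY_ENABLED = True       # let the retry middleware re-attempt failed downloads
RETRY_TIMES = 2            # extra attempts per failed request
CONCURRENT_REQUESTS = 16   # more concurrency softens the impact of slow sites
```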

This project mainly focuses on the distributed scheduling mechanism that lets the spiders get their tasking from Redis. It does not do much to control how the spider itself downloads the HTML from the website.
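For context, tasking normally enters the cluster through the Kafka monitor rather than through the spider itself. A rough sketch of feeding a crawl request with kafka-python follows; the broker address, the demo.incoming topic name, and the appid/crawlid fields are assumptions based on the default configuration, so adjust them to your own setup:

```python
import json

from kafka import KafkaProducer

# Assumed defaults: a broker on localhost:9092 and an incoming topic named
# "demo.incoming"; both should match your cluster's configuration.
producer = KafkaProducer(bootstrap_servers="localhost:9092")

crawl_request = {
    "url": "https://example.com/",
    "appid": "testapp",    # application identifier
    "crawlid": "abc1234",  # id used to track this crawl's results
}

# The Kafka monitor validates the JSON and pushes the request into Redis,
# where the spiders pick it up from their queues.
producer.send("demo.incoming", json.dumps(crawl_request).encode("utf-8"))
producer.flush()
```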

I am going to close this issue since this seems to be more of a custom use case than a bug in the project. Feel free to hop over to Gitter if you would like to chat more.