scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

scrapy.pqueues.ScrapyPriorityQueue #6266

Open roykyle8 opened 2 months ago

roykyle8 commented 2 months ago

Description

The ScrapyPriorityQueue throws a builtins.KeyError.

2024-03-04 09:33:09,684 2112359:CRITICAL [twisted] Unhandled Error
Traceback (most recent call last):
  File "/opt/custom_crawler.py", line 107, in main
    crawler_process.start()
  File "/usr/local/lib/python3.10/dist-packages/scrapy/crawler.py", line 348, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/usr/local/lib/python3.10/dist-packages/twisted/internet/base.py", line 1318, in run
    self.mainLoop()
  File "/usr/local/lib/python3.10/dist-packages/twisted/internet/base.py", line 1328, in mainLoop
    reactorBaseSelf.runUntilCurrent()
--- <exception caught here> ---
  File "/usr/local/lib/python3.10/dist-packages/twisted/internet/base.py", line 994, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/usr/local/lib/python3.10/dist-packages/scrapy/utils/reactor.py", line 51, in __call__
    return self._func(*self._a, **self._kw)
  File "/usr/local/lib/python3.10/dist-packages/scrapy/core/engine.py", line 147, in _next_request
    while not self._needs_backout() and self._next_request_from_scheduler() is not None:
  File "/usr/local/lib/python3.10/dist-packages/scrapy/core/engine.py", line 176, in _next_request_from_scheduler
    request = self.slot.scheduler.next_request()
  File "/usr/local/lib/python3.10/dist-packages/scrapy/core/scheduler.py", line 263, in next_request
    request = self._dqpop()
  File "/usr/local/lib/python3.10/dist-packages/scrapy/core/scheduler.py", line 299, in _dqpop
    return self.dqs.pop()
  File "/usr/local/lib/python3.10/dist-packages/scrapy/pqueues.py", line 98, in pop
    q = self.queues[self.curprio]
builtins.KeyError: 0

Steps to Reproduce

The error occurs randomly and I am not sure how to reproduce it.

Versions

2.6.2

Resolution:

Since Python throws a KeyError when the key does not exist in the queue (which is a dict here), shouldn't we check whether the key exists first?

The check if self.curprio is None can be replaced with if self.curprio is None or self.curprio not in self.queues.

class ScrapyPriorityQueue:
    # other methods

    def pop(self):
        if self.curprio is None or self.curprio not in self.queues:
            return
        q = self.queues[self.curprio]
        m = q.pop()
        if not q:
            del self.queues[self.curprio]
            q.close()
            prios = [p for p, q in self.queues.items() if q]
            self.curprio = min(prios) if prios else None
        return m

wRAR commented 2 months ago

Do you have a disk queue corrupted by e.g. a prior spider crash?

Otherwise you need to make a minimal reproducible example of this problem.
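
Regarding the corrupted-disk-queue possibility, one rough way to check for leftover state is sketched below. It assumes the default on-disk layout (the scheduler recording its active priorities in JOBDIR/requests.queue/active.json, with one subdirectory per active priority next to it); adjust the paths if your setup differs.

# check_jobdir.py -- rough diagnostic sketch, assuming the default layout:
# JOBDIR/requests.queue/active.json plus one subdirectory per active priority.
import json
import sys
from pathlib import Path


def check_jobdir(jobdir):
    queue_dir = Path(jobdir, "requests.queue")
    active_file = queue_dir / "active.json"
    if not active_file.exists():
        print(f"{queue_dir}: no active.json, nothing to resume")
        return
    active = json.loads(active_file.read_text())
    on_disk = {p.name for p in queue_dir.iterdir() if p.is_dir()}
    print(f"active priorities: {active}")
    print(f"on-disk queue dirs: {sorted(on_disk)}")
    missing = [p for p in active if str(p) not in on_disk]
    if missing:
        print(f"WARNING: active priorities with no queue directory on disk: {missing}")


if __name__ == "__main__":
    check_jobdir(sys.argv[1])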

wRAR commented 2 months ago

should not we check if the key exists first?

No, it should exist there because of how the class code is written.
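
For context, here is a condensed paraphrase of the relevant upstream logic (not a verbatim copy of scrapy/pqueues.py): push() creates the per-priority queue on demand and only ever points curprio at a priority that has a queue, and priorities restored from disk each get a queue created up front, so pop() expects curprio to always be a key of self.queues.

# Condensed paraphrase of scrapy/pqueues.py (not verbatim) -- why curprio is
# expected to always be a key of self.queues when pop() runs:

class ScrapyPriorityQueue:
    # ... other methods unchanged

    def init_prios(self, startprios):
        if not startprios:
            return
        for priority in startprios:
            # every restored priority gets a queue before curprio is set
            self.queues[priority] = self.qfactory(priority)
        self.curprio = min(startprios)

    def push(self, request):
        priority = request.priority * -1
        if priority not in self.queues:
            self.queues[priority] = self.qfactory(priority)
        q = self.queues[priority]
        q.push(request)  # this may fail (e.g. a serialization error)
        if self.curprio is None or priority < self.curprio:
            # curprio only ever points at a priority that has a queue
            self.curprio = priority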

roykyle8 commented 2 months ago

@wRAR I clear the request cache directory before starting a new run, so starting a fresh crawl with a corrupted queue is not possible at all.

Regarding the reproducible example, the error does not occur every time or for every crawl, so reproducing it is very difficult.

Gallaecio commented 2 months ago

I clear the request cache directory before starting new run.

HTTPCACHE_DIR? JOBDIR? Something else?

roykyle8 commented 2 months ago

@Gallaecio I clear both HTTPCACHE_DIR and JOBDIR before starting a new run. I have a custom script (custom_crawler.py) that starts multiple spiders together.

# custom_crawler.py
from scrapy.crawler import Crawler, CrawlerProcess


def main(cmd_args):
    crawler_process = CrawlerProcess(None, install_root_handler=False)
    for sp in [cmd_args.spiders]:
        spider_instance = __load_spider_instance__(sp)  # user-defined helper
        http_cache_dir = f'/opt/http_cache/{spider_instance.name}'
        job_dir = f'/opt/job_dir/{spider_instance.name}'
        __delete_and_recreate_dir__(http_cache_dir)  # user-defined helper
        __delete_and_recreate_dir__(job_dir)

        custom_settings = {'HTTPCACHE_DIR': http_cache_dir, 'JOBDIR': job_dir}
        # SCHEDULER_DISK_QUEUE: scrapy.squeues.PickleFifoDiskQueue is set in the spider class
        # SCHEDULER_MEMORY_QUEUE: scrapy.squeues.FifoMemoryQueue is set in the spider class
        getattr(spider_instance, 'custom_settings').update(custom_settings)
        crawler_process.crawl(Crawler(spider_instance))

    crawler_process.start()

The spiders run fine, but I get a KeyError for some spiders. Also, this error does not occur on every run.

roykyle8 commented 2 months ago

@wRAR / @Gallaecio can you please guide me on how I can find and fix this issue?

wRAR commented 2 months ago

No direct ideas, but I would start by logging additions to and removals from self.queues, to see whether the code's expectation (that the key always exists there when expected) is actually being broken or not.
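
As one possible way to do that, here is a minimal sketch (my own, not part of Scrapy) of a subclass that logs every push/pop together with curprio and the current keys of self.queues; pointing the SCHEDULER_PRIORITY_QUEUE setting at it should show whether curprio ever stops matching an existing key. The module path in the comment is hypothetical.

# logging_pqueue.py -- debugging aid sketch, not part of Scrapy.
# Enable with SCHEDULER_PRIORITY_QUEUE = "myproject.logging_pqueue.LoggingPriorityQueue"
# (module path is hypothetical; adjust to your project layout).
import logging

from scrapy.pqueues import ScrapyPriorityQueue

logger = logging.getLogger(__name__)


class LoggingPriorityQueue(ScrapyPriorityQueue):
    def push(self, request):
        super().push(request)
        logger.debug("after push: curprio=%r keys=%r", self.curprio, sorted(self.queues))

    def pop(self):
        if self.curprio is not None and self.curprio not in self.queues:
            logger.error("curprio %r missing from queues %r", self.curprio, sorted(self.queues))
        request = super().pop()
        logger.debug("after pop: curprio=%r keys=%r", self.curprio, sorted(self.queues))
        return request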

roykyle8 commented 1 month ago

@wRAR @Gallaecio I am still looking for a solution and trying to debug the KeyError; in the meantime, however, I have started seeing another issue.

2024-03-27 03:30:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 03:31:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 03:32:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 03:33:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 03:34:09,963 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 03:35:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 03:36:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 03:37:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 03:38:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 03:39:09,963 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 03:40:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 03:41:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 03:42:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 03:43:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 03:44:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 03:45:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 03:46:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 03:47:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 03:48:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 03:49:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 03:50:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 03:51:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 03:52:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 03:53:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 03:54:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 03:55:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 03:56:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 03:57:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 03:58:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 03:59:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 04:00:09,963 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 04:01:09,963 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 04:02:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 04:03:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 04:04:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 04:05:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 04:06:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 04:07:09,963 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 04:08:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 04:09:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 04:10:09,967 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 04:11:09,963 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 04:12:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 04:13:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 04:14:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 04:15:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 04:16:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 04:17:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 04:18:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)
2024-03-27 04:19:09,964 1214455:INFO [scrapy.extensions.logstats] Crawled 3847 pages (at 0 pages/min), scraped 7313 items (at 0 items/min)

And the spider never completes. I have to kill the process and rerun it.