Closed: Zhou-Haowei closed this issue 4 months ago.
This report is not actionable as it is and should be distilled further; several things in it are likely unrelated. Additionally, there are missing symbols and no actual URL to test (it is relevant whether the issue occurs in every case, in which case a generic URL like https://example.org could be used). Please see https://stackoverflow.com/help/minimal-reproducible-example.
I suspect the issue you're observing is probably related to the PLAYWRIGHT_MAX_CONTEXTS=1 and PLAYWRIGHT_MAX_PAGES_PER_CONTEXT=1 settings you have defined.
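For reference, a minimal sketch of how these scrapy-playwright settings are typically declared in settings.py; the values mirror the ones discussed in this thread, and the reactor/download-handler lines are the standard scrapy-playwright setup rather than the reporter's actual configuration:

# settings.py (illustrative sketch, not the reporter's settings file)
# scrapy-playwright requires the asyncio reactor, which matches the
# asyncioreactor frames visible in the traceback further down.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# Concurrency caps discussed in this comment; the reporter set both to 1.
PLAYWRIGHT_MAX_CONTEXTS = 1
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 1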
I don't think this problem is related to PLAYWRIGHT_MAX_CONTEXTS and PLAYWRIGHT_MAX_PAGES_PER_CONTEXT. Their original values are 16 and 1; I deliberately set them to 1. A higher PLAYWRIGHT_MAX_CONTEXTS value only increases the time it takes for all contexts to hang. As for the URL to test, it's NSFW, and I don't know whether it's against the rules to post it here. Just to be clear, I am developing a content filter, so I need to collect some datasets. My temporary workaround is to create a process pool that runs scrapy crawl tasks URL by URL and kills any crawl process that exceeds a timeout, since it has most likely hung. A sketch of that workaround is below.
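A rough sketch of that workaround, assuming one URL per scrapy crawl invocation; the spider name ("my_spider"), the url spider argument, and the timeout value are illustrative assumptions, not the reporter's actual code:

# Run each URL in its own `scrapy crawl` subprocess and kill it on timeout.
import subprocess
from concurrent.futures import ProcessPoolExecutor

def crawl_one(url: str, timeout: int = 300) -> int:
    cmd = ["scrapy", "crawl", "my_spider", "-a", f"url={url}"]
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # the crawl most likely hung; kill it and move on
        proc.wait()
        return -1

def crawl_all(urls, workers: int = 4):
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(crawl_one, urls))

if __name__ == "__main__":
    # Example usage with a placeholder URL.
    crawl_all(["https://example.org"])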
Same problem here. I can't figure out why, because I can't reproduce it reliably.
Even when this occurs, sending a SIGINT signal does not terminate the crawler.
This is the traceback when it hangs:
>>> Signal received : entering python shell.
Traceback:
File "/usr/local/bin/scrapy", line 8, in <module>
sys.exit(execute())
File "/usr/local/lib/python3.10/site-packages/scrapy/cmdline.py", line 161, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "/usr/local/lib/python3.10/site-packages/scrapy/cmdline.py", line 114, in _run_print_help
func(*a, **kw)
File "/usr/local/lib/python3.10/site-packages/scrapy/cmdline.py", line 169, in _run_command
cmd.run(args, opts)
File "/app/info_scraper/commands/crawl.py", line 24, in run
self.crawler_process.start()
File "/usr/local/lib/python3.10/site-packages/scrapy/crawler.py", line 429, in start
reactor.run(installSignalHandlers=install_signal_handlers) # blocking call
File "/usr/local/lib/python3.10/site-packages/twisted/internet/asyncioreactor.py", line 253, in run
self._asyncioEventloop.run_forever()
File "/usr/local/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/usr/local/lib/python3.10/asyncio/base_events.py", line 1871, in _run_once
event_list = self._selector.select(timeout)
File "/usr/local/lib/python3.10/selectors.py", line 469, in select
Signal received : entering python shell.
fd_event_list = self._selector.poll(timeout, max_ev)
Traceback:
File "/usr/local/bin/scrapy", line 8, in <module>
sys.exit(execute())
File "/usr/local/lib/python3.10/site-packages/scrapy/cmdline.py", line 161, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "/usr/local/lib/python3.10/site-packages/scrapy/cmdline.py", line 114, in _run_print_help
func(*a, **kw)
File "/usr/local/lib/python3.10/site-packages/scrapy/cmdline.py", line 169, in _run_command
cmd.run(args, opts)
File "/app/info_scraper/commands/crawl.py", line 24, in run
self.crawler_process.start()
File "/usr/local/lib/python3.10/site-packages/scrapy/crawler.py", line 429, in start
reactor.run(installSignalHandlers=install_signal_handlers) # blocking call
File "/usr/local/lib/python3.10/site-packages/twisted/internet/asyncioreactor.py", line 253, in run
self._asyncioEventloop.run_forever()
File "/usr/local/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/usr/local/lib/python3.10/asyncio/base_events.py", line 1871, in _run_once
event_list = self._selector.select(timeout)
File "/usr/local/lib/python3.10/selectors.py", line 469, in select
fd_event_list = self._selector.poll(timeout, max_ev)
I'm revisiting this issue and the available code is still both too long (e.g. including unrelated settings and classes) and incomplete (e.g. not containing a full working spider). Please provide a minimal, reproducible example, otherwise the issue will be closed.
Closing due to inactivity.
@akkuman @Zhou-Haowei Has the problem been solved? I am facing the same issue where the crawler gets stuck, and to continue the process I have to press Cmd+C manually.
Configurations
Configuration 1:
Model Name: MacBook Pro
Chip: Apple M1 Max
Cores: 10 (8 performance and 2 efficiency)
Memory: 32 GB
System Version: macOS Sonoma 14.2.1
Playwright Version: 1.42.0
Python Version: 3.10.14
Browser: Chromium 123.0.6312.4
Browser: Firefox 123.0

Configuration 2:
Model Name:
Chip: Intel Xeon
Cores: 4 logical cores
Memory: 8 GB
System Version: Ubuntu 22.04.2 LTS (GNU/Linux 5.15.0-76-generic x86_64)
Docker Version: 25.0.4
Playwright Version: 1.42.0
Python Version: 3.10.13
Browser: Chromium 123.0.6312.4
Browser: Firefox 123.0
Description
I'm experiencing a problem where my Scrapy spider hangs indefinitely on a random URL during the crawling process without throwing any exceptions. The spider uses Scrapy Playwright for page rendering. I have tried running my project on both of the above configurations, and the same problem occurs on each.
Actual Behavior
Both Chromium and Firefox hang on scraping tasks at different URLs.
Code
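A minimal sketch of the kind of scrapy-playwright spider described in this report, using the response_callback / failure_callback names mentioned under Troubleshooting; the spider name, placeholder URL, and parsing logic are assumptions for illustration, not the reporter's actual code:

# Illustrative sketch only.
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        # Generic placeholder URL; the real URLs are not shown in this report.
        yield scrapy.Request(
            "https://example.org",
            meta={"playwright": True},
            callback=self.response_callback,
            errback=self.failure_callback,
        )

    def response_callback(self, response):
        # Per the Troubleshooting notes, execution never reaches this point
        # when the hang occurs.
        yield {"url": response.url, "status": response.status}

    def failure_callback(self, failure):
        # Nor is this errback ever invoked.
        self.logger.error("Request failed: %r", failure)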
Troubleshooting Steps Taken
cProfile: I used cProfile to profile the application, but no obvious bottlenecks were identified.
Debugging: I used the Python debugger to step through the code and found the following: the goto function was called, but neither the callback=self.response_callback nor the errback=self.failure_callback function was ever called.
Huge thanks for any provided ideas!