Open renatodvc opened 2 months ago
Please provide a minimal reproducible example, and also additional information about your environment (e.g. the output of `scrapy version -v`). This is to understand the situation better, because there is no direct handling of event loops except when running on Windows.
Hi @elacuesta, thank you for getting back to me.
Regarding the environment: I'm running Ubuntu 22.04.1 LTS on the VPS with `scrapy==2.11.2`. I've also reproduced the problem in other environments, but they were all Linux.
Unfortunately I don't know how to provide an MRE, given that the issue is intermittent and I can't pinpoint the cause. I can provide the spider (which is pretty standard: `start_requests` -> `parse` -> item) and the input used by `start_requests` (which is fairly large), but that wouldn't be "minimal".
So far there isn't a specific page on which this seems to happen; to reproduce, I just let the job run until the process unexpectedly dies (anywhere from minutes to hours). Any insight on how to better debug this, or on how I could provide more info, would be great.
BTW, I mentioned the event loop because it seems to be the cause of the problem in that particular linked issue; of course, it may not be related to this issue at all.
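As a generic way to get more information when a Python process dies without a traceback, the standard-library `faulthandler` module can dump tracebacks on fatal signals. A minimal sketch, unrelated to scrapy-playwright specifics (the `SIGUSR1` registration assumes a Unix system, which matches the Ubuntu environment above):

```python
import faulthandler
import signal

# Dump Python tracebacks to stderr on fatal signals (SIGSEGV, SIGFPE,
# SIGABRT, SIGBUS, SIGILL), which helps when a process dies abruptly
# without printing a normal Python traceback.
faulthandler.enable()

# Optional (Unix only): dump tracebacks of all live threads on demand
# with `kill -USR1 <pid>` -- useful for a hung-but-alive process.
faulthandler.register(signal.SIGUSR1)
```

Note this only helps if the interpreter itself receives the signal; if the kernel OOM killer sends SIGKILL, nothing can run, and `dmesg` would be the place to check instead.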
For some time I've suspected the request & response loggers might be causing some trouble by keeping references to the requests and responses longer than strictly necessary. This is by no means a tested hypothesis, but at this point I don't have any other explanation for this specific behavior. Would you be able to try the code from the `disable-request-response-logger` branch (5de5a52df10d5ee4c3c7404c7e73017644549b54) and set `PLAYWRIGHT_REQUEST_RESPONSE_LOGGER_ENABLED=False`?
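For anyone following along, one way to try that commit is to pip-install it directly from git; this is a sketch that assumes the `scrapy-plugins/scrapy-playwright` repository (the commit hash is the one from the comment above):

```shell
# Install scrapy-playwright pinned to the commit on the linked branch
pip install "git+https://github.com/scrapy-plugins/scrapy-playwright.git@5de5a52df10d5ee4c3c7404c7e73017644549b54"

# Then, in the project's settings.py:
#   PLAYWRIGHT_REQUEST_RESPONSE_LOGGER_ENABLED = False
```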
Thank you @elacuesta,
I've executed three jobs so far with the branch you linked. A quick summary of my findings:

- One job died with the same `node:events:496` error logged. However, this job was using `PLAYWRIGHT_ABORT_REQUEST`, and because of it the job kept logging the requests that were getting aborted. The other DEBUG request/response messages from scrapy-playwright weren't logged, as expected. E.g.:

    2024-09-11T21:45:54+0000 [Launcher,18/stderr] DEBUG:scrapy-playwright:[Context=default] Aborted Playwright request <GET https://bat.bing.com/bat.js>

- The other two jobs didn't use `PLAYWRIGHT_ABORT_REQUEST`, so there wouldn't be the aborting-request log message. In both jobs the crawl stalled, with logstats reporting no progress for hours:

    2024-09-17 06:13:42 [scrapy.extensions.logstats] INFO: Crawled 4001 pages (at 0 pages/min), scraped 6648 items (at 0 items/min)
    2024-09-17 06:14:42 [scrapy.extensions.logstats] INFO: Crawled 4001 pages (at 0 pages/min), scraped 6648 items (at 0 items/min)
    ...
    2024-09-17 15:53:42 [scrapy.extensions.logstats] INFO: Crawled 4001 pages (at 0 pages/min), scraped 6648 items (at 0 items/min)
    2024-09-17 15:54:42 [scrapy.extensions.logstats] INFO: Crawled 4001 pages (at 0 pages/min), scraped 6648 items (at 0 items/min)

- Even with `CLOSESPIDER_TIMEOUT_NO_ITEM` set, it logs the message but doesn't end the job:

    2024-09-17 10:11:41 [scrapy.extensions.closespider] INFO: Closing spider since no items were produced in the last 7200 seconds.
    2024-09-17 10:11:41 [scrapy.core.engine] INFO: Closing spider (closespider_timeout_no_item)
    2024-09-17 10:11:42 [scrapy.extensions.logstats] INFO: Crawled 4001 pages (at 0 pages/min), scraped 6648 items (at 0 items/min)
    ...
I've been debugging this problem for a while; it's intermittent, which makes it harder to reproduce.

When running some jobs with `scrapy-playwright`, the job gets abruptly terminated. If you observe the log of the job, it doesn't even acknowledge the termination, as it would in a SIGTERM case. The process apparently gets killed. As an example, for a simple spider (with `scrapy-playwright`) scraping webstaurant.com, here is how the log terminates (literally the last 3 lines).

I first noticed the problem when running the jobs with `scrapyd`, and here is what scrapyd logs when the problem happens. This is just for extra data; the problem is unrelated to `scrapyd`, since it's reproducible without it.

In all occurrences, the error that seems to be the cause is a node error. Reducing `PLAYWRIGHT_MAX_PAGES_PER_CONTEXT` and `PLAYWRIGHT_MAX_CONTEXTS` all the way to 1 had no effect.

Finally, I found two issues in `playwright-python` that bear some resemblance; the first one logs the same exception and is caused by the handling of the event loop.

https://github.com/microsoft/playwright-python/issues/2275
https://github.com/microsoft/playwright-python/issues/2454
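For reference, the two limits mentioned above can be pinned in the project's settings.py; a sketch with the values tried in this report:

```python
# settings.py -- scrapy-playwright limits tried while debugging
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 1  # at most one page open per browser context
PLAYWRIGHT_MAX_CONTEXTS = 1           # at most one browser context at a time
```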