scrapy-plugins / scrapy-playwright

🎭 Playwright integration for Scrapy
BSD 3-Clause "New" or "Revised" License

Unexpected termination of scrapy-playwright jobs; caused by node:events:496 #316

Open renatodvc opened 2 months ago

renatodvc commented 2 months ago

I've been debugging this problem for a while; it's intermittent, which makes it harder to reproduce.

When running some jobs with scrapy-playwright, the jobs get abruptly terminated. If you observe the job's log, it doesn't even acknowledge the termination, as it would in a SIGTERM case; the process apparently gets killed.

As an example, for a simple spider (with scrapy-playwright) scraping webstaurant.com, here is how the log terminates (literally the last 3 lines):

2024-09-02 23:15:59 [scrapy.core.scraper] WARNING: Dropped: Duplicate item found: {...}
2024-09-02 23:16:02 [scrapy.extensions.logstats] INFO: Crawled 163 pages (at 0 pages/min), scraped 173 items (at 0 items/min)
2024-09-02 23:16:17 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.webstaurantstore.com/10-strawberry-street-arctic-9c-arctic-blue-6-oz-porcelain-cup-pack/850ARCTIC9C.html>: HTTP status code is not handled or not allowed

I first noticed the problem when running the jobs with scrapyd; here is what scrapyd logs when the problem happens:

2024-07-31T18:07:33+0000 [-] Process died: exitstatus=None  project='{my_project}' spider='webstaurant' job='ff9296c04f6211ef87de0242ac110004' pid=18659 args=['/usr/local/bin/python', '-m', 'scrapyd.runner', 'crawl', 'webstaurant', '-a', '_job=ff9296c04f6211ef87de0242ac110004', '-s', 'LOG_FILE=/var/lib/scrapyd/logs/{my_project}/webstaurant/ff9296c04f6211ef87de0242ac110004.log']

This is just extra context; the problem is unrelated to scrapyd, since it's reproducible without it.

In all occurrences, the error that seems to be the cause is a Node error:

2024-09-02T23:16:29+0000 [Launcher,16/stderr] DEBUG:scrapy-playwright:[Context=default] Response: <200 https://assets.customer.io/assets/in-app.js>
2024-09-02T23:16:29+0000 [Launcher,16/stderr] DEBUG:scrapy-playwright:[Context=default] Request: <GET https://code.gist.build/web/latest/gist.min.js> (resource type: script, referrer: https://www.webstaurantstore.com/)
2024-09-02T23:16:29+0000 [Launcher,16/stderr] node:events:496
          throw er; // Unhandled 'error' event
          ^

    Error: write EPIPE
        at WriteWrap.onWriteComplete [as oncomplete] (node:internal/stream_base_commons:94:16)
    Emitted 'error' event on Socket instance at:
        at emitErrorNT (node:internal/streams/destroy:169:8)
        at emitErrorCloseNT (node:internal/streams/destroy:128:3)
        at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
      errno: -32,
      code: 'EPIPE',
      syscall: 'write'
    }

    Node.js v20.12.2
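For context (not stated in the thread): Node's `errno: -32` is EPIPE, i.e. a write to a pipe or socket whose other end has already closed — consistent with one side of the Playwright driver connection dying while the Node process is still writing to it. A minimal Python illustration of the same OS-level error:

```python
import errno
import os

# EPIPE means "write to a pipe whose read end has been closed".
# CPython ignores SIGPIPE at startup, so the failed write surfaces
# as a BrokenPipeError instead of killing the process.
r, w = os.pipe()
os.close(r)  # close the read end, leaving the pipe with no reader

raised = False
try:
    os.write(w, b"x")
except BrokenPipeError as e:
    raised = True
    assert e.errno == errno.EPIPE  # errno 32, reported by Node as -32
finally:
    os.close(w)
```

Node, by contrast, delivers this as an `'error'` event on the socket; with no listener attached, it becomes the unhandled `throw er; // Unhandled 'error' event` seen above.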

Finally, I found two issues in playwright-python that bear some resemblance; the first one appears to log the same exception and is caused by the handling of the event loop.

https://github.com/microsoft/playwright-python/issues/2275
https://github.com/microsoft/playwright-python/issues/2454

elacuesta commented 2 months ago

Please provide a minimal reproducible example, as well as additional information about your environment (e.g. the output of scrapy version -v). This is to understand the situation better, because there is no direct handling of event loops except when running on Windows.

renatodvc commented 2 months ago

Hi @elacuesta, thank you for getting back to me.

Regarding the environment, I'm running Ubuntu 22.04.1 LTS on the VPS and scrapy==2.11.2. I've also reproduced the problem in other environments, but they were all Linux.

Unfortunately I don't know how to provide an MRE, given that the issue is intermittent and I can't pinpoint the cause. I can provide the spider (which is pretty standard: start_requests -> parse -> item) and the input used by start_requests (which is fairly large), but that wouldn't be "minimal".

So far there isn't a specific page on which this seems to happen; to reproduce, I just let the spider run until the process unexpectedly dies (anywhere from minutes to hours). Any insight on how to better debug this, or on how I could provide more info, would be great.

BTW, I mentioned the event loop because it seems to be the cause of the problem in that particular linked issue; of course, it may not be related to this issue at all.

elacuesta commented 2 months ago

For some time I've suspected the request & response loggers might be causing some trouble by keeping references to the requests and responses longer than strictly necessary. This is by no means a tested hypothesis, but at this point I don't have any other explanation about this specific subject. Would you be able to try the code from the disable-request-response-logger branch (5de5a52df10d5ee4c3c7404c7e73017644549b54) and set PLAYWRIGHT_REQUEST_RESPONSE_LOGGER_ENABLED=False?
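For anyone following along: after installing that branch, the toggle goes in the project settings (the setting name is taken verbatim from the comment above; it only exists on that branch):

```python
# settings.py -- disable scrapy-playwright's per-request/response DEBUG logger
# (available only on the disable-request-response-logger branch, commit 5de5a52)
PLAYWRIGHT_REQUEST_RESPONSE_LOGGER_ENABLED = False
```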

renatodvc commented 2 months ago

Thank you @elacuesta,

I've executed three jobs so far with the branch you linked; a quick summary of my findings:

  1. It seems that the loggers are indeed related to the problem.
  2. I'll still run more tests, as my findings so far could just be a fluke.
  3. The jobs that didn't have early termination ended up stalling: not performing new requests, but also not ending the job. ATM I'm not sure whether there are no more requests in the scheduler, or whether there are but the spider can't perform them through Playwright for whatever reason.
  4. Of course, there is also the chance that the stalling is a completely different issue, unrelated to this one.

For more details:

  1. My first job had the same early termination issue, with the same node:events:496 error logged. However, this job was using PLAYWRIGHT_ABORT_REQUEST, and because of it the job kept logging the requests that were getting aborted. The other DEBUG request/response messages from scrapy-playwright weren't logged, as expected. E.g.:
2024-09-11T21:45:54+0000 [Launcher,18/stderr] DEBUG:scrapy-playwright:[Context=default] Aborted Playwright request <GET https://bat.bing.com/bat.js>
  2. Since your suspicion was related to the loggers, I ran the other 2 jobs without PLAYWRIGHT_ABORT_REQUEST, so there wouldn't be any aborted-request log messages. In both jobs: