scrapy-plugins / scrapy-playwright

🎭 Playwright integration for Scrapy
BSD 3-Clause "New" or "Revised" License

PLAYWRIGHT_ABORT_REQUEST not working well when PLAYWRIGHT_BROWSER_TYPE is 'webkit' #269

Closed: partyspy closed this issue 4 months ago

partyspy commented 7 months ago

Environment

When PLAYWRIGHT_BROWSER_TYPE is set to 'chromium' (or left at the default) under macOS, there appears to be a memory leak as the number of crawled pages increases. No memory leak is observed under Linux.

When PLAYWRIGHT_BROWSER_TYPE is set to 'webkit' under macOS, the memory leak is gone, but the PLAYWRIGHT_ABORT_REQUEST callback fails to intercept most requests.

def should_abort_request(request):
        return (
            request.resource_type == "image"
            or ".jpg" in request.url
            or "ajax1" in request.url
            or "ajax2" in request.url
            or "ajax3" in request.url
        )
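
One way to rule out the predicate itself is to exercise it outside the browser. The following is only a sketch under stated assumptions, not part of scrapy-playwright: `fake_request` is a hypothetical helper that builds a stand-in object exposing just the two attributes the predicate reads (`resource_type` and `url`), and the example.com URLs are made up for illustration.

```python
from types import SimpleNamespace


def should_abort_request(request):
    # Same predicate as in the report above.
    return (
        request.resource_type == "image"
        or ".jpg" in request.url
        or "ajax1" in request.url
        or "ajax2" in request.url
        or "ajax3" in request.url
    )


def fake_request(url, resource_type):
    # Hypothetical stand-in for a Playwright request: it only carries the
    # two attributes that should_abort_request actually inspects.
    return SimpleNamespace(url=url, resource_type=resource_type)


samples = [
    fake_request("https://example.com/a.jpg", "image"),
    fake_request("https://example.com/api/ajax2?page=1", "xhr"),
    fake_request("https://example.com/style.css", "stylesheet"),
]

# True means the request would be aborted.
decisions = [should_abort_request(r) for r in samples]
print(decisions)  # [True, True, False]
```

If the predicate behaves as expected here, the remaining question is whether it is being called for every request under webkit, which the scrapy-playwright DEBUG log ("Request: ..." versus "Aborted Playwright request ...") should show.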

# Spider settings regarding playwright:

custom_settings = {
        'DOWNLOAD_HANDLERS' : {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        'TWISTED_REACTOR': "twisted.internet.asyncioreactor.AsyncioSelectorReactor",

        'PLAYWRIGHT_BROWSER_TYPE': "webkit", 
        'PLAYWRIGHT_ABORT_REQUEST': should_abort_request,
}

# The Request meta is set as:
meta={
    "playwright": True, 
    "playwright_page_goto_kwargs": {"wait_until": "networkidle"}
},

elacuesta commented 5 months ago

Sorry, I cannot reproduce.

$ scrapy version -v
Scrapy       : 2.11.2
lxml         : 4.9.3.0
libxml2      : 2.10.3
cssselect    : 1.2.0
parsel       : 1.8.1
w3lib        : 2.1.2
Twisted      : 24.3.0
Python       : 3.10.10 (main, Feb 16 2023, 02:58:25) [Clang 14.0.0 (clang-1400.0.29.202)]
pyOpenSSL    : 23.2.0 (OpenSSL 3.1.2 1 Aug 2023)
cryptography : 41.0.3
Platform     : macOS-14.4.1-x86_64-i386-64bit

$ python -c "import scrapy_playwright; print(scrapy_playwright.__version__)"
0.0.35

import scrapy

def should_abort_request(request):
    return request.resource_type == "image" or ".jpg" in request.url

class ExampleSpider(scrapy.Spider):
    name = "example"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_BROWSER_TYPE": "webkit",
        "PLAYWRIGHT_ABORT_REQUEST": should_abort_request,
    }

    def start_requests(self):
        yield scrapy.Request(
            url="https://books.toscrape.com",
            meta={
                "playwright": True,
                "playwright_page_goto_kwargs": {"wait_until": "networkidle"},
            },
        )

    def parse(self, response):
        yield {"url": response.url}

(...)
2024-06-03 22:17:29 [scrapy.core.engine] INFO: Spider opened
2024-06-03 22:17:29 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-06-03 22:17:29 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-06-03 22:17:29 [scrapy-playwright] INFO: Starting download handler
2024-06-03 22:17:34 [scrapy-playwright] INFO: Launching browser webkit
2024-06-03 22:17:34 [scrapy-playwright] INFO: Browser webkit launched
2024-06-03 22:17:35 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False, remote=False)
2024-06-03 22:17:35 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2024-06-03 22:17:35 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://books.toscrape.com/> (resource type: document)
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://books.toscrape.com/>
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://books.toscrape.com/static/oscar/css/styles.css> (resource type: stylesheet, referrer: https://books.toscrape.com/)
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://books.toscrape.com/static/oscar/js/bootstrap-datetimepicker/bootstrap-datetimepicker.css> (resource type: stylesheet, referrer: https://books.toscrape.com/)
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://books.toscrape.com/static/oscar/css/datetimepicker.css> (resource type: stylesheet, referrer: https://books.toscrape.com/)
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg> (resource type: image, referrer: https://books.toscrape.com/)
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://books.toscrape.com/media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg> (resource type: image, referrer: https://books.toscrape.com/)
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Aborted Playwright request <GET https://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg>
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://books.toscrape.com/media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg> (resource type: image, referrer: https://books.toscrape.com/)
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Aborted Playwright request <GET https://books.toscrape.com/media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg>
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://books.toscrape.com/media/cache/32/51/3251cf3a3412f53f339e42cac2134093.jpg> (resource type: image, referrer: https://books.toscrape.com/)
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Aborted Playwright request <GET https://books.toscrape.com/media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg>
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://books.toscrape.com/media/cache/be/a5/bea5697f2534a2f86a3ef27b5a8c12a6.jpg> (resource type: image, referrer: https://books.toscrape.com/)
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Aborted Playwright request <GET https://books.toscrape.com/media/cache/32/51/3251cf3a3412f53f339e42cac2134093.jpg>
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://books.toscrape.com/media/cache/68/33/68339b4c9bc034267e1da611ab3b34f8.jpg> (resource type: image, referrer: https://books.toscrape.com/)
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Aborted Playwright request <GET https://books.toscrape.com/media/cache/be/a5/bea5697f2534a2f86a3ef27b5a8c12a6.jpg>
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://books.toscrape.com/media/cache/92/27/92274a95b7c251fea59a2b8a78275ab4.jpg> (resource type: image, referrer: https://books.toscrape.com/)
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Aborted Playwright request <GET https://books.toscrape.com/media/cache/68/33/68339b4c9bc034267e1da611ab3b34f8.jpg>
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://books.toscrape.com/media/cache/3d/54/3d54940e57e662c4dd1f3ff00c78cc64.jpg> (resource type: image, referrer: https://books.toscrape.com/)
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Aborted Playwright request <GET https://books.toscrape.com/media/cache/92/27/92274a95b7c251fea59a2b8a78275ab4.jpg>
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://books.toscrape.com/media/cache/66/88/66883b91f6804b2323c8369331cb7dd1.jpg> (resource type: image, referrer: https://books.toscrape.com/)
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Aborted Playwright request <GET https://books.toscrape.com/media/cache/3d/54/3d54940e57e662c4dd1f3ff00c78cc64.jpg>
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Aborted Playwright request <GET https://books.toscrape.com/media/cache/66/88/66883b91f6804b2323c8369331cb7dd1.jpg>
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://books.toscrape.com/media/cache/58/46/5846057e28022268153beff6d352b06c.jpg> (resource type: image, referrer: https://books.toscrape.com/)
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://books.toscrape.com/media/cache/be/f4/bef44da28c98f905a3ebec0b87be8530.jpg> (resource type: image, referrer: https://books.toscrape.com/)
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://books.toscrape.com/media/cache/10/48/1048f63d3b5061cd2f424d20b3f9b666.jpg> (resource type: image, referrer: https://books.toscrape.com/)
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://books.toscrape.com/media/cache/5b/88/5b88c52633f53cacf162c15f4f823153.jpg> (resource type: image, referrer: https://books.toscrape.com/)
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://books.toscrape.com/media/cache/94/b1/94b1b8b244bce9677c2f29ccc890d4d2.jpg> (resource type: image, referrer: https://books.toscrape.com/)
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://books.toscrape.com/media/cache/81/c4/81c4a973364e17d01f217e1188253d5e.jpg> (resource type: image, referrer: https://books.toscrape.com/)
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Aborted Playwright request <GET https://books.toscrape.com/media/cache/58/46/5846057e28022268153beff6d352b06c.jpg>
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Aborted Playwright request <GET https://books.toscrape.com/media/cache/be/f4/bef44da28c98f905a3ebec0b87be8530.jpg>
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Aborted Playwright request <GET https://books.toscrape.com/media/cache/10/48/1048f63d3b5061cd2f424d20b3f9b666.jpg>
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Aborted Playwright request <GET https://books.toscrape.com/media/cache/5b/88/5b88c52633f53cacf162c15f4f823153.jpg>
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://books.toscrape.com/media/cache/54/60/54607fe8945897cdcced0044103b10b6.jpg> (resource type: image, referrer: https://books.toscrape.com/)
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Aborted Playwright request <GET https://books.toscrape.com/media/cache/94/b1/94b1b8b244bce9677c2f29ccc890d4d2.jpg>
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://books.toscrape.com/media/cache/55/33/553310a7162dfbc2c6d19a84da0df9e1.jpg> (resource type: image, referrer: https://books.toscrape.com/)
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://books.toscrape.com/media/cache/09/a3/09a3aef48557576e1a85ba7efea8ecb7.jpg> (resource type: image, referrer: https://books.toscrape.com/)
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Aborted Playwright request <GET https://books.toscrape.com/media/cache/81/c4/81c4a973364e17d01f217e1188253d5e.jpg>
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Aborted Playwright request <GET https://books.toscrape.com/media/cache/54/60/54607fe8945897cdcced0044103b10b6.jpg>
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://books.toscrape.com/media/cache/0b/bc/0bbcd0a6f4bcd81ccb1049a52736406e.jpg> (resource type: image, referrer: https://books.toscrape.com/)
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Aborted Playwright request <GET https://books.toscrape.com/media/cache/55/33/553310a7162dfbc2c6d19a84da0df9e1.jpg>
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://books.toscrape.com/media/cache/27/a5/27a53d0bb95bdd88288eaf66c9230d7e.jpg> (resource type: image, referrer: https://books.toscrape.com/)
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://books.toscrape.com/static/oscar/js/bootstrap3/bootstrap.min.js> (resource type: script, referrer: https://books.toscrape.com/)
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://books.toscrape.com/static/oscar/js/oscar/ui.js> (resource type: script, referrer: https://books.toscrape.com/)
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://books.toscrape.com/static/oscar/js/bootstrap-datetimepicker/bootstrap-datetimepicker.js> (resource type: script, referrer: https://books.toscrape.com/)
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://books.toscrape.com/static/oscar/js/bootstrap-datetimepicker/locales/bootstrap-datetimepicker.all.js> (resource type: script, referrer: https://books.toscrape.com/)
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://books.toscrape.com/static/oscar/css/styles.css>
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://books.toscrape.com/static/oscar/css/datetimepicker.css>
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://books.toscrape.com/static/oscar/js/bootstrap-datetimepicker/bootstrap-datetimepicker.css>
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://books.toscrape.com/static/oscar/js/jquery/jquery-1.9.1.min.js> (resource type: script, referrer: https://books.toscrape.com/)
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Aborted Playwright request <GET https://books.toscrape.com/media/cache/09/a3/09a3aef48557576e1a85ba7efea8ecb7.jpg>
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Aborted Playwright request <GET https://books.toscrape.com/media/cache/0b/bc/0bbcd0a6f4bcd81ccb1049a52736406e.jpg>
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://books.toscrape.com/static/oscar/fonts/fontawesome-webfont.woff%3Fv=3.2.1> (resource type: font, referrer: https://books.toscrape.com/)
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Aborted Playwright request <GET https://books.toscrape.com/media/cache/27/a5/27a53d0bb95bdd88288eaf66c9230d7e.jpg>
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://books.toscrape.com/static/oscar/js/bootstrap3/bootstrap.min.js>
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://books.toscrape.com/static/oscar/js/bootstrap-datetimepicker/bootstrap-datetimepicker.js>
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://books.toscrape.com/static/oscar/js/oscar/ui.js>
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://books.toscrape.com/static/oscar/js/bootstrap-datetimepicker/locales/bootstrap-datetimepicker.all.js>
2024-06-03 22:17:36 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://books.toscrape.com/static/oscar/fonts/fontawesome-webfont.woff%3Fv=3.2.1>
2024-06-03 22:17:37 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://books.toscrape.com/static/oscar/js/jquery/jquery-1.9.1.min.js>
2024-06-03 22:17:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com> (referer: None) ['playwright']
2024-06-03 22:17:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://books.toscrape.com/>
{'url': 'https://books.toscrape.com/'}
2024-06-03 22:17:37 [scrapy.core.engine] INFO: Closing spider (finished)
2024-06-03 22:17:37 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 219,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 51287,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 8.153309,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2024, 6, 4, 1, 17, 37, 725425, tzinfo=datetime.timezone.utc),
 'item_scraped_count': 1,
 'log_count/DEBUG': 67,
 'log_count/INFO': 13,
 'log_count/WARNING': 1,
 'memusage/max': 57114624,
 'memusage/startup': 57110528,
 'playwright/context_count': 1,
 'playwright/context_count/max_concurrent': 1,
 'playwright/context_count/persistent/False': 1,
 'playwright/context_count/remote/False': 1,
 'playwright/page_count': 1,
 'playwright/page_count/closed': 1,
 'playwright/page_count/max_concurrent': 1,
 'playwright/request_count': 30,
 'playwright/request_count/aborted': 20,
 'playwright/request_count/method/GET': 30,
 'playwright/request_count/navigation': 1,
 'playwright/request_count/resource_type/document': 1,
 'playwright/request_count/resource_type/font': 1,
 'playwright/request_count/resource_type/image': 20,
 'playwright/request_count/resource_type/script': 5,
 'playwright/request_count/resource_type/stylesheet': 3,
 'playwright/response_count': 10,
 'playwright/response_count/method/GET': 10,
 'playwright/response_count/resource_type/document': 1,
 'playwright/response_count/resource_type/font': 1,
 'playwright/response_count/resource_type/script': 5,
 'playwright/response_count/resource_type/stylesheet': 3,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2024, 6, 4, 1, 17, 29, 572116, tzinfo=datetime.timezone.utc)}
2024-06-03 22:17:37 [scrapy.core.engine] INFO: Spider closed (finished)
2024-06-03 22:17:37 [scrapy-playwright] INFO: Closing download handler
2024-06-03 22:17:37 [scrapy-playwright] DEBUG: Browser context closed: 'default' (persistent=False, remote=False)
2024-06-03 22:17:37 [scrapy-playwright] INFO: Closing browser

Notice the "Aborted Playwright request" log lines and the 'playwright/request_count/aborted': 20, entry in the job stats.
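
The numbers in that stats dump can be cross-checked against each other. A minimal sketch of the arithmetic follows, assuming a `stats` dict that is a hand-copied subset of the dump above (it is not read from a live crawl). The relation "requests minus aborted equals responses" holds for this run; it is not claimed to hold in general, since requests can also fail for other reasons.

```python
# Values copied by hand from the stats dump in this comment.
stats = {
    "playwright/request_count": 30,
    "playwright/request_count/aborted": 20,
    "playwright/request_count/resource_type/image": 20,
    "playwright/response_count": 10,
}

# Requests that were not aborted.
completed = (
    stats["playwright/request_count"]
    - stats["playwright/request_count/aborted"]
)

# In this run every image request was aborted, and every remaining
# request produced a response.
all_images_aborted = (
    stats["playwright/request_count/aborted"]
    == stats["playwright/request_count/resource_type/image"]
)
responses_match = completed == stats["playwright/response_count"]

print(completed, all_images_aborted, responses_match)  # 10 True True
```

The absence of a `playwright/response_count/resource_type/image` key in the dump points the same way: no image request produced a response.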