scrapy-plugins / scrapy-playwright

🎭 Playwright integration for Scrapy
BSD 3-Clause "New" or "Revised" License
992 stars 108 forks source link

Improve concurrency on Windows (Get future result in a callback) #285

Closed elacuesta closed 3 months ago

elacuesta commented 3 months ago

Closes #282

Sample spider:

import scrapy

class DelayTest(scrapy.Spider):
    name = "delay"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            # "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_MAX_PAGES_PER_CONTEXT": 32,
        "CONCURRENT_REQUESTS_PER_IP": 32,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 32,
        "CONCURRENT_REQUESTS": 32,
        "PLAYWRIGHT_LAUNCH_OPTIONS": {"headless": False},
    }

    def start_requests(self):
        for i in range(32):
            yield scrapy.Request(
                url=f"https://httpbin.org/delay/1?i={i}",
                meta={"playwright": True},
            )

    def parse(self, response):
        print(response.url)

The current implementation gives 'playwright/page_count/max_concurrent': 1, with this patch I've obtained up to 6, with an average of 3. This is still low, I suspect it' because the calls are still blocking and what I'm doing with this patch is just split the waiting in two blocks (asyncio.run_coroutine_threadsafe(...) & future.result()) allowing for other requests to be processed in the middle.

Alternative

I could defer all calls to threads, however I'm not sure if that's a good idea. It seems to provide better performance, though also not as high as I'd expect (up to 10 concurrent pages, these docs are not working for me to increase the amount of threads).

diff --git a/scrapy_playwright/_utils.py b/scrapy_playwright/_utils.py
index bc93f14..06897f0 100644
--- a/scrapy_playwright/_utils.py
+++ b/scrapy_playwright/_utils.py
@@ -1,9 +1,8 @@
 import asyncio
-import concurrent
 import logging
 import platform
 import threading
-from typing import Awaitable, Iterator, Optional, Tuple, Union
+from typing import Any, Awaitable, Iterator, Optional, Tuple, Union

 import scrapy
 from playwright.async_api import Error, Page, Request, Response
@@ -98,6 +97,8 @@ async def _get_header_value(

 if platform.system() == "Windows":

+    from twisted.internet import threads
+
     class _WindowsAdapter:
         """Utility class to redirect coroutines to an asyncio event loop running
         in a different thread. This allows to use a ProactorEventLoop, which is
@@ -121,11 +122,11 @@ if platform.system() == "Windows":
             return cls.loop

         @classmethod
-        async def get_result(cls, coro) -> concurrent.futures.Future:
+        def get_result(cls, coro) -> Any:
             return asyncio.run_coroutine_threadsafe(coro=coro, loop=cls.get_event_loop()).result()

     def _deferred_from_coro(coro) -> Deferred:
-        return scrapy.utils.defer.deferred_from_coro(_WindowsAdapter.get_result(coro))
+        return threads.deferToThread(_WindowsAdapter.get_result, coro)

 else:
     _deferred_from_coro = scrapy.utils.defer.deferred_from_coro
diff --git a/tests/__init__.py b/tests/__init__.py
index 1fddfd8..05695ed 100644
--- a/tests/__init__.py
+++ b/tests/__init__.py
@@ -21,9 +21,9 @@ if platform.system() == "Windows":
             raise RuntimeError(f"{test_method} must be an async def method")

         @wraps(test_method)
-        async def wrapped(self, *args, **kwargs):
+        def wrapped(self, *args, **kwargs):
             logger.debug("Calling _WindowsAdapter.get_result for %r", self)
-            await _WindowsAdapter.get_result(test_method(self, *args, **kwargs))
+            _WindowsAdapter.get_result(test_method(self, *args, **kwargs))

         return wrapped

@Gallaecio @wRAR I'd appreciate if you could give your opinion in these two approaches, thanks in advance.

codecov[bot] commented 3 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 100.00%. Comparing base (030c8a1) to head (2ffbdcb).

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #285 +/- ## ========================================= Coverage 100.00% 100.00% ========================================= Files 6 6 Lines 517 518 +1 ========================================= + Hits 517 518 +1 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

Gallaecio commented 3 months ago

No strong opinion either way. Would it make sense to make this user-configurable?

elacuesta commented 3 months ago

Thanks for the comment @Gallaecio. I'm closing this in favor of #286, it's much more complex but it keeps the thread count low while providing good concurrency.