scrapy-plugins / scrapy-playwright

🎭 Playwright integration for Scrapy

Firefox does not work with proxy. #320

Open · bboyadao opened this issue 1 month ago

bboyadao commented 1 month ago

I just created an example spider. Chromium works well, but with the setup below it raises playwright._impl._errors.Error: Page.goto: NS_ERROR_PROXY_CONNECTION_REFUSED.

I debugged into ScrapyPlaywrightDownloadHandler._maybe_launch_browser and captured the launch_options used here:

async def _maybe_launch_browser(self) -> None:
    async with self.browser_launch_lock:
        if not hasattr(self, "browser"):
            logger.info("Launching browser %s", self.browser_type.name)
            self.browser = await self.browser_type.launch(**self.config.launch_options)
            logger.info("Browser %s launched", self.browser_type.name)
            self.stats.inc_value("playwright/browser_count")
            self.browser.on("disconnected", self._browser_disconnected_callback)

I then copied those launch options into a standalone Playwright script to test, and it works.

example_spider.py

import scrapy
from rich import print

class ExampleSpider(scrapy.Spider):
    name = "ex"
    start_urls = ["https://httpbin.org/get"]
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_BROWSER_TYPE": "firefox",
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "headless": False,
            "timeout": 20 * 1000,
            'proxy': {
                'server': '127.0.0.1:8888',
                'username': 'username',
                'password': 'password'
            }
        },
    }

    def start_requests(self):
        yield scrapy.Request(
            url=self.start_urls[0],
            callback=self.parse_detail,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_context_kwargs=dict(
                    java_script_enabled=True,
                    ignore_https_errors=True,
                ),
            )
        )

    async def parse_detail(self, response):
        print(f"Received response from {response.url}")
        yield {}

test_with_playwright.py

import asyncio

from playwright.async_api import async_playwright

async def run_playwright_with_proxy():
    kwargs = {
        'headless': False, 
        'timeout': 20000,
        'proxy': {
            'server': '127.0.0.1:8888',
            'username': 'username',
            'password': 'password'
        }
    }

    async with async_playwright() as p:
        browser = await p.firefox.launch(**kwargs)
        page = await browser.new_page()
        await page.goto("https://httpbin.org/get")
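        # Long pause, presumably to keep the headful browser window open for manual inspection.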
        await asyncio.sleep(100)
        print("Page Title:", await page.title())
        await browser.close()

if __name__ == "__main__":
    asyncio.run(run_playwright_with_proxy())

elacuesta commented 1 month ago

I can not reproduce with mitmproxy:

$ mitmproxy --proxyauth "user:pass"

[Screenshot at 2024-09-23 10-21-46]

Slightly adapted sample spider:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "ex"
    start_urls = ["https://httpbin.org/get"]
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "PLAYWRIGHT_BROWSER_TYPE": "firefox",
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "headless": False,
            "timeout": 20 * 1000,
            "proxy": {
                "server": "127.0.0.1:8080",
                "username": "user",
                "password": "pass",
            },
        },
    }

    def start_requests(self):
        yield scrapy.Request(
            url=self.start_urls[0],
            callback=self.parse_detail,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_context_kwargs=dict(
                    java_script_enabled=True,
                    ignore_https_errors=True,
                ),
            )
        )

    async def parse_detail(self, response):
        print(f"Received response from {response.url}")
        page = response.meta["playwright_page"]
        await page.close()

$ scrapy runspider proxy.py
(...)
2024-09-23 10:21:22 [scrapy.core.engine] INFO: Spider opened
2024-09-23 10:21:22 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-09-23 10:21:22 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-09-23 10:21:22 [scrapy-playwright] INFO: Starting download handler
2024-09-23 10:21:22 [scrapy-playwright] INFO: Starting download handler
2024-09-23 10:21:27 [scrapy-playwright] INFO: Launching browser firefox
2024-09-23 10:21:27 [scrapy-playwright] INFO: Browser firefox launched
2024-09-23 10:21:27 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False, remote=False)
2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://httpbin.org/get> (resource type: document)
2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Response: <407 https://httpbin.org/get>
2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://httpbin.org/get> (resource type: document)
2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://httpbin.org/get>
2024-09-23 10:21:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://httpbin.org/get> (referer: None) ['playwright']
Received response from https://httpbin.org/get
2024-09-23 10:21:29 [scrapy.core.engine] INFO: Closing spider (finished)
(...)

Which proxy are you using? Perhaps this is an interaction with that specific provider.
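
As a browser-independent sanity check, something like the following should show whether the proxy accepts the credentials at all (a minimal sketch assuming the same hypothetical host and credentials as in the launch options, and that the third-party requests package is installed):

import requests

# Hypothetical values matching the PLAYWRIGHT_LAUNCH_OPTIONS above.
proxy = "http://username:password@127.0.0.1:8888"

resp = requests.get(
    "https://httpbin.org/get",
    proxies={"http": proxy, "https": proxy},
    timeout=20,
)
print(resp.status_code, resp.json().get("origin"))

A 407 here would point at the credentials or the provider rather than at Firefox or scrapy-playwright.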

bboyadao commented 1 month ago

2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://httpbin.org/get> (resource type: document)
2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Response: <407 https://httpbin.org/get>
2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://httpbin.org/get> (resource type: document)
2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://httpbin.org/get>
2024-09-23 10:21:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://httpbin.org/get> (referer: None) ['playwright']
Received response from https://httpbin.org/get

I have some thoughts:

In my case it looks like Scrapy got the 407 and then marked the request as a failure.

I use https://scrapoxy.io to manage proxies.

elacuesta commented 1 month ago

  • Looks like Scrapy got a 407 at first.
  • The next request was handled by Playwright.

All requests were routed through Playwright; notice the "scrapy-playwright" logger name:

2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Response: <407 https://httpbin.org/get>

The provided spider works correctly with Scrapoxy. I started it as indicated in their docs and I'm getting the following logs. There is a failure downloading the response, but that's expected because I did not add an actual proxy provider in the Scrapoxy configuration.

2024-09-24 10:53:10 [scrapy.core.engine] INFO: Spider opened
2024-09-24 10:53:10 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-09-24 10:53:10 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-09-24 10:53:10 [scrapy-playwright] INFO: Starting download handler
2024-09-24 10:53:10 [scrapy-playwright] INFO: Starting download handler
2024-09-24 10:53:15 [scrapy-playwright] INFO: Launching browser firefox
2024-09-24 10:53:16 [scrapy-playwright] INFO: Browser firefox launched
2024-09-24 10:53:16 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False, remote=False)
2024-09-24 10:53:17 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2024-09-24 10:53:17 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://httpbin.org/get> (resource type: document)
2024-09-24 10:53:17 [scrapy-playwright] DEBUG: [Context=default] Response: <407 https://httpbin.org/get>
2024-09-24 10:53:17 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://httpbin.org/get> (resource type: document)
2024-09-24 10:53:17 [scrapy-playwright] DEBUG: [Context=default] Response: <557 https://httpbin.org/get>
2024-09-24 10:53:17 [scrapy.core.engine] DEBUG: Crawled (557) <GET https://httpbin.org/get> (referer: None) ['playwright']
2024-09-24 10:53:17 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <557 https://httpbin.org/get>: HTTP status code is not handled or not allowed
2024-09-24 10:53:17 [scrapy.core.engine] INFO: Closing spider (finished)

However, if I pass incorrect credentials I do get the reported message:

2024-09-24 10:53:37 [scrapy.core.engine] INFO: Spider opened
2024-09-24 10:53:37 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-09-24 10:53:37 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-09-24 10:53:37 [scrapy-playwright] INFO: Starting download handler
2024-09-24 10:53:37 [scrapy-playwright] INFO: Starting download handler
2024-09-24 10:53:42 [scrapy-playwright] INFO: Launching browser firefox
2024-09-24 10:53:42 [scrapy-playwright] INFO: Browser firefox launched
2024-09-24 10:53:43 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False, remote=False)
2024-09-24 10:53:43 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2024-09-24 10:53:43 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://httpbin.org/get> (resource type: document)
2024-09-24 10:53:43 [scrapy-playwright] DEBUG: [Context=default] Response: <407 https://httpbin.org/get>
2024-09-24 10:53:43 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://httpbin.org/get> (resource type: document)
2024-09-24 10:53:43 [scrapy-playwright] DEBUG: [Context=default] Response: <407 https://httpbin.org/get>
2024-09-24 10:53:43 [scrapy.core.scraper] ERROR: Error downloading <GET https://httpbin.org/get>
Traceback (most recent call last):
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/twisted/internet/defer.py", line 1999, in _inlineCallbacks
    result = context.run(
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/twisted/python/failure.py", line 519, in throwExceptionIntoGenerator
    return g.throw(self.value.with_traceback(self.tb))
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 54, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/twisted/internet/defer.py", line 1251, in adapt
    extracted: _SelfResultT | Failure = result.result()
  File "/.../scrapy-playwright/scrapy_playwright/handler.py", line 378, in _download_request
    return await self._download_request_with_retry(request=request, spider=spider)
  File "/.../scrapy-playwright/scrapy_playwright/handler.py", line 431, in _download_request_with_retry
    return await self._download_request_with_page(request, page, spider)
  File "/.../scrapy-playwright/scrapy_playwright/handler.py", line 460, in _download_request_with_page
    response, download = await self._get_response_and_download(request, page, spider)
  File "/.../scrapy-playwright/scrapy_playwright/handler.py", line 560, in _get_response_and_download
    response = await page.goto(url=request.url, **page_goto_kwargs)
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/async_api/_generated.py", line 8805, in goto
    await self._impl_obj.goto(
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/_impl/_page.py", line 524, in goto
    return await self._main_frame.goto(**locals_to_params(locals()))
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/_impl/_frame.py", line 145, in goto
    await self._channel.send("goto", locals_to_params(locals()))
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 59, in send
    return await self._connection.wrap_api_call(
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 514, in wrap_api_call
    raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
playwright._impl._errors.Error: Page.goto: NS_ERROR_PROXY_CONNECTION_REFUSED
Call log:
navigating to "https://httpbin.org/get", waiting until "load"

2024-09-24 10:53:43 [scrapy.core.engine] INFO: Closing spider (finished)
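
Since the failed navigation surfaces as a regular download error on the Scrapy side (the scrapy.core.scraper ERROR above), it can at least be observed per request with a standard Scrapy errback. A minimal sketch of the relevant spider parts, with handle_error as a hypothetical name:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "ex"
    start_urls = ["https://httpbin.org/get"]
    # ... same custom_settings as above ...

    def start_requests(self):
        yield scrapy.Request(
            url=self.start_urls[0],
            callback=self.parse_detail,
            errback=self.handle_error,
            meta={"playwright": True},
        )

    async def parse_detail(self, response):
        print(f"Received response from {response.url}")

    def handle_error(self, failure):
        # failure.value wraps the error raised by the download handler, e.g.
        # playwright._impl._errors.Error: Page.goto: NS_ERROR_PROXY_CONNECTION_REFUSED
        self.logger.error("Navigation failed: %r", failure.value)

This doesn't fix the 407 loop, but it makes the proxy failure visible per request instead of only in the crawler log.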

honzajavorek commented 1 month ago

I also experienced NS_ERROR_PROXY_CONNECTION_REFUSED with Firefox. I'm pretty sure my proxy settings were right, but given the task at hand, my hunch is that this happens when the target blocks the proxy. I switched to Chromium just to test whether the same scraper works better, and I get no errors. It's quite slow, though, so superficially it seems that when the proxy gets blocked, scrapy-playwright can recover and retry under Chromium, but fails with NS_ERROR_PROXY_CONNECTION_REFUSED under Firefox.

Update: With Chromium I get playwright._impl._errors.Error: Page.goto: net::ERR_INVALID_ARGUMENT instead 🤷‍♂️ So switching browsers doesn't help me after all, but perhaps this helps with figuring out the actual underlying problem.