bboyadao opened 1 month ago
I cannot reproduce with mitmproxy:

```shell
$ mitmproxy --proxyauth "user:pass"
```
Slightly adapted sample spider:
```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "ex"
    start_urls = ["https://httpbin.org/get"]
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "PLAYWRIGHT_BROWSER_TYPE": "firefox",
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "headless": False,
            "timeout": 20 * 1000,
            "proxy": {
                "server": "127.0.0.1:8080",
                "username": "user",
                "password": "pass",
            },
        },
    }

    def start_requests(self):
        yield scrapy.Request(
            url=self.start_urls[0],
            callback=self.parse_detail,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_context_kwargs=dict(
                    java_script_enabled=True,
                    ignore_https_errors=True,
                ),
            ),
        )

    async def parse_detail(self, response):
        print(f"Received response from {response.url}")
        page = response.meta["playwright_page"]
        await page.close()
```
```
$ scrapy runspider proxy.py
(...)
2024-09-23 10:21:22 [scrapy.core.engine] INFO: Spider opened
2024-09-23 10:21:22 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-09-23 10:21:22 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-09-23 10:21:22 [scrapy-playwright] INFO: Starting download handler
2024-09-23 10:21:22 [scrapy-playwright] INFO: Starting download handler
2024-09-23 10:21:27 [scrapy-playwright] INFO: Launching browser firefox
2024-09-23 10:21:27 [scrapy-playwright] INFO: Browser firefox launched
2024-09-23 10:21:27 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False, remote=False)
2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://httpbin.org/get> (resource type: document)
2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Response: <407 https://httpbin.org/get>
2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://httpbin.org/get> (resource type: document)
2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://httpbin.org/get>
2024-09-23 10:21:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://httpbin.org/get> (referer: None) ['playwright']
Received response from https://httpbin.org/get
2024-09-23 10:21:29 [scrapy.core.engine] INFO: Closing spider (finished)
(...)
```
Which proxy are you using? Perhaps this is an interaction with that specific provider.
```
2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://httpbin.org/get> (resource type: document)
2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Response: <407 https://httpbin.org/get>
2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://httpbin.org/get> (resource type: document)
2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://httpbin.org/get>
2024-09-23 10:21:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://httpbin.org/get> (referer: None) ['playwright']
Received response from https://httpbin.org/get
```
I have some thoughts.
In my case Scrapy got a 407 and then marked the request as a failure.
I use https://scrapoxy.io to manage proxies.
- It looks like Scrapy got a 407 at first.
- The next request was handled by Playwright.
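For reference, the 407-then-200 pattern in the logs above is the normal Basic proxy-auth handshake: the browser first sends the request without credentials, the proxy answers 407 Proxy Authentication Required, and the browser retries with a Proxy-Authorization header. A minimal sketch of what the retried request carries, using the sample user:pass credentials from the spider above:

```python
import base64

# Build the Proxy-Authorization header value a browser sends on the
# retry after a 407, for the sample credentials from the spider.
credentials = base64.b64encode(b"user:pass").decode("ascii")
proxy_auth_header = f"Basic {credentials}"
print(proxy_auth_header)  # → Basic dXNlcjpwYXNz
```

If it is the second, authenticated attempt that fails, the credentials or the proxy endpoint are the first things to check.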
All requests were routed through Playwright, notice the "scrapy-playwright" logger name:
```
2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Response: <407 https://httpbin.org/get>
```
The provided spider works correctly with Scrapoxy. I've started it as indicated in their docs and I'm getting the following logs. There is a failure downloading the response, but that's reasonable because I did not add an actual proxy provider in the Scrapoxy configuration site.
```
2024-09-24 10:53:10 [scrapy.core.engine] INFO: Spider opened
2024-09-24 10:53:10 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-09-24 10:53:10 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-09-24 10:53:10 [scrapy-playwright] INFO: Starting download handler
2024-09-24 10:53:10 [scrapy-playwright] INFO: Starting download handler
2024-09-24 10:53:15 [scrapy-playwright] INFO: Launching browser firefox
2024-09-24 10:53:16 [scrapy-playwright] INFO: Browser firefox launched
2024-09-24 10:53:16 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False, remote=False)
2024-09-24 10:53:17 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2024-09-24 10:53:17 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://httpbin.org/get> (resource type: document)
2024-09-24 10:53:17 [scrapy-playwright] DEBUG: [Context=default] Response: <407 https://httpbin.org/get>
2024-09-24 10:53:17 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://httpbin.org/get> (resource type: document)
2024-09-24 10:53:17 [scrapy-playwright] DEBUG: [Context=default] Response: <557 https://httpbin.org/get>
2024-09-24 10:53:17 [scrapy.core.engine] DEBUG: Crawled (557) <GET https://httpbin.org/get> (referer: None) ['playwright']
2024-09-24 10:53:17 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <557 https://httpbin.org/get>: HTTP status code is not handled or not allowed
2024-09-24 10:53:17 [scrapy.core.engine] INFO: Closing spider (finished)
```
However, if I pass incorrect credentials I do get the reported message:
```
2024-09-24 10:53:37 [scrapy.core.engine] INFO: Spider opened
2024-09-24 10:53:37 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-09-24 10:53:37 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-09-24 10:53:37 [scrapy-playwright] INFO: Starting download handler
2024-09-24 10:53:37 [scrapy-playwright] INFO: Starting download handler
2024-09-24 10:53:42 [scrapy-playwright] INFO: Launching browser firefox
2024-09-24 10:53:42 [scrapy-playwright] INFO: Browser firefox launched
2024-09-24 10:53:43 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False, remote=False)
2024-09-24 10:53:43 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2024-09-24 10:53:43 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://httpbin.org/get> (resource type: document)
2024-09-24 10:53:43 [scrapy-playwright] DEBUG: [Context=default] Response: <407 https://httpbin.org/get>
2024-09-24 10:53:43 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://httpbin.org/get> (resource type: document)
2024-09-24 10:53:43 [scrapy-playwright] DEBUG: [Context=default] Response: <407 https://httpbin.org/get>
2024-09-24 10:53:43 [scrapy.core.scraper] ERROR: Error downloading <GET https://httpbin.org/get>
Traceback (most recent call last):
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/twisted/internet/defer.py", line 1999, in _inlineCallbacks
    result = context.run(
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/twisted/python/failure.py", line 519, in throwExceptionIntoGenerator
    return g.throw(self.value.with_traceback(self.tb))
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 54, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/twisted/internet/defer.py", line 1251, in adapt
    extracted: _SelfResultT | Failure = result.result()
  File "/.../scrapy-playwright/scrapy_playwright/handler.py", line 378, in _download_request
    return await self._download_request_with_retry(request=request, spider=spider)
  File "/.../scrapy-playwright/scrapy_playwright/handler.py", line 431, in _download_request_with_retry
    return await self._download_request_with_page(request, page, spider)
  File "/.../scrapy-playwright/scrapy_playwright/handler.py", line 460, in _download_request_with_page
    response, download = await self._get_response_and_download(request, page, spider)
  File "/.../scrapy-playwright/scrapy_playwright/handler.py", line 560, in _get_response_and_download
    response = await page.goto(url=request.url, **page_goto_kwargs)
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/async_api/_generated.py", line 8805, in goto
    await self._impl_obj.goto(
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/_impl/_page.py", line 524, in goto
    return await self._main_frame.goto(**locals_to_params(locals()))
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/_impl/_frame.py", line 145, in goto
    await self._channel.send("goto", locals_to_params(locals()))
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 59, in send
    return await self._connection.wrap_api_call(
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 514, in wrap_api_call
    raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
playwright._impl._errors.Error: Page.goto: NS_ERROR_PROXY_CONNECTION_REFUSED
Call log:
navigating to "https://httpbin.org/get", waiting until "load"
2024-09-24 10:53:43 [scrapy.core.engine] INFO: Closing spider (finished)
```
I also experienced NS_ERROR_PROXY_CONNECTION_REFUSED with Firefox. I'm pretty sure my proxy settings were right, but given the task at hand, my hunch is that this happens when the target blocks the proxy. I switched to Chromium just to test whether the same scraper works better, and I get no errors. It's quite slow, though. So, superficially, it seems that when the proxy gets blocked, scrapy-playwright knows how to recover and retry with Chromium, but fails with NS_ERROR_PROXY_CONNECTION_REFUSED with Firefox.
Update: With Chromium I get playwright._impl._errors.Error: Page.goto: net::ERR_INVALID_ARGUMENT instead 🤷‍♂️ So switching browsers doesn't help me, but perhaps this helps with figuring out what the actual underlying problem is.
I just created an example spider. Chromium works well, but with the setup below it raises NS_ERROR_PROXY_CONNECTION_REFUSED (playwright._impl._errors.Error: Page.goto: NS_ERROR_PROXY_CONNECTION_REFUSED).
I set a breakpoint in ScrapyPlaywrightDownloadHandler._maybe_launch_browser and captured the launch_options. Then I copied them into a plain Playwright script to test, and it works.
example_spider.py
test_with_playwright.py
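A plain-Playwright check of that kind might look like the sketch below. The concrete launch options are assumptions copied from the sample spider earlier in the thread, not the actual captured values; substitute your own launch_options.

```python
# Hypothetical standalone reproduction: feed launch options captured from
# ScrapyPlaywrightDownloadHandler._maybe_launch_browser to plain Playwright.
# The values below are assumptions taken from the sample spider, not the
# reporter's actual capture.
LAUNCH_OPTIONS = {
    "headless": False,
    "timeout": 20 * 1000,  # milliseconds
    "proxy": {
        "server": "127.0.0.1:8080",
        "username": "user",
        "password": "pass",
    },
}


def main() -> None:
    # Imported lazily so the options above can be inspected without
    # Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.firefox.launch(**LAUNCH_OPTIONS)
        page = browser.new_page()
        page.goto("https://httpbin.org/get")
        print(page.title())
        browser.close()


if __name__ == "__main__":
    main()
```

If this script succeeds while the spider fails, the difference most likely lies in how scrapy-playwright passes the options through, not in the options themselves.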