scrapy-plugins / scrapy-playwright

🎭 Playwright integration for Scrapy

Page.goto: net::ERR_INVALID_ARGUMENT #327

Open junior-g opened 2 weeks ago

junior-g commented 2 weeks ago

I am getting the following error with my basic Scrapy + Playwright spider:

Request: <GET https://www.croma.com/robots.txt> (resource type: document)
2024-11-12 17:41:00 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET https://www.croma.com/robots.txt>: Page.goto: net::ERR_INVALID_ARGUMENT at https://www.croma.com/robots.txt
Call log:
navigating to "https://www.croma.com/robots.txt", waiting until "load"
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.9/site-packages/twisted/internet/defer.py", line 2013, in _inlineCallbacks
    result = context.run(
  File "/home/ubuntu/.local/lib/python3.9/site-packages/twisted/python/failure.py", line 467, in throwExceptionIntoGenerator
    return g.throw(self.value.with_traceback(self.tb))
  File "/home/ubuntu/.local/lib/python3.9/site-packages/scrapy/core/downloader/middleware.py", line 54, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/home/ubuntu/.local/lib/python3.9/site-packages/twisted/internet/defer.py", line 1253, in adapt
    extracted: _SelfResultT | Failure = result.result()
  File "/home/ubuntu/.local/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 379, in _download_request
    return await self._download_request_with_retry(request=request, spider=spider)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 432, in _download_request_with_retry
    return await self._download_request_with_page(request, page, spider)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 461, in _download_request_with_page
    response, download = await self._get_response_and_download(request, page, spider)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 563, in _get_response_and_download
    response = await page.goto(url=request.url, **page_goto_kwargs)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/playwright/async_api/_generated.py", line 8818, in goto
    await self._impl_obj.goto(
  File "/home/ubuntu/.local/lib/python3.9/site-packages/playwright/_impl/_page.py", line 524, in goto
    return await self._main_frame.goto(**locals_to_params(locals()))
  File "/home/ubuntu/.local/lib/python3.9/site-packages/playwright/_impl/_frame.py", line 145, in goto
    await self._channel.send("goto", locals_to_params(locals()))
  File "/home/ubuntu/.local/lib/python3.9/site-packages/playwright/_impl/_connection.py", line 59, in send
    return await self._connection.wrap_api_call(
  File "/home/ubuntu/.local/lib/python3.9/site-packages/playwright/_impl/_connection.py", line 514, in wrap_api_call
    raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
playwright._impl._errors.Error: Page.goto: net::ERR_INVALID_ARGUMENT at https://www.croma.com/robots.txt
Call log:
navigating to "https://www.croma.com/robots.txt", waiting until "load"

Request  ------  <GET https://www.croma.com/iffalcon-q73-126-cm-50-inch-4k-ultra-hd-qled-google-tv-with-dolby-audio-2023-model-/p/307880>
Request Headers:  <CaseInsensitiveDict: {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7', 'Accept-Language': 'en-US,en;q=0.9', 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36', 'Upgrade-Insecure-Requests': '1', 'Accept-Encoding': 'gzip, deflate, br, zstd', 'Connection': 'keep-alive', 'Host': 'www.croma.com', 'Sec-Ch-Ua': '"Chromium";v="130", "Google Chrome";v="130", "Not?A_Brand";v="99"', 'Sec-Ch-Ua-Mobile': '?0', 'Sec-Ch-Ua-Platform': '"macOS"', 'Sec-Fetch-Dest': 'document', 'Sec-Fetch-User': '?1'}>
2024-11-12 17:41:00 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 2 (2 for all contexts)
2024-11-12 17:41:00 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://www.croma.com/iffalcon-q73-126-cm-50-inch-4k-ultra-hd-qled-google-tv-with-dolby-audio-2023-model-/p/307880> (resource type: document)
2024-11-12 17:41:00 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.croma.com/iffalcon-q73-126-cm-50-inch-4k-ultra-hd-qled-google-tv-with-dolby-audio-2023-model-/p/307880>
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.9/site-packages/twisted/internet/defer.py", line 2013, in _inlineCallbacks
    result = context.run(
  File "/home/ubuntu/.local/lib/python3.9/site-packages/twisted/python/failure.py", line 467, in throwExceptionIntoGenerator
    return g.throw(self.value.with_traceback(self.tb))
  File "/home/ubuntu/.local/lib/python3.9/site-packages/scrapy/core/downloader/middleware.py", line 54, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/home/ubuntu/.local/lib/python3.9/site-packages/twisted/internet/defer.py", line 1253, in adapt
    extracted: _SelfResultT | Failure = result.result()
  File "/home/ubuntu/.local/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 379, in _download_request
    return await self._download_request_with_retry(request=request, spider=spider)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 432, in _download_request_with_retry
    return await self._download_request_with_page(request, page, spider)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 461, in _download_request_with_page
    response, download = await self._get_response_and_download(request, page, spider)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 563, in _get_response_and_download
    response = await page.goto(url=request.url, **page_goto_kwargs)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/playwright/async_api/_generated.py", line 8818, in goto
    await self._impl_obj.goto(
  File "/home/ubuntu/.local/lib/python3.9/site-packages/playwright/_impl/_page.py", line 524, in goto
    return await self._main_frame.goto(**locals_to_params(locals()))
  File "/home/ubuntu/.local/lib/python3.9/site-packages/playwright/_impl/_frame.py", line 145, in goto
    await self._channel.send("goto", locals_to_params(locals()))
  File "/home/ubuntu/.local/lib/python3.9/site-packages/playwright/_impl/_connection.py", line 59, in send
    return await self._connection.wrap_api_call(
  File "/home/ubuntu/.local/lib/python3.9/site-packages/playwright/_impl/_connection.py", line 514, in wrap_api_call
    raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
playwright._impl._errors.Error: Page.goto: net::ERR_INVALID_ARGUMENT at https://www.croma.com/iffalcon-q73-126-cm-50-inch-4k-ultra-hd-qled-google-tv-with-dolby-audio-2023-model-/p/307880
Call log:
navigating to "https://www.croma.com/iffalcon-q73-126-cm-50-inch-4k-ultra-hd-qled-google-tv-with-dolby-audio-2023-model-/p/307880", waiting until "load"
# CromaSpider
import scrapy

class CromaSpider(scrapy.Spider):
    name = "croma"
    allowed_domains = ["www.croma.com"]
    start_urls = ["https://www.croma.com/iffalcon-q73-126-cm-50-inch-4k-ultra-hd-qled-google-tv-with-dolby-audio-2023-model-/p/307880"]

    async def parse(self, response):
        # Page object is available because playwright_include_page is set in the middleware
        page = response.meta["playwright_page"]
        await page.close()
        print("Response parsing")
        print(response.xpath('/html/body/main/div/div[3]/div/div[1]/div[2]/div[1]/div/div/div/div[3]/div/ul/li').get())

# middleware.py
request.meta["playwright"] = True
request.meta["playwright_include_page"] = True
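For reference, a minimal sketch of how the middleware sets these keys (only the two meta assignments are from my actual middleware; the surrounding class body is assumed, with the class name taken from the settings below):

# middleware.py (sketch)
class Scrapy2CrawlServiceDownloaderMiddleware:
    def process_request(self, request, spider):
        # Route this request through the Playwright download handler
        request.meta["playwright"] = True
        # Make the Playwright page available as response.meta["playwright_page"]
        request.meta["playwright_include_page"] = True
        return None  # let the download handler fetch the request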

# settings.py
DOWNLOADER_MIDDLEWARES = {
   "scrapy_2_crawl_service.middlewares.Scrapy2CrawlServiceDownloaderMiddleware": 543
}
DOWNLOAD_HANDLERS = {
   "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
   "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
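
The scrapy-playwright README also requires the asyncio-based Twisted reactor, which is not shown in the snippet above; a sketch of that additional setting:

# settings.py (additional setting required by scrapy-playwright; sketch)
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"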

I am following this guide: https://scrapeops.io/python-scrapy-playbook/scrapy-playwright/

Why am I getting this error?

(edited to adjust formatting)

elacuesta commented 2 weeks ago

I'm sorry, I cannot reproduce:

# test.py
import scrapy

class TestSpider(scrapy.Spider):
    name = "test"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "headless": False,
        },
        "USER_AGENT": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36",
        "LOG_LEVEL": "INFO",
    }

    def start_requests(self):
        yield scrapy.Request(
            url="URL",
            meta={
                "playwright": True,
                "playwright_include_page": True,
            },
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        await page.screenshot(path="croma.png")
        await page.close()
        print("Response parsing")
        print(response.xpath("//h1/text()").get())
$ scrapy runspider test.py
...
2024-11-12 16:50:44 [scrapy.core.engine] INFO: Spider opened
2024-11-12 16:50:44 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-11-12 16:50:44 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-11-12 16:50:44 [scrapy-playwright] INFO: Starting download handler
2024-11-12 16:50:44 [scrapy-playwright] INFO: Starting download handler
2024-11-12 16:50:49 [scrapy-playwright] INFO: Launching browser chromium
2024-11-12 16:50:49 [scrapy-playwright] INFO: Browser chromium launched
Response parsing
iFFALCON Q73 126 cm (50 inch) 4K Ultra HD QLED Google TV with Dolby Audio (2023 model) 
2024-11-12 16:50:52 [scrapy.core.engine] INFO: Closing spider (finished)
...

Note that I had to set a custom User-Agent, otherwise I was getting 403 status responses.

Versions used:

$ playwright --version       
Version 1.48.0

$ python -c "import scrapy_playwright; print(scrapy_playwright.__version__)"
0.0.42

$ scrapy version -v                   
Scrapy       : 2.11.2
lxml         : 5.2.2.0
libxml2      : 2.12.6
cssselect    : 1.2.0
parsel       : 1.9.1
w3lib        : 2.2.1
Twisted      : 24.3.0
Python       : 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0]
pyOpenSSL    : 24.2.1 (OpenSSL 3.3.1 4 Jun 2024)
cryptography : 43.0.0
Platform     : Linux-6.5.0-45-generic-x86_64-with-glibc2.35
junior-g commented 2 weeks ago

@elacuesta thanks for the quick reply. Yes, it works when I run it as a separate project. But one more issue: when I set

"PLAYWRIGHT_LAUNCH_OPTIONS": {
    "headless": True,
},

it prints None instead of the title. Why is that?

elacuesta commented 1 week ago

I see. I suppose the site could be detecting and blocking headless browsers; I'm seeing the same behavior with standalone Playwright:

import asyncio

from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto(URL)
        await page.screenshot(path="page.png")
        print(await page.locator("//h1").text_content())
        await browser.close()

if __name__ == "__main__":
    asyncio.run(main())

This prints iFFALCON Q73 126 cm (50 inch) 4K Ultra HD QLED Google TV with Dolby Audio (2023 model); however, with headless=True I get a 403 response with "Access Denied".
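
A minimal sketch (reusing the URL placeholder from the script above) to confirm the status code directly instead of inspecting the rendered content:

# status_check.py (sketch; URL is the same placeholder as above)
import asyncio

from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        page = await browser.new_page()
        # page.goto returns the navigation response, which carries the HTTP status
        response = await page.goto(URL)
        print(response.status)  # 403 when the site blocks the headless browser
        await browser.close()

if __name__ == "__main__":
    asyncio.run(main())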