scrapy-plugins / scrapy-playwright

🎭 Playwright integration for Scrapy
BSD 3-Clause "New" or "Revised" License

Issue running scrape on Mac #264

Open lovgrandma opened 3 months ago

lovgrandma commented 3 months ago

I am getting the following error, but I am unsure why the argument passed is invalid.

Model Name: MacBook Pro
Model Identifier: Mac14,7
Model Number: MNEJ3LL/A
Chip: Apple M2
Total Number of Cores: 8 (4 performance and 4 efficiency)
Memory: 8 GB
System Firmware Version: 8419.80.7
OS Loader Version: 8419.80.7
System Version: macOS 13.2.1 (22D68)
Kernel Version: Darwin 22.3.0
Boot Volume: Macintosh HD
Boot Mode: Normal

See code:

    def parse(self, response):
        # Check if the page is a log-in or authentication page
        if self.is_login_page(response):
            self.logger.info(f"Ignoring log-in page: {response.url}")
            return

        print("Extracting")
        # Extract data from the current page
        extracted_data = self.extract_data(response)

        print("Update Meta")
        # Insert the entire response into the database
        self.update_meta(response.url, extracted_data)

        print("Adding Url", response.url)
        self.visited_urls.add(response.url)
        yield extracted_data

        print("View Response", response)

        # Extracting links to other pages
        for link in response.css("a::attr(href)").getall():
            absolute_url = urljoin(response.url, link)
            if absolute_url.startswith("javascript:"):
                continue  # Ignore JavaScript links
            if absolute_url not in self.visited_urls:
                print("Run Req", absolute_url)
                self.visited_urls.add(absolute_url)  # Mark as visited so this link is not requested again
                yield scrapy.Request(
                    url=absolute_url, callback=self.parse, errback=self.error_handler, meta={"playwright": True}
                )

See the error below. I am unsure what to make of it, other than that an argument is bad, but which one and where?

New Url {'domain': 'www.acnestudios.com', 'raw': 'https://www.acnestudios.com/ca/en/twill-trousers-dark-grey/BK0589-AA3', 'lastScrape': datetime.datetime(2024, 3, 16, 15, 25, 27, 347681)}
2024-03-16 15:25:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.acnestudios.com/robots.txt> (referer: None)
2024-03-16 15:25:30 [scrapy-playwright] INFO: Launching browser chromium
2024-03-16 15:25:31 [scrapy-playwright] INFO: Browser chromium launched
2024-03-16 15:25:31 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False, remote=False)
2024-03-16 15:25:31 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2024-03-16 15:25:31 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://www.acnestudios.com/ca/en/twill-trousers-dark-grey/BK0589-AA3> (resource type: document)
2024-03-16 15:25:31 [scrapy-playwright] WARNING: Closing page due to failed request: <GET https://www.acnestudios.com/ca/en/twill-trousers-dark-grey/BK0589-AA3> exc_type=<class 'playwright._impl._errors.Error'> exc_msg=net::ERR_INVALID_ARGUMENT at https://www.acnestudios.com/ca/en/twill-trousers-dark-grey/BK0589-AA3
Traceback (most recent call last):
  File "/Users/user/Projects/tycoon-web-scraper/tycoon-web-crawler-python-base/scraper/lib/python3.11/site-packages/scrapy_playwright/handler.py", line 340, in _download_request
    return await self._download_request_with_page(request, page, spider)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/Projects/tycoon-web-scraper/tycoon-web-crawler-python-base/scraper/lib/python3.11/site-packages/scrapy_playwright/handler.py", line 369, in _download_request_with_page
    response, download = await self._get_response_and_download(request=request, page=page)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/Projects/tycoon-web-scraper/tycoon-web-crawler-python-base/scraper/lib/python3.11/site-packages/scrapy_playwright/handler.py", line 461, in _get_response_and_download
    response = await page.goto(url=request.url, **page_goto_kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/Projects/tycoon-web-scraper/tycoon-web-crawler-python-base/scraper/lib/python3.11/site-packages/playwright/async_api/_generated.py", line 8612, in goto
    await self._impl_obj.goto(
  File "/Users/user/Projects/tycoon-web-scraper/tycoon-web-crawler-python-base/scraper/lib/python3.11/site-packages/playwright/_impl/_page.py", line 500, in goto
    return await self._main_frame.goto(**locals_to_params(locals()))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/Projects/tycoon-web-scraper/tycoon-web-crawler-python-base/scraper/lib/python3.11/site-packages/playwright/_impl/_frame.py", line 145, in goto
    await self._channel.send("goto", locals_to_params(locals()))
  File "/Users/user/Projects/tycoon-web-scraper/tycoon-web-crawler-python-base/scraper/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 59, in send
    return await self._connection.wrap_api_call(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/Projects/tycoon-web-scraper/tycoon-web-crawler-python-base/scraper/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 509, in wrap_api_call
    return await cb()
           ^^^^^^^^^^
  File "/Users/user/Projects/tycoon-web-scraper/tycoon-web-crawler-python-base/scraper/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 97, in inner_send
    result = next(iter(done)).result()
             ^^^^^^^^^^^^^^^^^^^^^^^^^
playwright._impl._errors.Error: net::ERR_INVALID_ARGUMENT at https://www.acnestudios.com/ca/en/twill-trousers-dark-grey/BK0589-AA3
2024-03-16 15:25:31 [TycoonSpider] ERROR: Error: net::ERR_INVALID_ARGUMENT at https://www.acnestudios.com/ca/en/twill-trousers-dark-grey/BK0589-AA3
2024-03-16 15:25:31 [scrapy.core.engine] INFO: Closing spider (finished)

lovgrandma commented 3 months ago

When I print the types of the three arguments passed into the call where the error occurs, I get the following:

Request: <class 'scrapy.http.request.Request'>
Page: <class 'playwright.async_api._generated.Page'>
Spider: <class 'tycoon_crawler.spiders.tycoon_spider.TycoonSpider'>

I am unsure why I get this warning:

2024-03-16 15:25:31 [scrapy-playwright] WARNING: Closing page due to failed request: ........ ERR_INVALID_ARGUMENT at https://www.acnestudios.com/ca/en/tw.....

given that the declaration looks like this:

async def _download_request_with_page(self, request: Request, page: Page, spider: Spider) -> Response:

elacuesta commented 3 months ago

This is not an incorrect argument being passed between methods. The net::ERR_INVALID_ARGUMENT error comes from Chromium: search for "chromium net::ERR_INVALID_ARGUMENT" and you'll see many results. I'm not 100% sure, but IIRC this might be related to SSL verification. I think I remember seeing it myself when working with proxies that needed an external certificate to verify HTTPS connections.
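
To narrow this down, something like the following sketch might help. It does two things: it pulls the Chromium error name out of the exception text (the `net::ERR_...` part seen in the log above), and it relaxes HTTPS verification on the `default` context to test the SSL hypothesis. The helper name `chromium_net_error` is made up for illustration. `PLAYWRIGHT_CONTEXTS` is a scrapy-playwright setting and `ignore_https_errors` is a Playwright context option, but whether it makes this particular error go away is only an assumption, so treat it as a diagnostic step rather than a fix.

```python
import re
from typing import Optional

# Chromium network errors show up in the Playwright exception text as
# "net::ERR_<NAME>", e.g. "net::ERR_INVALID_ARGUMENT at https://...".
_NET_ERROR = re.compile(r"net::(ERR_[A-Z0-9_]+)")


def chromium_net_error(message: str) -> Optional[str]:
    """Return the Chromium error name found in an exception message, or None.

    Hypothetical helper, not part of scrapy-playwright or Playwright.
    """
    match = _NET_ERROR.search(message)
    return match.group(1) if match else None


# For settings.py: ask Playwright to ignore HTTPS certificate errors on the
# "default" context, only to check whether certificate verification is what
# triggers the failure. Not confirmed as a fix for this issue.
PLAYWRIGHT_CONTEXTS = {
    "default": {"ignore_https_errors": True},
}
```

Inside the spider's errback, calling `chromium_net_error(str(failure.value))` would return `ERR_INVALID_ARGUMENT` for the error above, which makes it easier to log or branch on browser-level failures separately from Python exceptions.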