scrapy-plugins / scrapy-playwright

🎭 Playwright integration for Scrapy
BSD 3-Clause "New" or "Revised" License

playwright._impl._api_types.Error: headers[6].value: expected string, got object #197

Closed UsmanMaan324 closed 1 year ago

UsmanMaan324 commented 1 year ago

System info

Source code


Spider code

# ex_spider.py (imports added for completeness; should_abort_request and
# self.errback are defined elsewhere in the reporter's project)
import logging

import scrapy
from scrapy_playwright.page import PageMethod

logger = logging.getLogger(__name__)


class ExSpider(scrapy.Spider):
    name = "ex_spider"
    custom_settings = {
        'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor',
        'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
        'DOWNLOAD_HANDLERS': {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        'PLAYWRIGHT_BROWSER_TYPE': 'chromium',  # the setting name has no SCRAPY_ prefix
        'PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT': 0,  # 0 disables the navigation timeout
        'PLAYWRIGHT_CONTEXTS': {
            "default": {
                "viewport": {
                    "width": 1920,
                    "height": 980,
                }
            }
        },
        'CONCURRENT_REQUESTS': 20,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 20,
        'CONCURRENT_ITEMS': 20,
        'REACTOR_THREADPOOL_MAXSIZE': 20,
        'RETRY_TIMES': 3,
        'PLAYWRIGHT_ABORT_REQUEST': should_abort_request,
    }

    def start_requests(self):
        url = "https://www.paniniamerica.net/checklist"
        logger.info("Start the scraper")
        req = scrapy.Request(
            url,
            callback=self.parse_type,
            # errback must be passed as a Request argument, not inside meta
            errback=self.errback,
            meta=dict(
                playwright=True,
                playwright_context="default",
                playwright_include_page=True,
                playwright_page_methods=[
                    # timeout=0 disables the wait_for_selector timeout
                    PageMethod("wait_for_selector", "select#damage_type",
                               timeout=0, state="visible"),
                    PageMethod("wait_for_load_state", "load"),
                ],
            ),
        )
        print(req.headers)
        yield req

    async def parse_type(self, response):
        print("Here")
        page = response.meta["playwright_page"]
        await page.close()
        logger.info("Check point")
        select_types = response.css("select#damage_type::text").extract()
        logger.info(f"select type are {select_types}")

Steps

Execute the usual crawl command to run the spider (e.g. scrapy crawl ex_spider).

Expected

The spider should run and execution should reach the callback attached to the Scrapy request.

Actual

I got the following error and could not work out why it happens:

 [2023-05-09 07:48:29,579: ERROR/UrlCrawlerScript-1:1] Task exception was never retrieved
web_1         | future: <Task finished name='Task-16' coro=<Channel.send() done, defined at /.venv/lib/python3.10/site-packages/playwright/_impl/_connection.py:38> exception=Error('headers[6].value: expected string, got object')>
web_1         | Traceback (most recent call last):
web_1         |   File "/.venv/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 39, in send
web_1         |     return await self.inner_send(method, params, False)
web_1         |   File "/.venv/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 63, in inner_send
web_1         |     result = next(iter(done)).result()
web_1         | playwright._impl._api_types.Error: headers[6].value: expected string, got object
elacuesta commented 1 year ago

Sorry, I cannot reproduce. For the record, I had to comment out the 'PLAYWRIGHT_ABORT_REQUEST': should_abort_request and errback=self.errback lines because those objects are not defined in your example. Please include your full logs so we can continue debugging.

UsmanMaan324 commented 1 year ago

@elacuesta Thank you for your response. Are you using the same versions of Playwright, Scrapy and scrapy-playwright mentioned above? I get this error only for the URL in the code; the spider works fine on other websites. I am also running scrapy-playwright inside a Docker container.