scrapy-plugins / scrapy-playwright

🎭 Playwright integration for Scrapy
BSD 3-Clause "New" or "Revised" License

Overridden method for Playwright request to original=POST new=GET #239

Closed tommylge closed 6 months ago

tommylge commented 8 months ago

Hello, I'm facing an issue with the request method override. I don't understand why it is done, and it causes an error in my script.

I've replaced overrides["method"] = method with overrides["method"] = playwright_request.method.upper() and it works fine. I've seen some issues that might be related, but I'm not sure about the solution or answer you gave there.

The line in question is in scrapy_playwright/handler.py, line 598.
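
To make the difference between the two assignments concrete, here is a minimal standalone sketch. It does not reproduce handler.py: build_overrides and its parameters are hypothetical names, and the "only override when the methods differ" condition is an assumption inferred from the log message below, not copied from the library.

```python
def build_overrides(scrapy_method: str, playwright_method: str,
                    keep_browser_method: bool) -> dict:
    """Hypothetical helper: decide which method to send for a request.

    scrapy_method       -- method of the original Scrapy request (e.g. "GET")
    playwright_method   -- method the browser is about to use (e.g. "POST")
    keep_browser_method -- True mimics the patched line from this issue,
                           False mimics the unpatched line
    """
    overrides = {}
    # Assumption: an override is only considered when the two methods differ.
    if scrapy_method.upper() != playwright_method.upper():
        if keep_browser_method:
            # Patched: overrides["method"] = playwright_request.method.upper()
            overrides["method"] = playwright_method.upper()
        else:
            # Unpatched: overrides["method"] = method
            overrides["method"] = scrapy_method
    return overrides


# Scrapy asked for GET, the browser wants to send POST.
print(build_overrides("GET", "POST", keep_browser_method=False))
print(build_overrides("GET", "POST", keep_browser_method=True))
```

Under these assumptions, the unpatched variant forces the Scrapy method (GET) onto the browser's POST request, while the patched variant leaves the browser's method in place, so it effectively disables the override.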

The error I get when I do not replace your code:

DEBUG    [14:02:00]    DEBUG     [Context=unblocked] Overridden method for Playwright request to https://www.example.com/: original=POST new=GET  handler.py:611
ERROR    [14:02:00]    ERROR     Error downloading <GET https://www.example.com>  scraper.py:328
Traceback (most recent call last):
  File "/Users/x/Desktop/driver_tester/.venv/lib/python3.11/site-packages/twisted/internet/defer.py", line 1693, in _inlineCallbacks
    result = context.run(
  File "/Users/x/Desktop/driver_tester/.venv/lib/python3.11/site-packages/twisted/python/failure.py", line 518, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/Users/x/Desktop/driver_tester/.venv/lib/python3.11/site-packages/scrapy/core/downloader/middleware.py", line 54, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/Users/x/Desktop/driver_tester/.venv/lib/python3.11/site-packages/twisted/internet/defer.py", line 1065, in adapt
    extracted = result.result()
  File "/Users/x/Desktop/driver_tester/.venv/lib/python3.11/site-packages/scrapy_playwright/handler.py", line 324, in _download_request
    return await self._download_request_with_page(request, page, spider)
  File "/Users/x/Desktop/driver_tester/.venv/lib/python3.11/site-packages/scrapy_playwright/handler.py", line 376, in _download_request_with_page
    await self._apply_page_methods(page, request, spider)
  File "/Users/x/Desktop/driver_tester/.venv/lib/python3.11/site-packages/scrapy_playwright/handler.py", line 490, in _apply_page_methods
    pm.result = await _maybe_await(method(*pm.args, **pm.kwargs))
  File "/Users/x/Desktop/driver_tester/.venv/lib/python3.11/site-packages/scrapy_playwright/_utils.py", line 16, in _maybe_await
    return await obj
  File "/Users/x/Desktop/driver_tester/.venv/lib/python3.11/site-packages/playwright/async_api/_generated.py", line 9408, in wait_for_url
    await self._impl_obj.wait_for_url(
  File "/Users/x/Desktop/driver_tester/.venv/lib/python3.11/site-packages/playwright/_impl/_page.py", line 498, in wait_for_url
    return await self._main_frame.wait_for_url(**locals_to_params(locals()))
  File "/Users/x/Desktop/driver_tester/.venv/lib/python3.11/site-packages/playwright/_impl/_frame.py", line 226, in wait_for_url
    async with self.expect_navigation(
  File "/Users/x/Desktop/driver_tester/.venv/lib/python3.11/site-packages/playwright/_impl/_event_context_manager.py", line 33, in __aexit__
    await self._future
  File "/Users/x/Desktop/driver_tester/.venv/lib/python3.11/site-packages/playwright/_impl/_frame.py", line 203, in continuation
    raise Error(event["error"])
playwright._impl._api_types.Error: resource exceeds maximum size

My script:

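import logging

import scrapy
from scrapy.http import Response
from scrapy_playwright.page import PageMethod

# Assumed: LOGGER is a module-level logger (its definition was not shown in the post).
LOGGER = logging.getLogger(__name__)


class TestSpider(scrapy.Spider):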
    name = "test"

    def __init__(self, url: str, wait_url: str | None, *args, **kwargs) -> None:
        self.url = url
        self.wait_url = wait_url

        if not self.url:
            raise Exception('Missing url in spider.')

        super().__init__(*args, **kwargs)

    def start_requests(self):
        yield scrapy.Request(url=self.url, meta={
            'playwright': True,
            'playwright_include_page': True,
            'playwright_context': 'custom',
            'playwright_page_goto_kwargs': {
                'wait_until': 'load',
            },
            'playwright_page_methods': (
                PageMethod('wait_for_url', self.wait_url),
            ),
        })

    def parse(self, response: Response, **kwargs):
        LOGGER.info(f'[Spider] Parsing page: {response.url}')

My related settings:

PLAYWRIGHT_PROCESS_REQUEST_HEADERS = None
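
For context, a spider like the one above normally needs a few more settings than the single line shown. The fragment below is a sketch of a typical minimal settings.py for scrapy-playwright. Only the PLAYWRIGHT_PROCESS_REQUEST_HEADERS line comes from this post; the rest is an assumption about the project's configuration.

```python
# settings.py (sketch; only the last setting is taken from the post)

# Assumed: route HTTP(S) downloads through the scrapy-playwright handler.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# Assumed: scrapy-playwright requires the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# From the post: None disables header processing, so the browser's
# own headers are sent instead of the Scrapy request headers.
PLAYWRIGHT_PROCESS_REQUEST_HEADERS = None
```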

Possibly related issue: https://github.com/scrapy-plugins/scrapy-playwright/issues/176

elacuesta commented 8 months ago

See this comment for an explanation of why it's necessary to override the method for certain requests. A number of things need to happen for this to occur. Make sure you're using the latest version of this package, because this was modified not so long ago (#177).
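
One quick way to check which version is installed in the active environment is the standard library's importlib.metadata. This is a small sketch: installed_version is a hypothetical helper name, and it only reports the local version. It does not tell you whether that version includes the change from #177.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints the installed version string, or None if the package is absent.
    print(installed_version("scrapy-playwright"))
```

Compare the printed version against the latest release on PyPI before retrying the spider.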