scrapy-plugins / scrapy-playwright

🎭 Playwright integration for Scrapy

Re-launch chromium browser option if the browser suddenly crashed? #294

Closed by gelodefaultbrain 4 months ago

gelodefaultbrain commented 4 months ago

Hi! Sorry, is there a way for scrapy-playwright to re-launch the browser if it crashes, and pick up from where it left off? Say the browser suddenly crashes on URL number 50 with some error: is there a configuration that would re-launch the browser and continue from there? Thanks!

This is what I've tried so far:

class TimeOutExceptionHandlerMiddleware:

    def process_exception(self, request, exception, spider):
        if isinstance(exception, Exception):
            if self._is_browser_closed_exception(exception):  # helper defined elsewhere in my project
                spider.logger.error(f"Browser closed unexpectedly for {request.url}: {exception}")
            spider.logger.error(f"Timeout occurred for {request.url}: {exception}")
            return self._retry(request, exception, spider)

    def _retry(self, request, exception, spider):
        # Track how many times this request has been retried via its meta dict
        retries = request.meta.get('retry_times', 0) + 1
        retry_times = spider.custom_settings.get('RETRY_TIMES', 3)

        if retries <= retry_times:
            spider.logger.error(f"Retrying {retries}/{retry_times} for {request.url}")
            retry_req = request.copy()
            retry_req.dont_filter = True  # ensure the retried request is not dropped by the dupefilter
            retry_req.meta['retry_times'] = retries
            return retry_req

I just don't get it, because in _retry we are returning the request object retry_req, but it doesn't seem to work. Am I missing something?
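
(As an aside, a downloader middleware like this only runs if it is enabled in the crawl's settings. A minimal sketch, assuming a hypothetical myproject.middlewares module; the priority value 550 is arbitrary:)

DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.TimeOutExceptionHandlerMiddleware": 550,  # hypothetical import path
}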

elacuesta commented 4 months ago

Seems like a duplicate of #167. Are you seeing "playwright._impl._api_types.Error: Target page, context or browser has been closed" in your logs?

gelodefaultbrain commented 4 months ago

> Seems like a duplicate of #167. Are you seeing "playwright._impl._api_types.Error: Target page, context or browser has been closed" in your logs?

Hi! Yup, I can confirm that I am seeing that. I've actually made some attempts to re-launch it myself, but they didn't work that well, which is why I reached out here.

elacuesta commented 4 months ago

Got it, thanks for the confirmation. Given that, I'm closing this in favor of #167; #295 is already in the works.

gelodefaultbrain commented 4 months ago

Hi!!! Omg, was the PR to enable relaunching already merged? Please reply asap... If so, how do I use it? Thank you so much!!!

gelodefaultbrain commented 4 months ago

> Got it, thanks for the confirmation. Given that, I'm closing this in favor of #167; #295 is already in the works.

Thanks @elacuesta

elacuesta commented 4 months ago

It was merged & released as v0.0.39. There's a new PLAYWRIGHT_RESTART_DISCONNECTED_BROWSER setting which is True by default, so if you want it enabled there's no need to do anything.
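
For illustration, a minimal settings.py sketch (since the setting already defaults to True, spelling it out is only needed if you want to change the value):

# settings.py
PLAYWRIGHT_RESTART_DISCONNECTED_BROWSER = True  # default; set to False to disable automatic restarts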

gelodefaultbrain commented 4 months ago

> It was merged & released as v0.0.39. There's a new PLAYWRIGHT_RESTART_DISCONNECTED_BROWSER setting which is True by default, so if you want it enabled there's no need to do anything.

Thanks, but I think I need to update my installed scrapy-playwright first, right?
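
(For anyone reading along, upgrading is the standard pip flow:)

pip install --upgrade scrapy-playwright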

gelodefaultbrain commented 4 months ago

Hi @elacuesta

Thanks for this. I see the updates landed 5 days ago. I just wanted to ask for some clarifications:

1.) It was mentioned here that it will "Restart on browser crash". But when I ran my scrapy-playwright spider and forcefully closed the browser, for some reason the browser doesn't pop back up or retry the URL where it left off. I'm wondering why that is, or whether that case isn't covered after all? It does say "Target page, context or browser has been closed", so I'm wondering why.

Also, I did encounter crashes (not forced) of my Chromium browser during a run, and it did show "Target page, context or browser has been closed". Maybe the fix will kick in then; we'll see.

Thank you! PS: I've already updated my scrapy-playwright to the latest version, btw.

elacuesta commented 4 months ago

You're correct, I've opened #304 about this. For now, as a workaround, I'd recommend catching the exception with an errback and rescheduling the request, something like:

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"  # added so the snippet is runnable as-is

    def start_requests(self):
        yield scrapy.Request("https://httpbin.org/get", meta={"playwright": True}, errback=self.errback)

    def errback(self, failure):
        # Reschedule the failed request; dont_filter=True bypasses the dupefilter
        print("Handling exception:", failure.value)
        yield failure.request.replace(dont_filter=True)
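
A possible refinement of the snippet above (a sketch building on it, not part of the original reply): cap the number of reschedules with the same retry_times meta counter used in the middleware earlier in the thread, so a permanently broken URL cannot loop forever. The cap of 3 is an arbitrary choice for illustration:

    def errback(self, failure):
        request = failure.request
        retries = request.meta.get("retry_times", 0) + 1
        if retries <= 3:  # arbitrary cap for illustration
            self.logger.warning("Retry %d for %s: %s", retries, request.url, failure.value)
            yield request.replace(dont_filter=True, meta={**request.meta, "retry_times": retries})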