scrapy-plugins / scrapy-playwright

🎭 Playwright integration for Scrapy
BSD 3-Clause "New" or "Revised" License

KeyError: 'playwright_page' #272

Closed: Nekender02 closed this issue 3 weeks ago

Nekender02 commented 1 month ago
async def errback_close_page(self, failure):
        page = failure.request.meta["playwright_page"]
        await page.close()

def start_requests(self):
        if not self.start_urls and hasattr(self, "start_url"):
            raise AttributeError(
                "Crawling could not start: 'start_urls' not found "
                "or empty (but found 'start_url' attribute instead, "
                "did you miss an 's'?)"
            )
        for url in self.start_urls:
            npo = self.npos[url]
            logging.info("### crawl: %s", url)
            yield scrapy.Request(
                url,
                callback=self.my_parse,
                dont_filter=True,
                meta={"playwright": True, "playwright_include_page": True, "start_time": datetime.utcnow()},
                cb_kwargs={"npo": npo},
                errback=self.errback_close_page,
            )

Can anyone please explain why I am getting this error and how I can fix it?

Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/twisted/internet/defer.py", line 1065, in adapt
    extracted = result.result()
  File "/home/ec2-user/SageMaker/xx", line 50, in errback_close_page
    page = failure.request.meta["playwright_page"]
KeyError: 'playwright_page'

elacuesta commented 1 month ago

It could be that you're not correctly activating the download handler. Are you seeing log lines from the scrapy-playwright logger? i.e. something like:

2024-05-25 16:06:00 [scrapy-playwright] INFO: Starting download handler
2024-05-25 16:06:05 [scrapy-playwright] INFO: Launching browser chromium
2024-05-25 16:06:05 [scrapy-playwright] INFO: Browser chromium launched
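
For reference, a minimal settings.py sketch that activates the handler, following the scrapy-playwright README (only these two settings are shown; everything else in your project settings stays as it is):

# Sketch based on the scrapy-playwright README: both the download handlers
# and the asyncio reactor need to be set for the handler to be used.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"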
Nekender02 commented 1 month ago

Yes, I am seeing those log lines.

I have added the download handlers in my settings.py:

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

Nekender02 commented 1 month ago

Any updates? I am still not able to solve the KeyError: 'playwright_page'.

elacuesta commented 1 month ago

There is not enough information to reproduce. Please refer to https://github.com/scrapy-plugins/scrapy-playwright?tab=readme-ov-file#reporting-issues.

jiaohu commented 1 month ago

More detail: I want to use my spider to get info from https://www.manamana.net/exploreauth. I wrote my code following the sample, but I get this error too. I then translated my code to standalone Playwright, as suggested:

import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://www.manamana.net/exploreauth")
        await page.screenshot(path="example.png", full_page=True)
        await browser.close()

asyncio.run(main())

Then I get the screenshot saved as example.png, but what I want is the fully rendered page shown in the second attached screenshot.

The reason is that there are several seconds between these two states; I think there is some bug in how Playwright renders a page that is dynamic.

elacuesta commented 1 month ago

The reason is that there are several seconds between these two states; I think there is some bug in how Playwright renders a page that is dynamic.

This is precisely why the docs suggest trying standalone Playwright for debugging. You are not getting the result you expect with standalone Playwright, so the problem is not caused by scrapy-playwright.

This is not a bug though. I'd recommend ensuring the DOM is fully loaded before returning the result by passing wait_until="domcontentloaded" to Page.goto in standalone Playwright, or via the playwright_page_goto_kwargs Request meta key in scrapy-playwright.
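
For illustration, a request sketch using that meta key (the url variable and the callback name here are placeholders; the meta keys themselves are the ones documented by scrapy-playwright):

yield scrapy.Request(
    url,
    meta={
        "playwright": True,
        # Forwarded as keyword arguments to Page.goto by scrapy-playwright
        "playwright_page_goto_kwargs": {"wait_until": "domcontentloaded"},
    },
    callback=self.my_parse,
)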

but I get this error too

If by "this error" you mean the originally reported error about the a KeyError on the playwright_page meta key, there is no Scrapy code in your to work with.

Nekender02 commented 1 month ago

Additional information with logs

async def errback_close_page(self, failure):
        print(f"===================================================== {failure.request.meta}============================")
        page = failure.request.meta["playwright_page"]
        print(f"xxxxxxxxxxxxxxxxxxxxx {page} is closed xxxxxxxxxxxxxxxxxxxxxxxxxxxx")
        await page.close()

def start_requests(self):
        if not self.start_urls and hasattr(self, "start_url"):
            raise AttributeError(
                "Crawling could not start: 'start_urls' not found "
                "or empty (but found 'start_url' attribute instead, "
                "did you miss an 's'?)"
            )
        for url in self.start_urls:
            npo = self.npos[url]
            logging.info("### crawl: %s", url)
            yield scrapy.Request(
                url,
                callback=self.my_parse,
                dont_filter=True,
                meta=dict(playwright=True, playwright_include_page=True, start_time=datetime.utcnow()),
                cb_kwargs={"npo": npo},
                errback=self.errback_close_page,
            )

So here is the log with the meta printed; as you can see, there is no playwright_page in the meta for the second print:

2024-06-02 21:13:05 [root] INFO: ### crawl: http://www.hhwf.org
2024-06-02 21:13:05 [root] INFO: ### crawl: http://thekauaimarathon.com
2024-06-02 21:13:05 [root] INFO: ### crawl: http://www.ohiorecruiters.org
2024-06-02 21:13:05 [scrapy-playwright] INFO: Launching browser chromium
2024-06-02 21:13:05 [scrapy-playwright] INFO: Launching browser chromium
2024-06-02 21:13:05 [scrapy-playwright] INFO: Browser chromium launched
2024-06-02 21:13:06 [scrapy-playwright] INFO: Launching browser chromium
2024-06-02 21:13:06 [scrapy-playwright] INFO: Browser chromium launched
2024-06-02 21:13:06 [scrapy-playwright] INFO: Browser chromium launched
2024-06-02 21:13:06 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET http://www.ohiorecruiters.org/robots.txt>: DNS lookup failed: no results for hostname lookup: www.ohiorecruiters.org.
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/twisted/internet/defer.py", line 1693, in _inlineCallbacks
    result = context.run(
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/twisted/python/failure.py", line 518, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 54, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/twisted/internet/defer.py", line 892, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/twisted/internet/endpoints.py", line 1022, in startConnectionAttempts
    raise error.DNSLookupError(
twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: www.ohiorecruiters.org.
2024-06-02 21:13:06 [scrapy-playwright] INFO: Launching browser chromium
2024-06-02 21:13:07 [scrapy-playwright] INFO: Browser chromium launched
**===================================================== {'playwright': True, 'playwright_include_page': True, 'start_time': datetime.datetime(2024, 6, 2, 21, 13, 5, 727733), 'download_timeout': 180.0, 'download_slot': 'www.ohiorecruiters.org', 'playwright_context': 'default', 'playwright_page': <Page url='chrome-error://chromewebdata/'>}============================**
xxxxxxxxxxxxxxxxxxxxx <Page url='chrome-error://chromewebdata/'> is closed xxxxxxxxxxxxxxxxxxxxxxxxxxxx
2024-06-02 21:13:08 [scrapy.core.engine] INFO: Closing spider (finished)
2024-06-02 21:13:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 2,
 'downloader/exception_type_count/playwright._impl._api_types.Error': 1,
 'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 1,
 'downloader/request_bytes': 612,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'elapsed_time_seconds': 8.041216,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2024, 6, 2, 21, 13, 8, 767138),
 'log_count/ERROR': 1,
 'log_count/INFO': 28,
 'memusage/max': 175394816,
 'memusage/startup': 175394816,
 'playwright/context_count': 1,
 'playwright/context_count/max_concurrent': 1,
 'playwright/context_count/persistent/False': 1,
 'playwright/context_count/remote/False': 1,
 'playwright/page_count': 1,
 'playwright/page_count/max_concurrent': 1,
 'playwright/request_count': 1,
 'playwright/request_count/method/GET': 1,
 'playwright/request_count/navigation': 1,
 'playwright/request_count/resource_type/document': 1,
 "robotstxt/exception_count/<class 'twisted.internet.error.DNSLookupError'>": 1,
 'robotstxt/request_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2024, 6, 2, 21, 13, 0, 725922)}
2024-06-02 21:13:08 [scrapy.core.engine] INFO: Spider closed (finished)
2024-06-02 21:13:08 [scrapy-playwright] INFO: Closing download handler
2024-06-02 21:13:08 [scrapy-playwright] INFO: Closing browser
2024-06-02 21:13:09 [scrapy-playwright] INFO: Closing download handler
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/'>
2024-06-02 21:13:11 [scrapy-playwright] INFO: Launching browser chromium
2024-06-02 21:13:11 [scrapy-playwright] INFO: Browser chromium launched
$$$$$$$$$$$$$$$$ able to access page <Page url='https://hhwf.org/'>
2024-06-02 21:13:13 [scrapy.core.engine] INFO: Closing spider (finished)
2024-06-02 21:13:13 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 6612,
 'downloader/request_count': 22,
 'downloader/request_method_count/GET': 22,
 'downloader/response_bytes': 79598,
 'downloader/response_count': 22,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/301': 21,
 'elapsed_time_seconds': 12.589505,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2024, 6, 2, 21, 13, 13, 292754),
 'item_scraped_count': 3,
 'log_count/ERROR': 1,
 'log_count/INFO': 50,
 'memusage/max': 174043136,
 'memusage/startup': 174043136,
 'offsite/domains': 2,
 'offsite/filtered': 31,
 'playwright/context_count': 1,
 'playwright/context_count/max_concurrent': 1,
 'playwright/context_count/persistent/False': 1,
 'playwright/context_count/remote/False': 1,
 'playwright/page_count': 1,
 'playwright/page_count/max_concurrent': 1,
 'playwright/request_count': 71,
 'playwright/request_count/method/GET': 70,
 'playwright/request_count/method/POST': 1,
 'playwright/request_count/navigation': 4,
 'playwright/request_count/resource_type/document': 4,
 'playwright/request_count/resource_type/fetch': 2,
 'playwright/request_count/resource_type/font': 9,
 'playwright/request_count/resource_type/image': 12,
 'playwright/request_count/resource_type/other': 2,
 'playwright/request_count/resource_type/ping': 1,
 'playwright/request_count/resource_type/script': 29,
 'playwright/request_count/resource_type/stylesheet': 12,
 'playwright/response_count': 71,
 'playwright/response_count/method/GET': 70,
 'playwright/response_count/method/POST': 1,
 'playwright/response_count/resource_type/document': 4,
 'playwright/response_count/resource_type/fetch': 2,
 'playwright/response_count/resource_type/font': 9,
 'playwright/response_count/resource_type/image': 12,
 'playwright/response_count/resource_type/other': 2,
 'playwright/response_count/resource_type/ping': 1,
 'playwright/response_count/resource_type/script': 29,
 'playwright/response_count/resource_type/stylesheet': 12,
 'request_depth_max': 1,
 'response_received_count': 1,
 'robotstxt/request_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2024, 6, 2, 21, 13, 0, 703249)}
2024-06-02 21:13:13 [scrapy.core.engine] INFO: Spider closed (finished)
2024-06-02 21:13:13 [scrapy-playwright] INFO: Closing download handler
2024-06-02 21:13:13 [scrapy-playwright] INFO: Closing browser
2024-06-02 21:13:13 [scrapy-playwright] INFO: Closing download handler
$$$$$$$$$$$$$$$$ able to access page <Page url='https://thekauaimarathon.com/'>
2024-06-02 21:13:14 [scrapy-playwright] INFO: Launching browser chromium
2024-06-02 21:13:14 [scrapy-playwright] INFO: Browser chromium launched
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/recent-news'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/support-revere'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/board'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/about-us'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/volunteer'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/cart'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/blog'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/about-us'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/recent-news'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.paypal.com/donate/?cmd=_s-xclick&hosted_button_id=9H2AWXUE4L3M8&source=url&ssrt=1717362812122'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/back-to-school'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/recent-news-1/2019-scholarship-awards'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/recent-news-1/revere-hall-of-fame-induction-2019'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/recent-news-1/community-health-and-wellness-night-2019'>
**===================================================== {'playwright': True, 'playwright_include_page': True, 'start_time': datetime.datetime(2024, 6, 2, 21, 13, 51, 897863), 'depth': 3}============================**
2024-06-02 21:13:52 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.reverefoundation.com/recent-news-1?author=51140743e4b099bd04eedd7d> (referer: https://www.reverefoundation.com/recent-news-1/2019-scholarship-awards)
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/twisted/internet/defer.py", line 1065, in adapt
    extracted = result.result()
  File "/home/ec2-user/SageMaker/grant-crawler/giboo/spiders/npos.py", line 50, in errback_close_page
    page = failure.request.meta["playwright_page"]
KeyError: 'playwright_page'
$$$$$$$$$$$$$$$$ able to access page <Page url='https://thekauaimarathon.com/register/marathon-and-half-marathon/'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://thekauaimarathon.com/register/'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://thekauaimarathon.com/the-race/'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://thekauaimarathon.com/register/keiki-run/'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://thekauaimarathon.com/the-race/the-course/'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://thekauaimarathon.com/the-race/expo/'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://thekauaimarathon.com/the-race/race-activities/'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://thekauaimarathon.com/the-race/faqs/'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/recent-news-1/see-something-say-something-spring-2019'>
2024-06-02 21:14:00 [scrapy.extensions.logstats] INFO: Crawled 18 pages (at 18 pages/min), scraped 77 items (at 77 items/min)
2024-06-02 21:14:00 [scrapy.extensions.logstats] INFO: Crawled 10 pages (at 10 pages/min), scraped 12 items (at 12 items/min)
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/recent-news-1/tag/2019'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/recent-news-1/tag/produce'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/recent-news-1/tag/protect'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://thekauaimarathon.com/photos-and-videos/'>


elacuesta commented 3 weeks ago

The failing requests have a Referer header, so they are not coming from your start_requests method. You haven't shared a full spider, but I suspect those specific requests do not have playwright_include_page=True in their meta. I'm closing this; if you're still having trouble and want to reopen the issue, you need to include a Minimal, Reproducible Example as instructed in the Reporting issues section.
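
For illustration, a sketch of a callback that repeats the meta keys on follow-up requests (parse_links is a placeholder name; only the meta/errback handling matters here):

def parse_links(self, response):
    # Meta keys are per-request: follow-up requests need playwright and
    # playwright_include_page set again, they are not inherited from the
    # request that produced this response.
    for href in response.css("a::attr(href)").getall():
        yield response.follow(
            href,
            callback=self.parse_links,
            meta={"playwright": True, "playwright_include_page": True},
            errback=self.errback_close_page,
        )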

Nekender02 commented 1 week ago

On further analysis I was able to pinpoint the exact error: requests forbidden by robots.txt are causing the crawler to freeze. The issue is that failed requests which are forbidden by robots.txt have no playwright_page associated with them, so I am not able to close them with page.close(), and the crawler freezes at page = failure.request.meta["playwright_page"].

Here is the minimal reproducible code:

import re
import scrapy
import logging
from bs4 import BeautifulSoup
from datetime import datetime
from urllib.parse import urlparse
from scrapy.linkextractors import LinkExtractor

emails_re = re.compile(r"\b([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})\b", re.IGNORECASE)

class GrantsSpider(scrapy.Spider):
    name = "npos"
    reported_links = []
    link_extractor = LinkExtractor(unique=True)
    npos = {}

    async def errback_close_page(self, failure):
        logging.error(f"Error processing {failure.request.url}: {repr(failure)}")
        page = failure.request.meta.get("playwright_page")
        if page:
            await page.close()
            await page.context.close()
            logging.info(f"Page {page} closed on error")

    def start_requests(self):
        url = 'https://www.guidestar.org/profile/26-1700710'
        npo = {"ein": "26-1700710", "name": "Example NPO", "type": "Nonprofit"}
        logging.info(f"### crawl: {url}")
        yield scrapy.Request(
            url, callback=self.my_parse, dont_filter=True,
            meta={"playwright": True, "playwright_include_page": True, 'dont_redirect': True},
            cb_kwargs={"npo": npo},
            errback=self.errback_close_page
        )

    async def my_parse(self, response, npo):
        page = response.meta["playwright_page"]
        self.reported_links.append(response.url)
        try:
            _ = response.text
        except AttributeError as exc:
            logging.debug(f"Skipping {response.url}: {exc}")
            await page.close()
            return

        body, match = self.is_page(response, None)
        for email in emails_re.findall(body):
            yield {
                "ein": npo["ein"],
                "name": npo["name"],
                "type": npo["type"],
                "msg": "link-email",
                "match": email,
                "text": email,
                "url": response.url,
                "timestamp": datetime.utcnow(),
            }

        for link in response.xpath("//a"):
            href = link.xpath("./@href").get()

            if not href or href.startswith("javascript:") or href.startswith("#"):
                continue

            if not href.startswith("http"):
                href = response.urljoin(href)

            if href not in self.reported_links:
                yield scrapy.Request(
                    href, callback=self.my_parse,
                    meta={"playwright": True, "playwright_include_page": True, 'dont_redirect': True},
                    cb_kwargs={"npo": npo},
                    errback=self.errback_close_page
                )

        await page.close()

    def is_page(self, response, re_expression):
        sel = scrapy.Selector(response)
        sel.xpath("//head").remove()
        sel.xpath("//header").remove()
        # sel.xpath("//footer").remove()
        sel.xpath("//navbar").remove()
        sel.xpath("//a").remove()
        body = sel.get()
        bs_doc = BeautifulSoup(body, features="lxml").get_text(strip=True, separator=" ")
        if not re_expression:
            return bs_doc, None
        if re_expression.search(bs_doc):
            matches = list(set(list(re.findall(re_expression, bs_doc))[0]))
            if "" in matches:
                matches.remove("")
            return bs_doc, matches
        return None, None

elacuesta commented 1 week ago

Sounds reasonable; from a quick look, I think this is the Scrapy exception you're getting.

I'd say that's expected behavior though: the request doesn't reach the download handler, hence no page is created. You'll just need to check in your errback whether there is actually a page to close before attempting to close it. Furthermore, unless your actual spider is bigger and what you posted is a reduced example, I'd recommend you ask yourself if it's really necessary to pass playwright_include_page=True. You don't seem to be interacting with the page besides closing it, so you might be better off not passing playwright_include_page and allowing the handler to close the page on its own. See also these docs.
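
For example, a request sketch that omits playwright_include_page and lets the handler manage the page lifecycle, reusing the names from the spider above:

yield scrapy.Request(
    url,
    # No playwright_include_page: the handler creates and closes the page
    # itself, and the callback receives a regular Response.
    meta={"playwright": True},
    callback=self.my_parse,
    errback=self.errback_close_page,
)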

Nekender02 commented 1 week ago

The actual spider is much bigger and this is just a minimal version of it. The reason for setting playwright_include_page=True is that it gives us more control over the closing of pages, and this way we can better manage memory consumption. Anyway, I was able to solve this by changing my errback method a bit:

from scrapy.exceptions import CloseSpider  # needed for the exception raised below

async def errback_close_page(self, failure):
        self.logger.error(f'Error processing page: {repr(failure)}')
        if "playwright_page" in failure.request.meta:
            page = failure.request.meta["playwright_page"]
            if page:
                await page.close()
                self.logger.info(f"Closed page due to error: {page}")
        raise CloseSpider(reason='Forbidden by robots.txt')

Nekender02 commented 1 week ago

Hi @elacuesta, even after making the changes I am still facing the issue with some links: my spider gets stuck even after it is closed.

{'BOT_NAME': 'test',
 'CLOSESPIDER_ITEMCOUNT': 300,
 'CLOSESPIDER_PAGECOUNT': 300,
 'CLOSESPIDER_TIMEOUT': 300,
 'CONCURRENT_REQUESTS': 8,
 'CONCURRENT_REQUESTS_PER_DOMAIN': 4,
 'CONCURRENT_REQUESTS_PER_IP': 4,
 'COOKIES_ENABLED': False,
 'DEPTH_LIMIT': 2,
 'DEPTH_PRIORITY': 1,
 'DNS_TIMEOUT': 3,
 'DOWNLOAD_DELAY': 0.3,
 'DOWNLOAD_TIMEOUT': 120,
 'FEED_EXPORT_ENCODING': 'utf-8',
 'LOG_LEVEL': 'INFO',
 'NEWSPIDER_MODULE': 'myConfig.spiders',
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'RETRY_ENABLED': False,
 'ROBOTSTXT_OBEY': True,
 'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
 'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue',
 'SPIDER_MODULES': ['myConfig.spiders'],
 'TELNETCONSOLE_ENABLED': False,
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor',
 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
               '(KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'}
2024-06-26 20:30:37 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.closespider.CloseSpider',
 'scrapy.extensions.logstats.LogStats']
2024-06-26 20:30:37 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'myConfig.middlewares.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware']
2024-06-26 20:30:37 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-06-26 20:30:37 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-06-26 20:30:37 [scrapy.core.engine] INFO: Spider opened
2024-06-26 20:30:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
#### Run crawlers 25040-25040
2024-06-26 20:30:38 [scrapy-playwright] INFO: Starting download handler
2024-06-26 20:30:38 [scrapy-playwright] INFO: Starting download handler
2024-06-26 20:30:43 [root] INFO: ### crawl: http://www.trinityschooloftexas.com
2024-06-26 20:30:43 [scrapy-playwright] INFO: Launching browser chromium
2024-06-26 20:30:43 [scrapy-playwright] INFO: Browser chromium launched
2024-06-26 20:30:49 [scrapy.core.scraper] ERROR: Spider error processing <GET http://www.trinityschooloftexas.com> (referer: None)
2024-06-26 20:30:49 [scrapy-playwright] INFO: Launching browser chromium
2024-06-26 20:30:50 [scrapy-playwright] INFO: Browser chromium launched
2024-06-26 20:31:38 [scrapy.extensions.logstats] INFO: Crawled 3 pages (at 3 pages/min), scraped 0 items (at 0 items/min)
2024-06-26 20:32:38 [scrapy.extensions.logstats] INFO: Crawled 3 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-06-26 20:33:38 [scrapy.extensions.logstats] INFO: Crawled 3 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-06-26 20:34:38 [scrapy.extensions.logstats] INFO: Crawled 3 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-06-26 20:35:38 [scrapy.core.engine] INFO: Closing spider (closespider_timeout)
2024-06-26 20:35:38 [scrapy.extensions.logstats] INFO: Crawled 3 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-06-26 20:36:38 [scrapy.extensions.logstats] INFO: Crawled 3 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-06-26 20:37:38 [scrapy.extensions.logstats] INFO: Crawled 3 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

I am linking the GitHub repo with minimal code which will help you replicate the issue; any help would be appreciated. Thanks: https://github.com/nekender/Playwright_Issue/

Nekender02 commented 1 week ago

Hi @elacuesta, did you get the time to have a look?

elacuesta commented 1 week ago

It is not clear to me what the issue is in your last report. Furthermore, what you shared is not minimal: there are multiple files, middlewares, regex processing, selector manipulation, HTML parsing with BeautifulSoup, etc., none of which are likely to be related to what you report. It is often the case that developers understand and solve issues on their own during the process of distilling a program down to the bare minimum necessary to reproduce a problem. So far this issue doesn't contain a reproducible bug report; it has always been a support matter, and there are better resources to handle those (e.g. the scrapy-playwright tag on StackOverflow).