Closed — Nekender02 closed this issue 3 weeks ago
It could be that you're not correctly activating the download handler. Are you seeing log lines from the scrapy-playwright logger? i.e. something like:
2024-05-25 16:06:00 [scrapy-playwright] INFO: Starting download handler
2024-05-25 16:06:05 [scrapy-playwright] INFO: Launching browser chromium
2024-05-25 16:06:05 [scrapy-playwright] INFO: Browser chromium launched
Yes, I have added the download handlers in my settings.py:

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
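For completeness, the scrapy-playwright README requires an additional setting alongside the handlers; a sketch (the navigation timeout value below is an illustrative assumption, not from this thread):

```python
# Per the scrapy-playwright README, the asyncio-based Twisted reactor must be
# installed in addition to DOWNLOAD_HANDLERS, or the handler never starts.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Optional while debugging: cap how long Playwright waits per navigation (ms).
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 30 * 1000
```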
Any updates? I am still not able to solve the KeyError: 'playwright_page'.
There is not enough information to reproduce. Please refer to https://github.com/scrapy-plugins/scrapy-playwright?tab=readme-ov-file#reporting-issues.
More detail
I want to use my spider to get info from https://www.manamana.net/exploreauth. I wrote my code following the sample, but I get this error too.
I then translated my code to standalone Playwright, as suggested:
import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://www.manamana.net/exploreauth")
        await page.screenshot(path="example.png", full_page=True)
        await browser.close()

asyncio.run(main())
Then I get this picture (first screenshot attached), but I want to get this one (second screenshot attached).
The reason is that there are several seconds between these two states; I think there is some bug in how Playwright renders this dynamic page.
> The reason is that there are several seconds between these two states; I think there is some bug in how Playwright renders this dynamic page.
This is precisely why the docs suggest trying standalone Playwright for debugging. You are not getting the result you expect with standalone Playwright, so the problem is not caused by scrapy-playwright.
This is not a bug though. I'd recommend ensuring the DOM is fully loaded before returning the result by passing wait_until="domcontentloaded" to Page.goto in standalone Playwright, or via the playwright_page_goto_kwargs Request meta key in scrapy-playwright.
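A sketch of that meta key in use; the key names are from the scrapy-playwright README, and the kwargs dict is forwarded to Playwright's Page.goto:

```python
# Sketch: forward goto kwargs through the Request meta so scrapy-playwright
# waits for the DOM before returning the response.
meta = {
    "playwright": True,
    # "networkidle" may be needed for pages that keep loading content via XHR.
    "playwright_page_goto_kwargs": {"wait_until": "domcontentloaded"},
}

# Standalone Playwright equivalent:
#   await page.goto("https://www.manamana.net/exploreauth",
#                   wait_until="domcontentloaded")
```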
> but get this error too
If by "this error" you mean the originally reported error about a KeyError on the playwright_page meta key, there is no Scrapy code in your example to work with.
Additional information with logs
async def errback_close_page(self, failure):
    print(f"===================================================== {failure.request.meta}============================")
    page = failure.request.meta["playwright_page"]
    print(f"xxxxxxxxxxxxxxxxxxxxx {page} is closed xxxxxxxxxxxxxxxxxxxxxxxxxxxx")
    await page.close()

def start_requests(self):
    if not self.start_urls and hasattr(self, "start_url"):
        raise AttributeError(
            "Crawling could not start: 'start_urls' not found "
            "or empty (but found 'start_url' attribute instead, "
            "did you miss an 's'?)"
        )
    for url in self.start_urls:
        npo = self.npos[url]
        logging.info("### crawl: %s", url)
        yield scrapy.Request(
            url,
            callback=self.my_parse,
            dont_filter=True,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                start_time=datetime.utcnow(),
            ),
            cb_kwargs={"npo": npo},
            errback=self.errback_close_page,
        )
So here is the log with the meta printed; as you can see, there is no playwright_page in meta for the second print:
2024-06-02 21:13:05 [root] INFO: ### crawl: http://www.hhwf.org
2024-06-02 21:13:05 [root] INFO: ### crawl: http://thekauaimarathon.com
2024-06-02 21:13:05 [root] INFO: ### crawl: http://www.ohiorecruiters.org
2024-06-02 21:13:05 [scrapy-playwright] INFO: Launching browser chromium
2024-06-02 21:13:05 [scrapy-playwright] INFO: Launching browser chromium
2024-06-02 21:13:05 [scrapy-playwright] INFO: Browser chromium launched
2024-06-02 21:13:06 [scrapy-playwright] INFO: Launching browser chromium
2024-06-02 21:13:06 [scrapy-playwright] INFO: Browser chromium launched
2024-06-02 21:13:06 [scrapy-playwright] INFO: Browser chromium launched
2024-06-02 21:13:06 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET http://www.ohiorecruiters.org/robots.txt>: DNS lookup failed: no results for hostname lookup: www.ohiorecruiters.org.
Traceback (most recent call last):
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/twisted/internet/defer.py", line 1693, in _inlineCallbacks
result = context.run(
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/twisted/python/failure.py", line 518, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 54, in process_request
return (yield download_func(request=request, spider=spider))
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/twisted/internet/defer.py", line 892, in _runCallbacks
current.result = callback( # type: ignore[misc]
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/twisted/internet/endpoints.py", line 1022, in startConnectionAttempts
raise error.DNSLookupError(
twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: www.ohiorecruiters.org.
2024-06-02 21:13:06 [scrapy-playwright] INFO: Launching browser chromium
2024-06-02 21:13:07 [scrapy-playwright] INFO: Browser chromium launched
**===================================================== {'playwright': True, 'playwright_include_page': True, 'start_time': datetime.datetime(2024, 6, 2, 21, 13, 5, 727733), 'download_timeout': 180.0, 'download_slot': 'www.ohiorecruiters.org', 'playwright_context': 'default', 'playwright_page': <Page url='chrome-error://chromewebdata/'>}============================**
xxxxxxxxxxxxxxxxxxxxx <Page url='chrome-error://chromewebdata/'> is closed xxxxxxxxxxxxxxxxxxxxxxxxxxxx
2024-06-02 21:13:08 [scrapy.core.engine] INFO: Closing spider (finished)
2024-06-02 21:13:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 2,
'downloader/exception_type_count/playwright._impl._api_types.Error': 1,
'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 1,
'downloader/request_bytes': 612,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'elapsed_time_seconds': 8.041216,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2024, 6, 2, 21, 13, 8, 767138),
'log_count/ERROR': 1,
'log_count/INFO': 28,
'memusage/max': 175394816,
'memusage/startup': 175394816,
'playwright/context_count': 1,
'playwright/context_count/max_concurrent': 1,
'playwright/context_count/persistent/False': 1,
'playwright/context_count/remote/False': 1,
'playwright/page_count': 1,
'playwright/page_count/max_concurrent': 1,
'playwright/request_count': 1,
'playwright/request_count/method/GET': 1,
'playwright/request_count/navigation': 1,
'playwright/request_count/resource_type/document': 1,
"robotstxt/exception_count/<class 'twisted.internet.error.DNSLookupError'>": 1,
'robotstxt/request_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2024, 6, 2, 21, 13, 0, 725922)}
2024-06-02 21:13:08 [scrapy.core.engine] INFO: Spider closed (finished)
2024-06-02 21:13:08 [scrapy-playwright] INFO: Closing download handler
2024-06-02 21:13:08 [scrapy-playwright] INFO: Closing browser
2024-06-02 21:13:09 [scrapy-playwright] INFO: Closing download handler
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/'>
2024-06-02 21:13:11 [scrapy-playwright] INFO: Launching browser chromium
2024-06-02 21:13:11 [scrapy-playwright] INFO: Browser chromium launched
$$$$$$$$$$$$$$$$ able to access page <Page url='https://hhwf.org/'>
2024-06-02 21:13:13 [scrapy.core.engine] INFO: Closing spider (finished)
2024-06-02 21:13:13 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 6612,
'downloader/request_count': 22,
'downloader/request_method_count/GET': 22,
'downloader/response_bytes': 79598,
'downloader/response_count': 22,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/301': 21,
'elapsed_time_seconds': 12.589505,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2024, 6, 2, 21, 13, 13, 292754),
'item_scraped_count': 3,
'log_count/ERROR': 1,
'log_count/INFO': 50,
'memusage/max': 174043136,
'memusage/startup': 174043136,
'offsite/domains': 2,
'offsite/filtered': 31,
'playwright/context_count': 1,
'playwright/context_count/max_concurrent': 1,
'playwright/context_count/persistent/False': 1,
'playwright/context_count/remote/False': 1,
'playwright/page_count': 1,
'playwright/page_count/max_concurrent': 1,
'playwright/request_count': 71,
'playwright/request_count/method/GET': 70,
'playwright/request_count/method/POST': 1,
'playwright/request_count/navigation': 4,
'playwright/request_count/resource_type/document': 4,
'playwright/request_count/resource_type/fetch': 2,
'playwright/request_count/resource_type/font': 9,
'playwright/request_count/resource_type/image': 12,
'playwright/request_count/resource_type/other': 2,
'playwright/request_count/resource_type/ping': 1,
'playwright/request_count/resource_type/script': 29,
'playwright/request_count/resource_type/stylesheet': 12,
'playwright/response_count': 71,
'playwright/response_count/method/GET': 70,
'playwright/response_count/method/POST': 1,
'playwright/response_count/resource_type/document': 4,
'playwright/response_count/resource_type/fetch': 2,
'playwright/response_count/resource_type/font': 9,
'playwright/response_count/resource_type/image': 12,
'playwright/response_count/resource_type/other': 2,
'playwright/response_count/resource_type/ping': 1,
'playwright/response_count/resource_type/script': 29,
'playwright/response_count/resource_type/stylesheet': 12,
'request_depth_max': 1,
'response_received_count': 1,
'robotstxt/request_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2024, 6, 2, 21, 13, 0, 703249)}
2024-06-02 21:13:13 [scrapy.core.engine] INFO: Spider closed (finished)
2024-06-02 21:13:13 [scrapy-playwright] INFO: Closing download handler
2024-06-02 21:13:13 [scrapy-playwright] INFO: Closing browser
2024-06-02 21:13:13 [scrapy-playwright] INFO: Closing download handler
$$$$$$$$$$$$$$$$ able to access page <Page url='https://thekauaimarathon.com/'>
2024-06-02 21:13:14 [scrapy-playwright] INFO: Launching browser chromium
2024-06-02 21:13:14 [scrapy-playwright] INFO: Browser chromium launched
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/recent-news'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/support-revere'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/board'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/about-us'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/volunteer'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/cart'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/blog'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/about-us'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/recent-news'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.paypal.com/donate/?cmd=_s-xclick&hosted_button_id=9H2AWXUE4L3M8&source=url&ssrt=1717362812122'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/back-to-school'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/recent-news-1/2019-scholarship-awards'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/recent-news-1/revere-hall-of-fame-induction-2019'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/recent-news-1/community-health-and-wellness-night-2019'>
**===================================================== {'playwright': True, 'playwright_include_page': True, 'start_time': datetime.datetime(2024, 6, 2, 21, 13, 51, 897863), 'depth': 3}============================**
2024-06-02 21:13:52 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.reverefoundation.com/recent-news-1?author=51140743e4b099bd04eedd7d> (referer: https://www.reverefoundation.com/recent-news-1/2019-scholarship-awards)
Traceback (most recent call last):
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/twisted/internet/defer.py", line 1065, in adapt
extracted = result.result()
File "/home/ec2-user/SageMaker/grant-crawler/giboo/spiders/npos.py", line 50, in errback_close_page
page = failure.request.meta["playwright_page"]
KeyError: 'playwright_page'
$$$$$$$$$$$$$$$$ able to access page <Page url='https://thekauaimarathon.com/register/marathon-and-half-marathon/'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://thekauaimarathon.com/register/'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://thekauaimarathon.com/the-race/'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://thekauaimarathon.com/register/keiki-run/'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://thekauaimarathon.com/the-race/the-course/'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://thekauaimarathon.com/the-race/expo/'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://thekauaimarathon.com/the-race/race-activities/'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://thekauaimarathon.com/the-race/faqs/'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/recent-news-1/see-something-say-something-spring-2019'>
2024-06-02 21:14:00 [scrapy.extensions.logstats] INFO: Crawled 18 pages (at 18 pages/min), scraped 77 items (at 77 items/min)
2024-06-02 21:14:00 [scrapy.extensions.logstats] INFO: Crawled 10 pages (at 10 pages/min), scraped 12 items (at 12 items/min)
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/recent-news-1/tag/2019'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/recent-news-1/tag/produce'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://www.reverefoundation.com/recent-news-1/tag/protect'>
$$$$$$$$$$$$$$$$ able to access page <Page url='https://thekauaimarathon.com/photos-and-videos/'>
The failing requests have a Referer header, so they are not coming from your start_requests method. You haven't shared a full spider; I suspect those specific requests do not have playwright_include_page=True in their meta.
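For instance, a sketch of keeping the Playwright meta consistent between start requests and follow-ups (the helper name is illustrative, not from the spider in this thread):

```python
# Illustrative helper: build a fresh Playwright meta dict for every request, so
# follow-up requests yielded from callbacks get playwright_include_page too.
def playwright_meta():
    return {"playwright": True, "playwright_include_page": True}

# Usage in a callback (sketch):
#   yield scrapy.Request(href, callback=self.my_parse,
#                        meta=playwright_meta(), errback=self.errback_close_page)
```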
I'm closing this, if you're still having trouble and want to reopen the issue you need to include a Minimal, Reproducible Example as instructed in the Reporting issues section.
On further analysis I was able to pinpoint the exact error: requests forbidden by robots.txt are causing the crawler to freeze. Failed requests that are forbidden by robots.txt have no playwright_page associated with them, so I am not able to close them with page.close(), which is where my crawler freezes at page = failure.request.meta["playwright_page"].
Note: here is the minimal reproducible code.
import re
import scrapy
import logging
from bs4 import BeautifulSoup
from datetime import datetime
from urllib.parse import urlparse
from scrapy.linkextractors import LinkExtractor

emails_re = re.compile(r"\b([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})\b", re.IGNORECASE)

class GrantsSpider(scrapy.Spider):
    name = "npos"
    reported_links = []
    link_extractor = LinkExtractor(unique=True)
    npos = {}

    async def errback_close_page(self, failure):
        logging.error(f"Error processing {failure.request.url}: {repr(failure)}")
        page = failure.request.meta.get("playwright_page")
        if page:
            await page.close()
            await page.context.close()
            logging.info(f"Page {page} closed on error")

    def start_requests(self):
        url = 'https://www.guidestar.org/profile/26-1700710'
        npo = {"ein": "26-1700710", "name": "Example NPO", "type": "Nonprofit"}
        logging.info(f"### crawl: {url}")
        yield scrapy.Request(
            url, callback=self.my_parse, dont_filter=True,
            meta={"playwright": True, "playwright_include_page": True, 'dont_redirect': True},
            cb_kwargs={"npo": npo},
            errback=self.errback_close_page
        )

    async def my_parse(self, response, npo):
        page = response.meta["playwright_page"]
        self.reported_links.append(response.url)
        try:
            _ = response.text
        except AttributeError as exc:
            logging.debug(f"Skipping {response.url}: {exc}")
            await page.close()
            return
        body, match = self.is_page(response, None)
        for email in emails_re.findall(body):
            yield {
                "ein": npo["ein"],
                "name": npo["name"],
                "type": npo["type"],
                "msg": "link-email",
                "match": email,
                "text": email,
                "url": response.url,
                "timestamp": datetime.utcnow(),
            }
        for link in response.xpath("//a"):
            href = link.xpath("./@href").get()
            if not href or href.startswith("javascript:") or href.startswith("#"):
                continue
            if not href.startswith("http"):
                href = response.urljoin(href)
            if href not in self.reported_links:
                yield scrapy.Request(
                    href, callback=self.my_parse,
                    meta={"playwright": True, "playwright_include_page": True, 'dont_redirect': True},
                    cb_kwargs={"npo": npo},
                    errback=self.errback_close_page
                )
        await page.close()

    def is_page(self, response, re_expression):
        sel = scrapy.Selector(response)
        sel.xpath("//head").remove()
        sel.xpath("//header").remove()
        # sel.xpath("//footer").remove()
        sel.xpath("//navbar").remove()
        sel.xpath("//a").remove()
        body = sel.get()
        bs_doc = BeautifulSoup(body, features="lxml").get_text(strip=True, separator=" ")
        if not re_expression:
            return bs_doc, None
        if re_expression.search(bs_doc):
            matches = list(set(list(re.findall(re_expression, bs_doc))[0]))
            if "" in matches:
                matches.remove("")
            return bs_doc, matches
        return None, None
Sounds reasonable, from a quick look I think this is the Scrapy exception you're getting.
I'd say that's expected behavior though: the request doesn't reach the download handler, hence no page is created. You'll just need to check in your errback whether there is actually a page to close before attempting to close it. Furthermore, unless your actual spider is bigger and what you posted is a reduced example, I'd recommend you ask yourself if it's really necessary to pass playwright_include_page=True. You don't seem to be interacting with the page besides closing it, so you might be better off not passing playwright_include_page and allowing the handler to close the page on its own. See also these docs.
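A sketch of that simpler variant, assuming the callback never needs to touch the Page object directly:

```python
# Without "playwright_include_page", the download handler opens and closes the
# page itself, so the errback never needs meta["playwright_page"] at all.
meta = {"playwright": True}

# Usage (sketch):
#   yield scrapy.Request(url, callback=self.parse, meta=meta)
```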
The actual spider is much bigger; this is just a minimal version of it. The reason for setting playwright_include_page=True is that it gives us more control over the closing of pages, so we can better manage memory consumption. Anyway, I was able to solve this by changing my errback method a bit:
from scrapy.exceptions import CloseSpider

async def errback_close_page(self, failure):
    self.logger.error(f'Error processing page: {repr(failure)}')
    if "playwright_page" in failure.request.meta:
        page = failure.request.meta["playwright_page"]
        if page:
            await page.close()
            self.logger.info(f"Closed page due to error: {page}")
    raise CloseSpider(reason='Forbidden by robots.txt')
Hi @elacuesta, even after making the changes I am still facing the issue with some links, where my spider gets stuck even after it is closed:
{'BOT_NAME': 'test',
'CLOSESPIDER_ITEMCOUNT': 300,
'CLOSESPIDER_PAGECOUNT': 300,
'CLOSESPIDER_TIMEOUT': 300,
'CONCURRENT_REQUESTS': 8,
'CONCURRENT_REQUESTS_PER_DOMAIN': 4,
'CONCURRENT_REQUESTS_PER_IP': 4,
'COOKIES_ENABLED': False,
'DEPTH_LIMIT': 2,
'DEPTH_PRIORITY': 1,
'DNS_TIMEOUT': 3,
'DOWNLOAD_DELAY': 0.3,
'DOWNLOAD_TIMEOUT': 120,
'FEED_EXPORT_ENCODING': 'utf-8',
'LOG_LEVEL': 'INFO',
'NEWSPIDER_MODULE': 'myConfig.spiders',
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'RETRY_ENABLED': False,
'ROBOTSTXT_OBEY': True,
'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue',
'SPIDER_MODULES': ['myConfig.spiders'],
'TELNETCONSOLE_ENABLED': False,
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor',
'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
'(KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'}
2024-06-26 20:30:37 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.closespider.CloseSpider',
'scrapy.extensions.logstats.LogStats']
2024-06-26 20:30:37 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'myConfig.middlewares.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware']
2024-06-26 20:30:37 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-06-26 20:30:37 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-06-26 20:30:37 [scrapy.core.engine] INFO: Spider opened
2024-06-26 20:30:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
#### Run crawlers 25040-25040
2024-06-26 20:30:38 [scrapy-playwright] INFO: Starting download handler
2024-06-26 20:30:38 [scrapy-playwright] INFO: Starting download handler
2024-06-26 20:30:43 [root] INFO: ### crawl: http://www.trinityschooloftexas.com
2024-06-26 20:30:43 [scrapy-playwright] INFO: Launching browser chromium
2024-06-26 20:30:43 [scrapy-playwright] INFO: Browser chromium launched
2024-06-26 20:30:49 [scrapy.core.scraper] ERROR: Spider error processing <GET http://www.trinityschooloftexas.com> (referer: None)
2024-06-26 20:30:49 [scrapy-playwright] INFO: Launching browser chromium
2024-06-26 20:30:50 [scrapy-playwright] INFO: Browser chromium launched
2024-06-26 20:31:38 [scrapy.extensions.logstats] INFO: Crawled 3 pages (at 3 pages/min), scraped 0 items (at 0 items/min)
2024-06-26 20:32:38 [scrapy.extensions.logstats] INFO: Crawled 3 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-06-26 20:33:38 [scrapy.extensions.logstats] INFO: Crawled 3 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-06-26 20:34:38 [scrapy.extensions.logstats] INFO: Crawled 3 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-06-26 20:35:38 [scrapy.core.engine] INFO: Closing spider (closespider_timeout)
2024-06-26 20:35:38 [scrapy.extensions.logstats] INFO: Crawled 3 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-06-26 20:36:38 [scrapy.extensions.logstats] INFO: Crawled 3 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-06-26 20:37:38 [scrapy.extensions.logstats] INFO: Crawled 3 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
I am linking the GitHub repo with minimal code which will help you replicate the issue; any help would be appreciated. Thanks. https://github.com/nekender/Playwright_Issue/
Hi @elacuesta, did you get the time to have a look?
It is not clear to me what the issue is in your last report. Furthermore, what you shared is not minimal: there are multiple files, middlewares, regex processing, selector manipulation, HTML parsing with BeautifulSoup, etc., none of which are likely to be related to what you report. It is often the case that developers understand and solve issues on their own in the process of distilling a program down to the bare minimum necessary to reproduce a problem. So far this issue doesn't contain a reproducible bug report; it has always been a support matter, and there are better resources to handle those (e.g. the scrapy-playwright tag at StackOverflow).
Can anyone please explain why I am getting this error and how I can fix it?
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/twisted/internet/defer.py", line 1065, in adapt
    extracted = result.result()
  File "/home/ec2-user/SageMaker/xx", line 50, in errback_close_page
    page = failure.request.meta["playwright_page"]
KeyError: 'playwright_page'