Closed: EthanZ1996 closed this issue 2 years ago.
I received the same message when trying to get to the next few pages of a URL. I'll provide some further information on my approach here:
I'm building a scraper that follows the link for each post, scrapes the information from the linked page, and then moves on to the next page of results, repeating the process.
import hashlib
import logging
from pathlib import Path

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy_playwright.page import PageCoroutine
cookies = {
'VISITOR_ID': '3c553849d1dc612f60515f04d9316813',
'INEU': '1',
'PJBJOBSEEKER': '1',
'LOCATIONJOBTYPEID': '3079',
'AnonymousUser': 'MemberId=c3e94bc3-3fcf-423d-bc25-e5a5818cd2b9&IsAnonymous=True',
'visitorid': '46315422-d835-4a38-b428-4b9c5d6243d3',
's_fid': '6468EA5E39AF374B-2F7C971BB196D965',
'sc_vid': '7c12948068d2cb92c1f1622aeaabc62d',
'listing_page__qualtrics': 'empty',
'SsaSessionCookie': 'fea6276c-cee9-43b3-9d57-9f00d6bcd32b',
's_cc': 'true',
'SessionCookie': '1d7806bc-4b79-47e9-839d-2d94ec224abb',
'FreshUserTemp': 'https://www.jobsite.co.uk/',
'bm_mi': '6BF6AA183A047F87BAC664C92ACA8E41~1Fku4TDwEBxz2+fwhUGUWjUhP3vaQED08Ala3VmmARyewb9/OjQUmvPEWw88MUA7USOzt+0MSpdyPmY/3N+iY08InyOy4DnNHgTq88AWwBigf1XhufLstD/eUhUJBgXQRSa1rVlO5SB5mlkhezcDRmv8bL+Gt4NZdsjVC4ZlVc3ptkbKY9cBB65yW2tyjZBLtxsQnz/rFJXo4a9PTKOvF/Betnb8S/XQrpNDsXOojdhtQrrU9V6XSziX+tHXT6xj1osB8XQtm0VGC7L6+4+bgQ==',
'gpv_pn': '%2FJobSearch%2FResults.aspx',
'ak_bmsc': '77506B6768E0463D238EEE24AE5B3A72~000000000000000000000000000000~YAAQFsITAnaZtJ99AQAA4LdCLw4PU4xFjE3/FbxxIG7pSjNqX9TClutWaS1MLKKy/9hAM9d6bcEN5Mr9Fbb8+1Jy3rrCsFO5TvxstcVAjaGbbvDCF/mXxeqJQAU1h/cvrZEH68FZyDuslnE+Ae7DuCs1QmNkNP6+0dvA4GT+/MENayQQk8szCo8ch3IfCK1j5/JL+jjbb04pmnpibV3XvUcLeqTJMY1IG9PlTuBIFWF8gXREI+ug2bb8pL+r7T1v1s9gVmfo633B0BoVcXIfWcDgtyFJjFNVayz2lHxUdtnInaWvi1ubzsjQ7cfUDdHTorHsJ0rP1RXB0utZ80GIBNbGdAzd1jkWy9BMIqdIcbBXM4+rCf3fbPw+qui+0Sr4RIxM5N41mvrOQ6W8s9bPR7GySeJr/2HGSmxTjf+4QDVY',
'TJG-Engage': '1',
'CONSENTMGR': 'c1:0%7Cc2:0%7Cc3:0%7Cc4:0%7Cc5:0%7Cc6:0%7Cc7:0%7Cc8:0%7Cc9:1%7Cc10:0%7Cc11:0%7Cc12:0%7Cc13:0%7Cc14:0%7Cc15:0%7Cts:1641470409597%7Cconsent:true',
'utag_main': 'v_id:017e24e57d970023786b817ac51005079001e071009e2_sn:16$_se:5$_ss:0$_st:1641472209641$ses_id:1641470390747%3Bexp-session$_pn:3%3Bexp-session$PersistedFreshUserValue:0.1%3Bexp-session$PersistedClusterId:OTHER--9999%3Bexp-session',
's_ppvl': '%2FJobSearch%2FResults.aspx%2C13%2C13%2C741%2C409%2C741%2C1600%2C900%2C2%2CL',
's_ppv': '%2FJobSearch%2FResults.aspx%2C100%2C13%2C6616%2C423%2C741%2C1600%2C900%2C2%2CL',
's_sq': 'stepstone-jobsite-uk%3D%2526c.%2526a.%2526activitymap.%2526page%253D%25252FJobSearch%25252FResults.aspx%2526link%253DNext%2526region%253Dapp-unifiedResultlist-db2486f4-fb7d-469f-8cfd-f31a3eafb692%2526pageIDType%253D1%2526.activitymap%2526.a%2526.c%2526pid%253D%25252FJobSearch%25252FResults.aspx%2526pidt%253D1%2526oid%253Dhttps%25253A%25252F%25252Fwww.jobsite.co.uk%25252Fjobs%25253Fpage%25253D3%252526action%25253Dpaging_next%2526ot%253DA',
'bm_sv': '4C178898519D2A4ADEBB840C0B682999~sanqWSDI/ZT0KWrdWhNRc7UtVtqAZ61oPSoLv/MnCD1e0a7vUTSzpggIj9dt/bN4nXEmOaM48hugBFRwdBveJlobrjEcMZ1gHS3S3KXYaHfZPjq6IIf8/Fs1QUlg0s7oLp6DsZbkAWWOnNQiI/uaq7XT7EHnd+n/46ra5jgwfhA=',
'_abck': '508823E0A454CEF8D6A48101DB66BDB8~0~YAAQFsITAjWdtJ99AQAAr21ELwcAj2rhoIgnvsHOkxQPuREHCA9mDMHsyk68FBhxQ0Jto+6FqaEHJkrrVEUGuYveQAjVJ7CGS+2ajmbcVkG/KIQn8ttCaGvn58jkwzpWm6Fjx4FsLBJyLsceRWSqw5rV2ezEeLrBd/ZToRMpdZop4yqixh5vquandn+h9ysqacaeHPO90VnvctIfvKTUvY5GrrHubGVMkD9/elxRI5whsBdH7ovATyGsLEgYx+e604lY2sQIahSvweclTI4Ud1hTQbSQTebWs52PiYdSU5wq9+YC/7Sr0JuQZCUMyGGqZgtXpfAdc9LDa8X3JfcdO25EZQHxsfEfT/pp7tjbxaXD/pgun9ozymRMy/hBuCj5/Bfln/LzAqOsdDv7q6WVerNr6qivHGDE0m2/~-1~-1~-1',
'bm_sz': '747D15CEB59AC2C0003BD8479C4BF482~YAAQFsITAjadtJ99AQAAr21ELw5vDH+lMq9NICfxNXHGiXcPcBSrWov2Hy8Y0wgN/OAL7NJWfJ7Lkum/OqG3WNj9/+e8oJhNRQ96ksn+zk0N0gNnoPhUv46am0wktHih1PPfYRlqdPSQSdgE92eHwG3CsFaSeRROKu/1q89aNDH4+JBUk/TDdTmeBqsvJffzvP0S1gAv54dOecx0z2LSW6PEj0e0VtWqmBjFSQxCqH8LZ4r7TwqDpxAKArzWGDMlqR/xZcWvAm8ijUTG+mIuF3N7aBEDGdB90wdyaJGt2CGP3VinhBNtxV7vT8ebY9oWu2rJ+UmugGgJ/dasQP8=~3424837~3354936',
'EntryUrl': '/jobs?page=3&action=paging_next',
'SearchResults': '96094529,96094530,96094528,96094527,96094526,96094525,96094524,96094522,96094521,96094520,96094519,96094517,96094518,96094514,96094513,96094509,96094510,96094511,96094508,96094507,96094506,96094503,96094502,96094500,96094499',
}
headers = {
'authority': 'www.jobsite.co.uk',
'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="96", "Google Chrome";v="96"',
'accept': 'application/json',
'content-type': 'application/json',
'sec-ch-ua-mobile': '?0',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
'sec-ch-ua-platform': '"macOS"',
'origin': 'https://www.jobsite.co.uk',
'sec-fetch-site': 'same-origin',
'sec-fetch-mode': 'cors',
'sec-fetch-dest': 'empty',
'referer': 'https://www.jobsite.co.uk/jobs?page=3&action=paging_next',
'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
}
class JobSpider(scrapy.Spider):
    name = 'job_pages'
    start_urls = ['https://www.jobsite.co.uk/jobs/Degree-Accounting-and-Finance']
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.2 Safari/605.1.15'
    }

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                dont_filter=True,
                meta=dict(
                    playwright=True,
                    playwright_include_page=True,
                    playwright_page_coroutines=[
                        # Wait until the results container has rendered.
                        PageCoroutine('wait_for_selector', 'div.row.job-results-row')
                    ],
                ),
            )

    def parse(self, response):
        # Follow the detail link of every job card on the results page.
        stuff = response.xpath("//div[@class='ResultsSectionContainer-sc-gdhf14-0 kteggz']/div[@class='Wrapper-sc-11673k2-0 gIBPSk']")
        for items in stuff:
            for jobs in items.xpath('//article//div//div[position() mod 7 = 6]/a//@href'):
                yield response.follow(
                    jobs,
                    callback=self.parse_jobs,
                    meta={
                        "playwright": True,
                        "playwright_include_page": True,
                    },
                )
        # Queue the next page of results, if any.
        next_page = response.xpath('(//div)[position() mod 5=3][83]/a[2]//@href').get()
        if next_page:
            yield scrapy.Request(
                url=next_page,
                callback=self.parse,
                meta=dict(
                    playwright=True,
                    playwright_include_page=True,
                    playwright_page_coroutines=[
                        PageCoroutine('wait_for_selector', 'div.row.job-results-row')
                    ],
                ),
            )

    async def parse_jobs(self, response):
        # Name the screenshot after the SHA-256 hash of the page URL.
        url_sha256 = hashlib.sha256(response.url.encode("utf-8")).hexdigest()
        page = response.meta["playwright_page"]
        await page.screenshot(
            path=Path(__file__).parent / "job_test" / f"{url_sha256}.png", full_page=True
        )
        await page.close()
        yield {
            "url": response.url,
            "title": response.xpath("//h1[@class='brand-font']//text()").get(),
            "price": response.xpath("//li[@class='salary icon']//div//text()").get(),
            "organisation": response.xpath("//a[@id='companyJobsLink']//text()").get(),
            "image": f"job_test/{url_sha256}.png",
        }


if __name__ == "__main__":
    process = CrawlerProcess(
        settings={
            "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
            "DOWNLOAD_HANDLERS": {
                "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
                "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            },
            "CONCURRENT_REQUESTS": 32,
            "CLOSESPIDER_ITEMCOUNT": 100,
            "FEED_URI": 'jobs.jl',
            "FEED_FORMAT": 'jsonlines',
        }
    )
    process.crawl(JobSpider)
    logging.getLogger("scrapy.core.engine").setLevel(logging.WARNING)
    logging.getLogger("scrapy.core.scraper").setLevel(logging.WARNING)
    process.start()
Here's the error output:
result = next(iter(done)).result()
playwright._impl._api_types.Error: Target page, context or browser has been closed
2022-01-06 15:23:14 [scrapy-playwright] INFO: Closing browser
2022-01-06 15:23:14 [scrapy-playwright] INFO: Closing browser
2022-01-06 15:23:14 [scrapy-playwright] DEBUG: Browser context closed: 'default'
Please provide a minimal, reproducible example (the provided code sample is hardly minimal).
@elacuesta It's been a while, but I remember clearly that the error was in my script rather than in scrapy_playwright. It's a long shot, but I presume the author of the post has a similar issue, in that their script may be the problem.
Upon closer inspection this seems like a duplicate of #15, which I'm aiming to solve at #74. Feel free to reopen with more information if that's not the case.
I would suggest defining a request errback or a spider middleware with a process_spider_exception method to recover from these errors.
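For instance, an errback can retrieve the Playwright page from the failed request's meta and close it before giving up. A minimal sketch, assuming playwright_include_page is set on the request (the URL is a placeholder):

import scrapy

class ErrbackSpider(scrapy.Spider):
    name = "errback_example"

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.org",  # placeholder
            callback=self.parse,
            errback=self.errback_close_page,
            meta={"playwright": True, "playwright_include_page": True},
        )

    def parse(self, response):
        pass

    async def errback_close_page(self, failure):
        # Close the page exposed via playwright_include_page so a failed
        # request does not leak an open Playwright page.
        page = failure.request.meta["playwright_page"]
        await page.close()

The spider-middleware route would instead catch the exception as it propagates out of the callback. A hypothetical middleware (the class name is made up, and it would need to be enabled in SPIDER_MIDDLEWARES) might look like:

class RecoverPlaywrightErrorsMiddleware:
    def process_spider_exception(self, response, exception, spider):
        # Returning an iterable (empty here) tells Scrapy the exception has
        # been handled; returning None would let it propagate as usual.
        spider.logger.warning("Recovered from %r at %s", exception, response.url)
        return []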
Hi elacuesta,
I use your handler in my Scrapy project and it runs well and crawls the information I need. However, an error sometimes occurs before item and pipeline processing. Here is an example:
Sometimes this error occurs 5 or 6 times per run, but I have also had runs with no errors at all. The only difference among these errors is the task number: Task-180, Task-181, Task-182, and so on. I guess the error relates to coroutines or asyncio, but I am not familiar with them. Do you know what is going on? Do I need to change any settings? Thanks! By the way, I am running Ubuntu 20.04 in a VM on Windows 10.
Regards, Ethan