Open PanzerFlow opened 2 years ago
Hi, thanks for the feedback!
Could you please try again with the latest version? I have tried to improve the stability of the apply_link extraction logic.
Please also set the optimize option to False, since it can cause instability when jobs are loaded dynamically. This is the code I used:
import logging
from linkedin_jobs_scraper import LinkedinScraper
from linkedin_jobs_scraper.events import Events, EventData
from linkedin_jobs_scraper.query import Query, QueryOptions, QueryFilters
from linkedin_jobs_scraper.filters import RelevanceFilters, TimeFilters

logging.basicConfig(level=logging.INFO)

scraper = LinkedinScraper(
    # chrome_executable_path=None,
    # chrome_options=None,
    headless=True,
    max_workers=1,
    slow_mo=0.5,
    page_load_timeout=20,
)


def on_data(data: EventData):
    print('\t', data.job_id, data.apply_link)


# Register the handler; without this call on_data is never invoked and nothing is printed
scraper.on(Events.DATA, on_data)

query1 = Query(
    query='Data Engineer',
    options=QueryOptions(
        locations=['Canada'],
        optimize=False,  # <--- Set this to false
        apply_link=True,
        limit=1000,  # Most days the posting number is around 700.
        filters=QueryFilters(
            relevance=RelevanceFilters.RECENT,
            time=TimeFilters.DAY,
        )
    )
)

scraper.run([query1])
This is the result:
python v3.9.12
ChromeDriver 102.0.5005.61
I am trying to scrape the jobs from the link below each day. There should be around 700 jobs per day, and I use this link to check the results:
Data Engineer in Canada
Here is my query parameter setup. I am using an authenticated session with max_workers=1 and slow_mo=0.5.
Query(
    query='Data Engineer',
    options=QueryOptions(
        locations=['Canada'],
        optimize=True,
        apply_link=True,  # Try to extract apply link (slower because it needs to open a new tab for each job). Defaults to False
        limit=1000,  # Most days the posting number is around 700.
        filters=QueryFilters(
            relevance=RelevanceFilters.RECENT,
            time=TimeFilters.DAY,
        )
    )
)
The scraper stops at around the 100-200 job mark with the error below:
[ON_ERROR] Message: javascript error: Cannot read properties of undefined (reading 'querySelector')
(Session info: headless chrome=100.0.4896.127)
After a while it just reports that there are no more jobs.
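To pin down where a run like this stops, it can help to count unique jobs and record how many had arrived when each error fired. The following is only a sketch: `JobCollector` is a hypothetical helper, not part of linkedin_jobs_scraper, and it reads only the `job_id` and `apply_link` attributes that appear in this thread.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class JobCollector:
    """Hypothetical helper: collects jobs and errors so a run can be inspected afterwards."""
    expected: int  # rough number of postings expected, e.g. 700
    jobs: Dict[str, Optional[str]] = field(default_factory=dict)  # job_id -> apply_link
    errors: List[str] = field(default_factory=list)

    def on_data(self, data: Any) -> None:
        # Keyed by job_id, so a job delivered twice is only counted once.
        self.jobs[str(data.job_id)] = getattr(data, 'apply_link', None)

    def on_error(self, error: Any) -> None:
        # Record how many unique jobs had arrived when the error happened.
        self.errors.append(f'after {len(self.jobs)} jobs: {error}')

    def summary(self) -> str:
        missing_links = sum(1 for link in self.jobs.values() if not link)
        return (f'{len(self.jobs)}/{self.expected} unique jobs, '
                f'{missing_links} without apply link, {len(self.errors)} errors')


# Assumed wiring, mirroring the scraper.on(Events.DATA, ...) call used in this thread
# (Events.ERROR is assumed to be the error event of the library):
#   collector = JobCollector(expected=700)
#   scraper.on(Events.DATA, collector.on_data)
#   scraper.on(Events.ERROR, collector.on_error)
#   scraper.run([query1])
#   print(collector.summary())
```

Comparing the summary from a run with optimize=True against one with optimize=False should show whether the early stop is tied to that option.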