spinlud / py-linkedin-jobs-scraper

MIT License
307 stars 84 forks

Scraper randomly cuts off and misses records #28

Open PanzerFlow opened 2 years ago

PanzerFlow commented 2 years ago

I am currently trying to scrape jobs from the link below each day. There should be around 700 jobs per day, and this is the search I use to check the results.

Data Engineer in Canada

Here is my query parameter setup. I am using an authenticated session with max_workers=1 and slow_mo=0.5.

Query(
    query='Data Engineer',
    options=QueryOptions(
        locations=['Canada'],
        optimize=True,
        apply_link=True,  # Try to extract apply link (slower because it needs to open a new tab for each job). Defaults to False
        limit=1000,  # Most days the posting number is around 700.
        filters=QueryFilters(
            relevance=RelevanceFilters.RECENT,
            time=TimeFilters.DAY,
        )
    )
)

The scraper stops around the 100-200 mark with the error below:

[ON_ERROR] Message: javascript error: Cannot read properties of undefined (reading 'querySelector') (Session info: headless chrome=100.0.4896.127)

After a while it just says there are no more jobs.
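One workaround while the cutoff persists is to run the same query several times and merge the partial results, deduplicating by job_id. A minimal sketch (the merge helper and the dict shape of the records are hypothetical, not part of the library; in practice the records would come from the scraper's data callback):

```python
def merge_runs(*runs):
    """Merge partial scrape runs, deduplicating by job_id.

    Each run is a list of dicts with at least a 'job_id' key
    (hypothetical shape, mirroring EventData.job_id).
    The first record seen for a given job_id wins.
    """
    seen = {}
    for run in runs:
        for record in run:
            seen.setdefault(record['job_id'], record)
    return list(seen.values())

# Two partial runs that overlap on job 'b':
run1 = [{'job_id': 'a', 'title': 'DE 1'}, {'job_id': 'b', 'title': 'DE 2'}]
run2 = [{'job_id': 'b', 'title': 'DE 2'}, {'job_id': 'c', 'title': 'DE 3'}]
merged = merge_runs(run1, run2)
print(len(merged))  # 3 unique jobs across both partial runs
```

This does not fix the underlying error, but several 100-200 record runs merged this way should cover more of the ~700 daily postings than a single truncated run.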

spinlud commented 2 years ago

Hi, thanks for the feedback! Can you please try again with the latest version? I have tried to improve the stability of the apply_link extraction logic. Please also set the optimize option to False, since it can cause instability in dynamic jobs loading. This is the code I have used:

import logging
from linkedin_jobs_scraper import LinkedinScraper
from linkedin_jobs_scraper.events import Events, EventData
from linkedin_jobs_scraper.query import Query, QueryOptions, QueryFilters
from linkedin_jobs_scraper.filters import RelevanceFilters, TimeFilters

logging.basicConfig(level=logging.INFO)

scraper = LinkedinScraper(
    # chrome_executable_path=None,
    # chrome_options=None,
    headless=True,
    max_workers=1,
    slow_mo=0.5,
    page_load_timeout=20,    
)

def on_data(data: EventData):
    print('\t', data.job_id, data.apply_link)

scraper.on(Events.DATA, on_data)

query1 = Query(
    query='Data Engineer',
    options=QueryOptions(
        locations=['Canada'],
        optimize=False,  # <--- Set this to false
        apply_link=True,
        limit=1000,  # Most days the posting number is around 700.
        filters=QueryFilters(
            relevance=RelevanceFilters.RECENT,
            time=TimeFilters.DAY,
        )
    )
)

scraper.run([query1])

This is the result:

(screenshot of the scraper output, showing job IDs and apply links)

Python 3.9.12
ChromeDriver 102.0.5005.61