spinlud / py-linkedin-jobs-scraper

MIT License
365 stars · 101 forks

Is there a way to not scrape jobs from 'Worldwide'? #33

Open halakhah opened 2 years ago

halakhah commented 2 years ago

Hiya,

I need the job titles I scrape to only be in English. When I run the program, I get results from Worldwide, which gives me job titles that are not in English.

I'm not sure how to tell the scraper not to search in the Worldwide region.

Here's my code:

import logging
import csv
from linkedin_jobs_scraper import LinkedinScraper
from linkedin_jobs_scraper.events import Events, EventData, EventMetrics
from linkedin_jobs_scraper.query import Query, QueryOptions, QueryFilters
from linkedin_jobs_scraper.filters import RelevanceFilters, TimeFilters, TypeFilters, ExperienceLevelFilters, RemoteFilters

# Change root logger level (default is WARN)
logging.basicConfig(level = logging.INFO)

job_data = []

# Fired once for each successfully processed job
def on_data(data: EventData):
    job_data.append([data.title, data.company])

# Fired once for each page (25 jobs)
def on_metrics(metrics: EventMetrics):
    print('[ON_METRICS]', str(metrics))

def on_error(error):
    print('[ON_ERROR]', error)

def on_end():
    print('[ON_END]')

scraper = LinkedinScraper(
    chrome_executable_path='/Users/voi/chromedriver', # Custom Chrome executable path (e.g. /foo/bar/bin/chromedriver) 
    chrome_options=None,  # Custom Chrome options here
    headless=True,  # Overrides headless mode only if chrome_options is None
    max_workers=2,  # How many threads will be spawned to run queries concurrently (one Chrome driver for each thread)
    slow_mo=3,  # Slow down the scraper to avoid 'Too many requests 429' errors (in seconds)
    page_load_timeout=25  # Page load timeout (in seconds)  
)

# Add event listeners (on_metrics must be registered to fire)
scraper.on(Events.DATA, on_data)
scraper.on(Events.METRICS, on_metrics)
scraper.on(Events.ERROR, on_error)
scraper.on(Events.END, on_end)

queries = [
    Query(
        options=QueryOptions(   
            optimize = True,  
            limit=2000  # Limit the number of jobs to scrape.            
        )
    ),
    Query(
        query='',
        options=QueryOptions(
            locations=['United States', 'California', 'Texas','New York', 'Michigan'],            
            apply_link = False,  # Try to extract apply link (easy applies are skipped). Default to False.
            limit=500,
            filters=QueryFilters(          

                relevance=RelevanceFilters.RECENT,
                time=TimeFilters.ANY,
                type=[TypeFilters.FULL_TIME, TypeFilters.INTERNSHIP, TypeFilters.PART_TIME],
                experience=None,                
            )
        )
    ),
]

scraper.run(queries)
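
As a stopgap until the location question is settled, non-English titles could be filtered inside the on_data callback. The sketch below uses a crude stdlib-only heuristic (pure-ASCII means English); looks_english is a name introduced here for illustration, not part of the library:

```python
# Standalone sketch of an ASCII-based title filter (hypothetical helper,
# not part of linkedin_jobs_scraper).
job_data = []

def looks_english(title: str) -> bool:
    # Crude heuristic: treat pure-ASCII titles as English. This misses
    # English titles with accented characters and admits ASCII text in
    # other languages; a real language detector would be more robust.
    return title.isascii()

def on_data(data):
    # Same shape as the callback above, but skips non-ASCII titles
    if looks_english(data.title):
        job_data.append([data.title, data.company])
```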
spinlud commented 2 years ago

Hi, Worldwide is the default search location when you don't provide any explicit location in the query options. The query

Query(
        options=QueryOptions(   
            optimize = True,  
            limit=2000  # Limit the number of jobs to scrape.            
        )
    )

does not specify any location, so the search will be done worldwide. Usually jobs posted in the USA or UK are written in English, but this is not always the case (e.g. a Chinese job in New York could be written in Chinese...).