spinlud / py-linkedin-jobs-scraper

MIT License
307 stars 84 forks

Connection pool error #37

Open ErpmeDerp opened 1 year ago

ErpmeDerp commented 1 year ago

Hi all, first of all, hats off for this piece of code. Very useful.

I am getting the following error while running the authenticated version (the anonymous version seems to run fine for the results that are viewable).

INFO:li:scraper:('[data engineer][european union]', 'Opening https://www.linkedin.com/jobs/search?keywords=data+engineer&location=european+union&sortBy=DD&f_TPR=r2592000&f_JT=F&f_E=1&start=0')
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: 127.0.0.1. Connection pool size: 1
WARNING:li:scraper:('[data engineer][european union]', 'Error in response', 'https://www.linkedin.com/jobs/search/?currentJobId=3159886155&f_E=1&f_JT=F&f_TPR=r2592000&keywords=data%20engineer&location=european%20union&sortBy=DD', 'request_id=63028.200 status=404 type=XHR mime_type=application/vnd.linkedin.normalized+json+2.1 url=https://www.linkedin.com/voyager/api/voyagerMessagingDashAwayStatus')
WARNING:li:scraper:('[data engineer][european union]', 'No jobs found, skip')
[ON_END]

Does anyone know what the issue could be here? This is the code I am using:

import logging
import csv
from sys import maxsize
from linkedin_jobs_scraper import LinkedinScraper
from linkedin_jobs_scraper.events import Events, EventData, EventMetrics
from linkedin_jobs_scraper.query import Query, QueryOptions, QueryFilters
from linkedin_jobs_scraper.filters import RelevanceFilters, TimeFilters, TypeFilters, ExperienceLevelFilters, RemoteFilters

# Change root logger level (default is WARN)
logging.basicConfig(level = logging.INFO)

job_data = []

# Fired once for each successfully processed job
def on_data(data: EventData):
    job_data.append([data.title, data.company, data.place, data.date, data.link])

# Fired once for each page (25 jobs)
def on_metrics(metrics: EventMetrics):
    print('[ON_METRICS]', str(metrics))

def on_error(error):
    print('[ON_ERROR]', error)

def on_end():
    print('[ON_END]')

scraper = LinkedinScraper(
    chrome_options=None,  # You can pass your custom Chrome options here
    headless=False,
    max_workers=1,  # How many threads will be spawned to run queries concurrently (one Chrome driver for each thread)
    slow_mo=2,  # Slow down the scraper to avoid 'Too many requests (429)' errors
    page_load_timeout=25  # Page load timeout (in seconds)
)

# Add event listeners
scraper.on(Events.DATA, on_data)
scraper.on(Events.METRICS, on_metrics)
scraper.on(Events.ERROR, on_error)
scraper.on(Events.END, on_end)

queries = [
    Query(
        query='data engineer',
        options=QueryOptions(
            locations=['european union'],
            apply_link=False, 
            optimize=False,
            limit=100,
            filters=QueryFilters(
                relevance=RelevanceFilters.RECENT,
                time=TimeFilters.MONTH,
                type=[TypeFilters.FULL_TIME],
                experience=[ExperienceLevelFilters.INTERNSHIP],
            )
        )
    ),
]

scraper.run(queries)

fields = ['Job', 'Company', 'Place', 'Date', 'Link']
# Each entry in job_data is already a [title, company, place, date, link] row
rows = list(job_data)

with open('jobs_data.csv', 'w', newline='') as f:
    write = csv.writer(f)

    write.writerow(fields)
    write.writerows(rows)
Joko75 commented 1 year ago

Same error here with an authenticated session. Running it with headless=False I can see the jobs on the page, but for some reason I keep getting the "No jobs found" error. It was running properly until yesterday evening (08/08/2022).

VincentChanLivAway commented 1 year ago

Same error here. Have you guys tried using another account's cookie? I did. I got correct job results with exactly the same URL.

I think LinkedIn just applied some anti-scraper code to block certain sessions, including yours and mine. But they only block the sessions from Selenium, not from the actual Chrome browser. (Because the same blocked account can see jobs when visiting the URL in the browser.)

So perhaps someone can fix this by simulating the exact same browser behavior from Selenium?

leonpawelzik commented 1 year ago

Same for me. As OP said, thank you for the code. It's a game changer for my job search. Anonymous mode works somewhat, but once I launch an authenticated session, it successfully logs in, and then, once the results are visible, the bot times out.

spinlud commented 1 year ago

It seems LinkedIn has changed the CSS class of one of their HTML elements. Try the latest version and see if it fixes the "No jobs found, skip" issue.

Joko75 commented 1 year ago

Seems to work fine now, thanks @spinlud !

PARODBE commented 1 year ago

In my case the session cookie is reported as invalid??? I don't understand:

image

I select li_at....
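For reference, a minimal sketch of how the cookie is usually supplied, assuming the library reads the li_at value from the LI_AT_COOKIE environment variable before the scraper starts (the value below is a placeholder, not a real cookie):

```python
import os

# Placeholder; paste your own li_at cookie copied from the browser's
# developer tools (Application tab > Cookies > www.linkedin.com).
os.environ['LI_AT_COOKIE'] = 'your-li_at-cookie-value'
```

If the cookie is reported as invalid, it may have expired or been rotated, so copying a fresh value from a logged-in browser session is worth trying.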