joeyism / linkedin_scraper

A library that scrapes Linkedin for user data
GNU General Public License v3.0

Object cleanup #71

Open ronimosh opened 3 years ago

ronimosh commented 3 years ago

I am trying to use the scraper to scrape multiple person profiles in succession. I get the first profile right, but the second profile's data seems to be ADDED to the first one's instead of arriving in a clean object. I tried deleting the person object between iterations, but I still get the second profile's data on top of the first's. For instance, suppose profile 1 had experience at companies A, B and C, and profile 2 had experience at D and E; after the second iteration I get the history as A, B, C, D, E instead of just D and E. The same goes for most other fields.


from linkedin_scraper import Person, actions
from selenium import webdriver

driver = webdriver.Chrome()
actions.login(driver, email, password)

while whatever:  # loop over the input file, one profile per line
    line = f.readline()
    person_id, person_url = line.strip().split(" ")

    person = Person(person_url, driver=driver, scrape=False)
    person.scrape(close_on_complete=False)

    outfile_name = "linkedin_data_" + str(person_id)
    outfile = open(outfile_name, "w")

    outfile.write("--------------\nAbout:\n")
    for ab in person.about:
        outfile.write(repr(ab) + " ")

    outfile.write("--------------\nExperience\n")
    for i, xp in enumerate(person.experiences):
        outfile.write(str(i) + ") " + repr(xp) + "\n")

    del person
    outfile.close()

Any ideas why the person object in the second iteration "includes" the first one? How can I resolve this?

ronimosh commented 3 years ago

Managed to overcome this by "hard" initialization of the parameters when initializing the Person class. Not sure why this works, since the defaults should have done the same trick.
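This is consistent with Python's mutable default argument pitfall: if the Person class declares list defaults such as experiences=[] in its signature (an assumption about the library's internals here), those default lists are created once and shared by every instance that doesn't pass its own. A minimal sketch of the symptom:

# Sketch of Python's mutable-default-argument pitfall, which would produce
# exactly this behavior. Class and method names are illustrative, not the
# library's actual code.
class Profile:
    def __init__(self, experiences=[]):    # default list is created ONCE
        self.experiences = experiences     # every instance shares that list

    def scrape(self, found):
        self.experiences.extend(found)     # mutates the shared default

p1 = Profile()
p1.scrape(["A", "B", "C"])
p2 = Profile()                             # no args -> same shared default
p2.scrape(["D", "E"])
print(p2.experiences)                      # ['A', 'B', 'C', 'D', 'E'], not ['D', 'E']
print(p1.experiences is p2.experiences)    # True

Passing fresh lists explicitly, as in the "hard" initialization, gives each instance its own objects, which would explain why the workaround works.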

joeyism commented 3 years ago

I don't know why that happens. If you find a fix for it, feel free to submit an MR
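If the cause is the mutable-default pitfall sketched above, the conventional fix such an MR would make is a None sentinel per parameter; a sketch under that assumption (an illustrative subset of the signature, not the library's actual code):

# Replace each mutable default with None and allocate a fresh list
# per instance inside __init__.
class Person:
    def __init__(self, linkedin_url=None, experiences=None, educations=None):
        self.linkedin_url = linkedin_url
        self.experiences = experiences if experiences is not None else []
        self.educations = educations if educations is not None else []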

lukaskoeni commented 3 years ago

Managed to overcome this by "hard" initialization of the parameters when initializing the Person class, Not sure why this works since the defaults should have done the same trick.

Hi there, I have the same issue right now. Even after I actively delete the person instance, I end up getting the prior scraping results in the following loop iteration, as described here.

joeyism commented 3 years ago

@lukaskoeni can you give me the code that you're running? I'll use that as a reference to try and reproduce the error

lukaskoeni commented 3 years ago

@joeyism Here is a boiled-down version of my code:

from linkedin_scraper import Person, actions
from selenium import webdriver
from selenium.webdriver.common.by import By    
from selenium.webdriver.support.ui import WebDriverWait    
from selenium.webdriver.support import expected_conditions as EC    
from selenium.common.exceptions import TimeoutException 
import time

## load chrome
driver = webdriver.Chrome('path/to/driver')
actions.login(driver, "mail", "pswd")  # if email and password aren't given, it'll prompt in the terminal

## load list of urls

url_list = ["https://www.linkedin.com/in/sebastian-lewis-4a81032b/", "https://www.linkedin.com/in/mina-samaan-95927886"]

## create empty dict for results
url_edu_dic = dict()

## start loop for scraping
for url in url_list:
    ## create empty list for universities
    edu_li = []
    print("Now scraping the following URL:", url)
    try:
        person = Person(url, driver=driver, scrape=False)
        person.scrape(close_on_complete=False)
        for edu in person.educations:
            edu_li.append(edu.institution_name)
    except Exception:
        print("Something went wrong")
        edu_li = ["Error at scraping!"]
    else:
        print("Nothing went wrong")
    print(edu_li)
    ## write scraped data to url_edu_dic
    url_edu_dic[url] = edu_li
    ## delete current instance of "person"
    del person
    del edu_li
    print("Short break: 60 sec before the next person.")
    time.sleep(60)
    print("Continue")

If I hard-set educations=[] the code works as expected; if I leave it out, it doesn't, even though I delete the object after every iteration.
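Spelled out as a call, the workaround is to pass fresh empty lists for every field (keyword names taken from ronimosh's snippet further down; assuming they match the actual Person signature):

# Pass fresh, empty containers explicitly so nothing is shared between runs.
person = Person(url, about=[], experiences=[], educations=[], interests=[],
                accomplishments=[], contacts=[], driver=driver, scrape=False)
person.scrape(close_on_complete=False)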

joeyism commented 3 years ago

this is my modified version of your code

import os
from linkedin_scraper import Person, actions
from selenium import webdriver
from selenium.webdriver.common.by import By    
from selenium.webdriver.support.ui import WebDriverWait    
from selenium.webdriver.support import expected_conditions as EC    
from selenium.common.exceptions import TimeoutException 
import time

driver = webdriver.Chrome('./chromedriver')
actions.login(driver, os.getenv("LINKEDIN_USER"), os.getenv("LINKEDIN_PASSWORD"))

url = "https://www.linkedin.com/in/sebastian-lewis-4a81032b/"

print("Now scraping the following URL:", url)
person = Person(url, driver=driver, scrape=False)
person.scrape(close_on_complete=False)

and this is the result: [screenshot from 2021-01-27 21-04-05, attached in the original issue]

ronimosh commented 3 years ago

Here's the gist of my code. I managed to overcome the problem of data retained between calls by hard-clearing the arrays and deleting the object between calls.

from linkedin_scraper import Person, actions
from selenium import webdriver
import requests as re
import time
import sys

driver = webdriver.Chrome()
actions.login(driver, email, password)

line = ""
while True:
    # read next line
    rec_count += 1
    line = f.readline()
    if len(line) == 0:
        print("We're done!")
        break
    try:
        person = Person(person_url, about=[], experiences=[], educations=[],
                        interests=[], accomplishments=[], company=None,
                        job_title=None, contacts=[], driver=driver, scrape=False)
        person.scrape(close_on_complete=False)
    except Exception:
        print("Person #" + str(rec_count) + " - Profile not found for person ID " + person_id)
        continue

    # handle current position field
    currently = " "
    if person.job_title is not None:
        currently = person.job_title
    if person.company is not None:
        currently += " at " + person.company

    # handle About field
    about = " "
    for ab in person.about:
        about += repr(ab) + " "
        outfile.write(repr(ab) + " ")

    # handle Experiences field
    i = 0
    exp = " "
    for xp in person.experiences:
        exp += str(i) + ") " + repr(xp) + "\n"
        outfile.write(str(i) + ") " + repr(xp) + "\n")
        i = i + 1

    del person

f.close()

However, I still come across another problem which I did not manage to solve or work around. I am trying to scrape hundreds of profiles, but after somewhere between 10 and 50 profiles I start getting empty responses. I verified that the profiles are, of course, not actually empty. Once the empty profiles start, all subsequent calls usually return empty as well. Killing and restarting the program often solves the problem: after a restart, fetching the same profile usually returns the complete data. To me this indicates a problem in the code rather than something different in the LinkedIn profile returned. It also means I need to constantly monitor the execution to see when the empty profiles start.

Any ideas?


lukaskoeni commented 3 years ago

However, I still come across another problem which I did not manage to solve or work around. […] after somewhere between 10 and 50 profiles I start getting empty responses. […] Any ideas?

You're most likely running into anti-scraping provisions by LinkedIn. Try adding at least 60 seconds of waiting between scraping two consecutive profiles.
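A sketch of how that waiting could be combined with a retry when a profile comes back empty (the helper, its thresholds, and the emptiness check are assumptions, not a tested recipe):

import time

from linkedin_scraper import Person

def scrape_profile(url, driver, max_retries=2):
    # Hypothetical helper: retry with a growing pause when a profile
    # comes back empty, which in this thread tends to mean throttling.
    for attempt in range(max_retries + 1):
        person = Person(url, about=[], experiences=[], educations=[],
                        interests=[], accomplishments=[], contacts=[],
                        driver=driver, scrape=False)
        person.scrape(close_on_complete=False)
        if person.experiences or person.about:
            return person                   # got real data
        time.sleep(60 * (attempt + 1))      # back off before retrying
    return person                           # still empty after retries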

ronimosh commented 3 years ago

Will try. I am adding 10 seconds after every 10 profiles, but I will try increasing it.

Thanks!


mgutmann commented 3 years ago

Will try. I am adding 10 seconds after every 10 profiles, but I will try increasing it. Thanks!

Your solution worked for me too. Also, currently I am using a random 30 to 60 second delay and I don't seem to get empty responses.
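For reference, a minimal sketch of that jittered pacing (url_list and driver as set up earlier in the thread; results is a placeholder name):

import random
import time

from linkedin_scraper import Person

results = {}
for url in url_list:
    person = Person(url, driver=driver, scrape=False)
    person.scrape(close_on_complete=False)
    results[url] = [e.institution_name for e in person.educations]
    time.sleep(random.uniform(30, 60))  # random 30-60 s pause between profiles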

swagluke commented 3 years ago

Having similar issues with the object cleanup as well. I tried @ronimosh's approach of hard-resetting all parameters, which helps get rid of the previous person's profile, but I now see a lot more empty fields in the scraping results. For example, About, Experience and Interests just show up as empty arrays. @joeyism Your modified version of the code didn't work as a test: if you run another person scrape after it, you will see the same duplicate results from the previous run.
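For anyone trying to reproduce this, the minimal sequence is two scrapes in the same session (URLs are placeholders):

person1 = Person(url1, driver=driver, scrape=False)
person1.scrape(close_on_complete=False)
print(len(person1.experiences))   # person1's own entries

person2 = Person(url2, driver=driver, scrape=False)
person2.scrape(close_on_complete=False)
print(len(person2.experiences))   # also contains person1's entries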

Update on my earlier comment: I should have read the previous comments more carefully. LinkedIn does run some kind of anti-scraping provision; it doesn't return an error such as 429, it just returns empty pages, so linkedin_scraper can't find anything to scrape. Setting a sleep timer of 10 seconds for each iteration seems to help.