joeyism / linkedin_scraper

A library that scrapes LinkedIn for user data
GNU General Public License v3.0

How do I write into text file or csv file? #7

Closed alvinjchoi closed 6 years ago

alvinjchoi commented 6 years ago

Here's what I'm doing:

driver = webdriver.Chrome()
person = Person("http://www.linkedin.com/in/randomperson", driver=driver, scrape=False)
f = open('scrape.txt', 'w')
f.write(person.scrape())

But I'm getting an error:

TypeError: write() argument must be str, not None

How do I convert the person object into a string and append it to a text file, or better yet, a CSV file?

alvinjchoi commented 6 years ago

To give you an idea of what I'm doing: I have around 400 LinkedIn URLs and want to scrape each person's name, location, current company, current position, etc.

joeyism commented 6 years ago

There's nothing that comes out of the box. If you don't mind saving it as JSON, you could take the nested experiences and educations, get the dicts of those, and combine them with the dict of the person.

For example

# copy the scraped attributes and drop the WebDriver, which isn't JSON-serializable
d = person.__dict__.copy()
del d["driver"]

# flatten the nested Experience and Education objects into plain dicts
d["experiences"] = [experience.__dict__ for experience in person.experiences]
d["educations"] = [education.__dict__ for education in person.educations]

import json
json.dump(d, open("filename.json", "w+"))

I don't know what the structure would look like in CSV, as the data is nested and the number of experiences and educations may vary.
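
If a flat CSV is a hard requirement, though, one workable compromise is to write one row per person with only the top-level fields and leave the nested experiences and educations out. Here's a minimal sketch building on the d dict above; the field names are assumptions, so check d.keys() for what your version actually exposes:

import csv

# hypothetical top-level keys; inspect d.keys() to see what is really there
fields = ["name", "job_title", "company"]

with open("people.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    # one flat row per person; missing keys become empty cells
    writer.writerow({k: d.get(k, "") for k in fields})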

alvinjchoi commented 6 years ago
[screenshot of alvinjchoi's code]

So, I'm wondering if there's a way to loop through the multiple URLs in linkedin_urls.txt and keep scraping. Ignore the sleep timer; the numbers are random, as I was testing a few cases. I'm a complete beginner in Python, so this is a bit challenging. I imported the time module thinking I might need to give the page some time to load.

But here's the error I'm getting:

[screenshot of the error traceback]

joeyism commented 6 years ago

Your error came about because you aren't logged in, and LinkedIn's new policy requires you to log in to view a lot of profiles.

You can log in first, then loop through your file without closing the browser. That way, you don't have to log in again each time:

from selenium import webdriver
from linkedin_scraper import Person

driver = webdriver.Chrome()
driver.get("http://www.linkedin.com")

# put a breaker here via input() or something similar:
# you must log in manually at this point
input("Press Enter once you've logged in...")

all_users = {}
fp = open("linkedin_url.txt", "r")
for line in fp.readlines():
    person = Person(line.strip(), driver=driver, scrape=False)  # strip the newline readlines() leaves behind
    person.scrape(close_on_complete=False)

    d = person.__dict__.copy()
    del d["driver"]  # the WebDriver isn't JSON-serializable
    d["experiences"] = [experience.__dict__ for experience in person.experiences]
    d["educations"] = [education.__dict__ for education in person.educations]
    all_users[person.name] = d  # saves it all to one giant dict

import json
json.dump(all_users, open("filename.json", "w+"))
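
For what it's worth, the manual login pause can also be scripted with plain Selenium. A rough sketch, assuming LinkedIn's login page exposes username and password input IDs (the selectors are assumptions and may change, and automating login may run against LinkedIn's terms):

from selenium.webdriver.common.by import By

driver.get("https://www.linkedin.com/login")
driver.find_element(By.ID, "username").send_keys("you@example.com")  # your account email
driver.find_element(By.ID, "password").send_keys("your-password")    # your password
driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()  # submit the login form

Depending on your Selenium version, the older find_element_by_id style may apply instead.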

As a side note, it'd be a bit easier if you pasted your code in as text instead of posting a screenshot.

alvinjchoi commented 6 years ago

Hmm, I'm actually logged in but not sure why it's not working.

from linkedin_scraper import Person
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.linkedin.com")

# I'm logging into Linkedin here, then pressing enter afterwards, and the code continues
# Currently, after opening 'fp', it opens the profile that corresponds to the first line of the text file, 
# which is exactly what I want.
input("Press Enter to continue...")

all_users = {}
fp = open("linkedin_url.txt", "r")
for line in fp.readlines():
    person = Person(line, driver=driver, scrape=False)
    # I added this input() below to pinpoint exactly where my error comes from.
    # Even with LinkedIn logged in, and with the profile that I want open,
    # pressing Enter here to scrape throws the same error as last time.
    input("Press Enter to scrape...")
    person.scrape(close_on_complete=False)

    d = person.__dict__.copy()
    del d["driver"]
    d["experiences"] = [ experience.__dict__ for experience in person.experiences]
    d["educations"] = [ education.__dict__ for education in person.educations]
    all_users[person.name] = d

import json
json.dump(all_users, open("alumni.json", "w+"))

Also, my apologies for the screenshot. You are the best, this is helping me so much!!

[screenshot of the error traceback]

joeyism commented 6 years ago

I updated it with a bugfix, and it's published in 2.1.1. Update your linkedin_scraper and try it again.
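
For anyone following along, the upgrade is the usual pip step (assuming the package was installed from PyPI under the same name):

pip install --upgrade linkedin_scraper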

swissbeats93 commented 6 years ago

Following all the above steps, this is what results:

[screenshot of the result]