linkedtales / scrapedin

LinkedIn Scraper (currently working 2020)
Apache License 2.0
597 stars 174 forks

Anybody have any tips for scraping a massive number of profiles? #87

Open Aditya94A opened 4 years ago

Aditya94A commented 4 years ago

LinkedIn is definitely one of the hardest websites to scrape. Would love to hear from someone with experience scraping LinkedIn at scale (say, for a use case of scraping hundreds of thousands of profiles every week).

What typical rate limits do you set? How many accounts do you use? Are they all free or paid? What precautions do you take to avoid getting banned/IP-blocked/auto-logged out etc.?

rvvvt commented 4 years ago

I do lots of LI scraping. There are some "best practices" as outlined here by Phantombuster; however, I have not really stuck to these limits or paid any attention to them. At most, I get the "whoaa you do a lot of searches, pal" warning, but that's it. I've never counted how many searches it takes to get the warning either.

But in any case, the more natural you make it look, the better, in my experience. I have a non-premium account that I regularly send hundreds of customized connection requests from, using a bot I wrote. The bot uses Selenium, and the only manual action I take is logging in; the bot handles the rest. The account is still going strong after 2+ years of me literally abusing the option of sending connection requests with personalized notes.

If you are afraid of getting an account banned, just make several accounts to do your scraping. I personally have never used proxies with LinkedIn, as I find they usually make matters worse (at least with other social media). But if you must use a proxy, try rolling a Shadowsocks proxy on Linode. It's cheap and easy to set up, and the proxy works really well for almost all the other scraping uses that I throw at it.

Hope this helps some!

Aditya94A commented 4 years ago

@architekco That's more like connection request automation for a single account and not scraping. Scraping would be downloading the information of a large volume of accounts, continuously and repeatedly.

The difference would be making on the order of a few million requests each week/month as opposed to a few hundred 😄

rvvvt commented 4 years ago

@AdityaAnand1 Yes, I understand. It was more of a general model to take some hints from, I suppose. And I understand the difference between scraping and sending connection requests :) I guess what I was trying to convey was that there isn't really a "perfect" way to do it or any known hard limits. What would be the point? LinkedIn has got to keep us guessing, otherwise we would all scrape under the radar and have a field day... as long as we stuck to their limits. And it is for this reason that they can't let us know what their limits are ;) It would be like LinkedIn saying "ok guys, pilfer our data against our TOS so long as you only pilfer this hard."

Just go crazy. Get banned. Try again. Find out what works. One set of limitations/timeouts/requests that works on my machine and network may vary greatly from yours. Many of these conditions are out of our control, and the ones that are in our control are always changing. Just make a lot of accounts, and go as slowly as one can reasonably go when scraping the data of hundreds of thousands of people :)
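As a rough illustration of the "go slowly" advice above, here is a minimal Python sketch of pacing a batch of actions with randomized pauses between them. The function name `paced`, its parameters, and the 2 to 6 second default range are all assumptions made for illustration; nobody in this thread states limits that are known to work, and the right values would have to be found by experiment.

```python
import random
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def paced(
    items: Iterable[T],
    action: Callable[[T], R],
    min_delay: float = 2.0,   # assumed value, not a known safe limit
    max_delay: float = 6.0,   # assumed value, not a known safe limit
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[R]:
    """Apply `action` to each item, pausing a random time between calls.

    The pause is drawn uniformly from [min_delay, max_delay] seconds so the
    calls are not evenly spaced. No pause is taken before the first item.
    """
    if min_delay < 0 or max_delay < min_delay:
        raise ValueError("need 0 <= min_delay <= max_delay")

    first = True
    for item in items:
        if not first:
            sleep(random.uniform(min_delay, max_delay))
        first = False
        yield action(item)
```

Hypothetical usage: `for result in paced(profile_ids, fetch_profile): ...`, where `profile_ids` and `fetch_profile` stand in for whatever work is being throttled. The `sleep` parameter exists only so the pauses can be replaced with a stub when testing.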

Edit: I forgot to mention I was actually helping a friend set up a LI scraping operation the other night, and we were running 50 accounts through Phantombuster. I will comment back with general results when I get them.

jeremiah-94 commented 4 years ago

I do a lot of LinkedIn scraping as well, to collect profile data for a predictive model I'm building. This platform, Mantheos, has an API that allows me to extract LinkedIn profile data en masse. With other solutions, I usually find the volume limits quite restrictive, both in the number of profiles that can be extracted and in the number of connection requests.

zaidharis2801 commented 1 year ago

@Aditya94A did you find anything?