Closed EmanFateen closed 4 years ago
I was able to scrape your profile just fine on the latest version (1.0.20) using cookies.
const fs = require("fs");
const scrapedin = require("scrapedin");
const puppeteer = require("puppeteer");

async function main() {
  let browser;
  try {
    console.log("Scraping");
    const linkedInUrl = "https://www.linkedin.com/in/emanfateen/";
    const cookies = fs.readFileSync("cookies.json");

    // Launch our own browser so scrapedin can attach to it through the WebSocket endpoint.
    const puppeteerArgs = {};
    const args = Object.assign(
      {
        headless: true,
        args: ["--no-sandbox"],
      },
      puppeteerArgs
    );
    browser = await puppeteer.launch(args);
    const wsEndpoint = await browser.wsEndpoint();

    const options = {
      cookies: JSON.parse(cookies.toString()),
      hasToLog: true,
      isHeadless: true,
      endpoint: wsEndpoint,
    };
    const profileScraper = await scrapedin(options);
    const linkedInProfile = await profileScraper(linkedInUrl);
    console.log(linkedInProfile);
  } catch (err) {
    console.log(err);
  } finally {
    // Always close the browser, even when scraping fails.
    if (browser) {
      await browser.close();
    }
  }
}

main();
Have you tried scraping a profile that is not your own? Sometimes LinkedIn displays a different page when you are viewing your own profile, which can potentially break the scraper.
@gautierdag I tried just now and I get the same error!
After using your code, here is my log:
scrapedin: 2020-07-08T10:11:21.788Z info: [scrapedin.js] initializing
scrapedin: 2020-07-08T10:11:21.790Z info: [scrapedin.js] using cookies, login will be bypassed
scrapedin: 2020-07-08T10:11:21.790Z info: [profile/profile.js] starting scraping url: https://www.linkedin.com/in/mohamedmed7at/
scrapedin: 2020-07-08T10:11:28.332Z warn: [profile/profile.js] profile selector was not found
scrapedin: 2020-07-08T10:11:28.332Z info: [profile/profile.js] scrolling page to the bottom
scrapedin: 2020-07-08T10:11:28.838Z info: [profile/scrollToPageBottom.js] scrolling to page bottom (1)
scrapedin: 2020-07-08T10:11:29.342Z info: [profile/scrollToPageBottom.js] scrolling to page bottom (2)
scrapedin: 2020-07-08T10:11:29.846Z info: [profile/scrollToPageBottom.js] scrolling to page bottom (3)
scrapedin: 2020-07-08T10:11:30.364Z info: [profile/scrollToPageBottom.js] scrolling to page bottom (4)
scrapedin: 2020-07-08T10:11:30.871Z info: [profile/scrollToPageBottom.js] scrolling to page bottom (5)
scrapedin: 2020-07-08T10:11:31.376Z info: [profile/scrollToPageBottom.js] scrolling to page bottom (6)
scrapedin: 2020-07-08T10:11:31.880Z info: [profile/scrollToPageBottom.js] scrolling to page bottom (7)
scrapedin: 2020-07-08T10:11:32.384Z info: [profile/scrollToPageBottom.js] scrolling to page bottom (8)
scrapedin: 2020-07-08T10:11:32.890Z info: [profile/scrollToPageBottom.js] scrolling to page bottom (9)
scrapedin: 2020-07-08T10:11:33.394Z info: [profile/scrollToPageBottom.js] scrolling to page bottom (10)
scrapedin: 2020-07-08T10:11:33.898Z info: [profile/scrollToPageBottom.js] scrolling to page bottom (11)
scrapedin: 2020-07-08T10:11:34.403Z info: [profile/scrollToPageBottom.js] scrolling to page bottom (12)
scrapedin: 2020-07-08T10:11:34.907Z info: [profile/scrollToPageBottom.js] scrolling to page bottom (13)
scrapedin: 2020-07-08T10:11:35.411Z info: [profile/scrollToPageBottom.js] scrolling to page bottom (14)
scrapedin: 2020-07-08T10:11:35.915Z info: [profile/scrollToPageBottom.js] scrolling to page bottom (15)
scrapedin: 2020-07-08T10:11:36.419Z info: [profile/scrollToPageBottom.js] scrolling to page bottom (16)
scrapedin: 2020-07-08T10:11:36.923Z info: [profile/scrollToPageBottom.js] scrolling to page bottom (17)
scrapedin: 2020-07-08T10:11:37.427Z info: [profile/scrollToPageBottom.js] scrolling to page bottom (18)
scrapedin: 2020-07-08T10:11:37.931Z info: [profile/scrollToPageBottom.js] scrolling to page bottom (19)
scrapedin: 2020-07-08T10:11:38.435Z info: [profile/scrollToPageBottom.js] scrolling to page bottom (20)
scrapedin: 2020-07-08T10:11:38.940Z info: [profile/scrollToPageBottom.js] scrolling to page bottom (21)
scrapedin: 2020-07-08T10:11:39.444Z info: [profile/scrollToPageBottom.js] scrolling to page bottom (22)
scrapedin: 2020-07-08T10:11:39.948Z info: [profile/scrollToPageBottom.js] scrolling to page bottom (23)
scrapedin: 2020-07-08T10:11:40.453Z info: [profile/scrollToPageBottom.js] scrolling to page bottom (24)
scrapedin: 2020-07-08T10:11:40.959Z info: [profile/scrollToPageBottom.js] scrolling to page bottom (25)
scrapedin: 2020-07-08T10:11:40.959Z warn: [profile/scrollToPageBottom.js] page bottom not found
scrapedin: 2020-07-08T10:11:40.959Z info: [profile/profile.js] applying 1st delay
scrapedin: 2020-07-08T10:11:41.257Z info: [profile/profile.js] applying 2nd (and last) delay
scrapedin: 2020-07-08T10:11:41.592Z info: [profile/profile.js] finished scraping url: https://www.linkedin.com/in/mohamedmed7at/
TypeError: Cannot read property 'name' of undefined
    at module.exports (/media/eman/programs/projects/projecr/crawler-latest/node_modules/scrapedin/src/profile/cleanProfileData.js:5:23)
    at module.exports (/media/eman/programs/projects/project/crawler-latest/node_modules/scrapedin/src/profile/profile.js:79:26)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at async Function.start (/media/eman/programs/projects/projects/crawler-latest/scraper.js:81:28)
    at async Socket. (/media/eman/programs/projects/projects/crawler-latest/crawler.js:39:7)
Did you try running locally? Can you try running it with headless: false and isHeadless: false to see what the scraper sees? I have a feeling LinkedIn might be asking you to solve a captcha or to verify that it is you, so the scraper never actually reaches the profile page. Maybe try refreshing your cookies with ones from a fresh login?
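Before refreshing cookies by hand, it may help to check whether the exported file still holds a usable session at all. This is only a rough sketch and not part of scrapedin: `checkSessionCookie` is a hypothetical helper, and it assumes the cookies are in the Puppeteer-style shape (`name`, `value`, `expires` in Unix seconds) and that the LinkedIn session cookie is named `li_at`. If your export tool uses different field names, adjust accordingly.

```javascript
// Hypothetical helper (not part of scrapedin): sanity-check exported cookies
// before handing them to the scraper.
// Assumptions: Puppeteer-style cookie objects { name, value, expires }, where
// `expires` is a Unix timestamp in seconds (-1 or missing for session cookies),
// and the LinkedIn session cookie is named "li_at".
function checkSessionCookie(cookies, nowMs = Date.now()) {
  if (!Array.isArray(cookies)) {
    return { ok: false, reason: "cookies.json does not contain an array" };
  }
  const session = cookies.find((c) => c && c.name === "li_at");
  if (!session || !session.value) {
    return { ok: false, reason: "no li_at session cookie found" };
  }
  const expires = Number(session.expires);
  // Only treat the cookie as expired when a positive expiry timestamp is present.
  if (Number.isFinite(expires) && expires > 0 && expires * 1000 <= nowMs) {
    return { ok: false, reason: "li_at session cookie has expired" };
  }
  return { ok: true, reason: "session cookie present and not expired" };
}
```

Something like `checkSessionCookie(JSON.parse(fs.readFileSync("cookies.json").toString()))` could run before calling scrapedin. Note that this only catches a missing or locally expired cookie; LinkedIn can still invalidate a session on its side, which this check cannot see.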
Unfortunately, LinkedIn can detect when you are not using your normal IP address. If you created or regularly use an account in one place and then run the scraper from a server on another IP, you might be asked to solve a captcha or enter an email code, which you cannot do in headless mode. Sometimes LinkedIn will also allow a couple of requests before requiring this check, so it might work for a little while until you are blocked and have to reset your cookies.
@gautierdag Yes, thank you. I set headless: false and found that the LinkedIn account used for login had been blocked! I made a new one, and after scraping 10 times it was blocked too!
Any advice on how to keep the LinkedIn account from being blocked?
@gautierdag You are right, LinkedIn had restricted my account. So I created another account and used a new cookie. It worked for 30 minutes and then the account was restricted again. This happens on the remote server, but on my local machine it works fine. It's not practical to manually extract a new cookie every 30 minutes and upload it to the server!
Is there any way to solve this issue?
Unfortunately, there is no easy way around this, and this library cannot solve it :/
The best you can do is make sure that the LinkedIn account you are using is aged (the older the better) and verified (email + phone number). You should also try not to spam scrape, and slowly "warm up" the usage. So if you had a fresh account, you could scrape once the first day, twice the second day, and so on. This will be very slow, which is why you might want to have many verified accounts in parallel.
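The warm-up idea above (once on day one, twice on day two, and so on) could be sketched roughly like this. Everything here is hypothetical and not part of scrapedin: the function names, the cap of 20 per day, and the one-to-three-minute pauses are illustrative guesses, not known LinkedIn limits.

```javascript
// Hypothetical warm-up schedule: day 1 allows 1 scrape, day 2 allows 2, etc.,
// up to an arbitrary cap. The numbers are illustrative, not LinkedIn limits.
function dailyQuota(dayNumber, maxPerDay = 20) {
  if (!Number.isInteger(dayNumber) || dayNumber < 1) return 0;
  return Math.min(dayNumber, maxPerDay);
}

// Random pause between requests so they are not evenly spaced.
// `random` is injectable to make the function easy to test.
function randomDelayMs(minMs, maxMs, random = Math.random) {
  return Math.floor(minMs + random() * (maxMs - minMs));
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// `scrapeOne` is whatever performs a single scrape,
// for example (url) => profileScraper(url).
async function scrapeWithWarmup(urls, dayNumber, scrapeOne, opts = {}) {
  const { minDelayMs = 60000, maxDelayMs = 180000, maxPerDay = 20 } = opts;
  const batch = urls.slice(0, dailyQuota(dayNumber, maxPerDay));
  const results = [];
  for (let i = 0; i < batch.length; i++) {
    // Wait between scrapes, but not before the first one.
    if (i > 0) await sleep(randomDelayMs(minDelayMs, maxDelayMs));
    results.push(await scrapeOne(batch[i]));
  }
  return results;
}
```

You would have to track `dayNumber` yourself (for example, days since the account's first scrape). Going slowly like this does not guarantee the account stays unblocked.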
If you are consistently blocked on your remote server, it could be that its IP was flagged as suspect (from your previous attempts), so it will now be a lot more likely to be blocked by LinkedIn. You might want to obtain a fresh server (if using a cloud provider), use a VPN, etc.
If you use your own account on your own computer, you will not get blocked as often, since LinkedIn will think that it is you (same IP, etc.). That's why, depending on your use case, it might be simpler to just scrape locally.
Alternatively, there are captcha solvers out there and ways to buy aged accounts, but I have never used them as they are quite shady. You'd also have to modify the library to detect the captcha / block screen and get past it.
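As a starting point for the detection part only, one option is to look at the URL the browser ended up on. This is a minimal sketch under assumptions: `looksBlocked` is a hypothetical helper, not something scrapedin provides, and the path prefixes are my guess at where LinkedIn redirects interrupted sessions. The list may be incomplete or out of date.

```javascript
// Hypothetical helper (not part of scrapedin).
// Assumption: when LinkedIn interrupts a session, it redirects to a URL whose
// path starts with one of these prefixes. This list is a guess.
const BLOCK_PATH_PREFIXES = ["/checkpoint/", "/authwall", "/login", "/uas/login"];

function looksBlocked(currentUrl) {
  let pathname;
  try {
    pathname = new URL(currentUrl).pathname;
  } catch (err) {
    // An unparseable URL is certainly not a profile page.
    return true;
  }
  return BLOCK_PATH_PREFIXES.some((prefix) => pathname.startsWith(prefix));
}
```

In a Puppeteer script this could be fed with `page.url()` after navigating to the profile, so the scraper stops early with a clear message instead of failing later with a TypeError. It only detects the block; it does nothing to get past it.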
@gautierdag Thank you. I got it
For those of you having these issues right now, try downgrading to puppeteer@5.3.1.
Hello, I'm using scrapedin version 1.0.20 and this line:
const profile = await profileScraper('https://www.linkedin.com/in/emanfateen/')
and here is my log. I'm logging in with a cookie, and here are my settings:
scraper: {
  hasToLog: true,
  isHeadless: true,
  puppeteerArgs: { args: ['--no-sandbox'] },
  interval: 13000,
}
How can I solve this issue?