linkedtales / scrapedin

LinkedIn Scraper (currently working 2020)
Apache License 2.0

Throw error when redirecting to authwall #134

Open acanimal opened 4 years ago

acanimal commented 4 years ago

It is possible for your cookie credentials to become invalid, in which case LinkedIn redirects to the "authwall", where you need to log in again.

The current code simply returns an empty profile object, which then causes an error like: Cannot read property 'name' of undefined at module.exports (xxx/node_modules/scrapedin/src/profile/cleanProfileData.js:5:23)

At least for me, in those cases it is necessary to know whether the profile failed because of an auth error, so I have slightly modified the profile.js file with the following lines:

module.exports = async (browser, cookies, url, waitTimeToScrapMs = 500, hasToGetContactInfo = false, puppeteerAuthenticate = undefined) => {
  ...
  const page = await openPage({ browser, cookies, url, puppeteerAuthenticate })

  let authwall = false;
  page.on('response', response => {
    const status = response.status()
    if ((status >= 300) && (status <= 399)) {
      // A 3xx response may lack a Location header (e.g. 304), so default to ''
      const location = response.headers()['location'] || '';
      if (location.includes('authwall')){
        authwall = true;
      }
    }
  })

  const profilePageIndicatorSelector = '.pv-profile-section'
  await page.waitFor(profilePageIndicatorSelector, { timeout: 5000 })
    .catch(() => {
      // Why not throw an error here instead of continuing to scrape?
      // Because it can be a false negative: LinkedIn may only have changed that selector while everything else is fine :)
      logger.warn('profile selector was not found')
    })

  // If redirect to authwall is detected throw error
  if (authwall) {
    const msg = 'Redirected to authwall :( You need new credentials';
    logger.warn(msg);
    throw new Error(msg);
  }

  ...
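
The redirect check above can also be pulled out into a small pure function, which is easier to test without launching a browser. This is only a sketch under stated assumptions: `isAuthwallRedirect` and `AuthwallError` are hypothetical names that do not exist in scrapedin, and the only thing assumed about Puppeteer is that `response.status()` returns a number and `response.headers()` returns an object with lower-case header names.

```javascript
// Hypothetical error type (not part of scrapedin) so callers can tell an
// auth failure apart from other scraping errors with `instanceof`.
class AuthwallError extends Error {
  constructor (message = 'Redirected to authwall :( You need new credentials') {
    super(message)
    this.name = 'AuthwallError'
  }
}

// Hypothetical helper: returns true when the response is a 3xx redirect
// whose Location header contains 'authwall'. A missing or empty header
// object is treated as "no redirect target" instead of throwing.
function isAuthwallRedirect (status, headers) {
  const isRedirect = status >= 300 && status <= 399
  const location = (headers && headers.location) || ''
  return isRedirect && location.includes('authwall')
}

// Usage sketch inside the response listener shown above:
//   page.on('response', response => {
//     if (isAuthwallRedirect(response.status(), response.headers())) {
//       authwall = true
//     }
//   })
//   ...
//   if (authwall) throw new AuthwallError()
```

With a dedicated error type, calling code could catch the failure and check `err instanceof AuthwallError` to decide whether to refresh the cookies, rather than matching on the message string.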

I don't know if this is something you want to integrate in the project. If so, let me know and I will send a PR.

Thanks in advance.

leonardiwagner commented 4 years ago

Send a PR for sure. I'm very busy relocating right now, but there are more people who can review and approve it; once that's done, I'll just publish the npm package.

Thank you.

acanimal commented 4 years ago

Done https://github.com/linkedtales/scrapedin/pull/136