spinlud / linkedin-jobs-scraper


Scraper failing on heroku build #10

Closed RenzoPederzoli closed 1 year ago

RenzoPederzoli commented 3 years ago

Added the buildpack to run on Heroku, but it is still failing on the live site. It works properly on localhost. The error log below is from the Heroku logs:

2020-08-06T15:38:50.301805+00:00 app[web.1]: 2020-08-06T15:38:50.301Z scraper:info [Front-end Engineer][Miami] Page loaded
2020-08-06T15:39:00.356731+00:00 heroku[router]: at=info method=GET path="/linkedin-results/Miami/Front-end%20Engineer" host=ironjobs.herokuapp.com request_id=ffb267fd-3e26-4be0-9112-872d8fd4d135 fwd="162.196.252.246" dyno=web.1 connect=1ms service=15325ms status=200 bytes=321 protocol=https
2020-08-06T15:39:00.305984+00:00 app[web.1]: 2020-08-06T15:39:00.305Z scraper:error TimeoutError: waiting for selector "form#JOBS" failed: timeout 10000ms exceeded
2020-08-06T15:39:00.306001+00:00 app[web.1]:     at new WaitTask (/app/backend/node_modules/puppeteer/lib/cjs/puppeteer/common/DOMWorld.js:394:34)
2020-08-06T15:39:00.306001+00:00 app[web.1]:     at DOMWorld._waitForSelectorOrXPath (/app/backend/node_modules/puppeteer/lib/cjs/puppeteer/common/DOMWorld.js:326:26)
2020-08-06T15:39:00.306002+00:00 app[web.1]:     at DOMWorld.waitForSelector (/app/backend/node_modules/puppeteer/lib/cjs/puppeteer/common/DOMWorld.js:309:21)
2020-08-06T15:39:00.306002+00:00 app[web.1]:     at Frame.waitForSelector (/app/backend/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:801:51)
2020-08-06T15:39:00.306003+00:00 app[web.1]:     at Page.waitForSelector (/app/backend/node_modules/puppeteer/lib/cjs/puppeteer/common/Page.js:1215:33)
2020-08-06T15:39:00.306003+00:00 app[web.1]:     at LinkedinScraper.<anonymous> (/app/backend/node_modules/linkedin-jobs-scraper/build/scraper/LinkedinScraper.js:200:26)
2020-08-06T15:39:00.306004+00:00 app[web.1]:     at Generator.next (<anonymous>)
2020-08-06T15:39:00.306004+00:00 app[web.1]:     at fulfilled (/app/backend/node_modules/linkedin-jobs-scraper/build/scraper/LinkedinScraper.js:5:58)
2020-08-06T15:39:00.306440+00:00 app[web.1]: TimeoutError: waiting for selector "form#JOBS" failed: timeout 10000ms exceeded
2020-08-06T15:39:00.306441+00:00 app[web.1]:     at new WaitTask (/app/backend/node_modules/puppeteer/lib/cjs/puppeteer/common/DOMWorld.js:394:34)
2020-08-06T15:39:00.306442+00:00 app[web.1]:     at DOMWorld._waitForSelectorOrXPath (/app/backend/node_modules/puppeteer/lib/cjs/puppeteer/common/DOMWorld.js:326:26)
2020-08-06T15:39:00.306442+00:00 app[web.1]:     at DOMWorld.waitForSelector (/app/backend/node_modules/puppeteer/lib/cjs/puppeteer/common/DOMWorld.js:309:21)
2020-08-06T15:39:00.306443+00:00 app[web.1]:     at Frame.waitForSelector (/app/backend/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:801:51)
2020-08-06T15:39:00.306443+00:00 app[web.1]:     at Page.waitForSelector (/app/backend/node_modules/puppeteer/lib/cjs/puppeteer/common/Page.js:1215:33)
2020-08-06T15:39:00.306443+00:00 app[web.1]:     at LinkedinScraper.<anonymous> (/app/backend/node_modules/linkedin-jobs-scraper/build/scraper/LinkedinScraper.js:200:26)
2020-08-06T15:39:00.306444+00:00 app[web.1]:     at Generator.next (<anonymous>)
2020-08-06T15:39:00.306444+00:00 app[web.1]:     at fulfilled (/app/backend/node_modules/linkedin-jobs-scraper/build/scraper/LinkedinScraper.js:5:58)
2020-08-06T15:39:00.343858+00:00 app[web.1]: GET /linkedin-results/Miami/Front-end%20Engineer 200 15304.326 ms - 2
spinlud commented 3 years ago

Hi! Have you checked whether the requirements mentioned here and here are satisfied?

I have no experience with Heroku buildpacks, so let me know.

RenzoPederzoli commented 3 years ago

Where should I put this code?

args: [
    '--no-sandbox',
    '--disable-setuid-sandbox',
]
spinlud commented 3 years ago
const scraper = new LinkedinScraper({
    headless: true,
    slowMo: 10,
    // Extra Chrome flags, forwarded to Puppeteer's launch options.
    // These two are typically required to run Chrome on Heroku.
    args: [
        "--no-sandbox",
        "--disable-setuid-sandbox",
    ],
});
RenzoPederzoli commented 3 years ago

I am still getting this error in the Heroku logs: TimeoutError: waiting for selector "form#JOBS" failed. Thanks for the prompt reply, btw!

spinlud commented 3 years ago

@RenzoPederzoli I am on vacation, will look into that when I am back 😉

spinlud commented 3 years ago

Hi again! The problem seems to be related to Linkedin redirecting certain IPs to an authwall. I suspect that any anonymous traffic coming from AWS (which I think also includes Heroku) will most likely be redirected to the authwall, thus requiring authentication. I can't do much about that (meaning I can't make Linkedin allow anonymous traffic from AWS IPs), but I've added an authenticated session mode which should overcome the issue (of course you need a valid Linkedin account). The initial goal of this library was to provide job-scraping functionality without authentication, but for environments such as AWS or Heroku there is, AFAIK, no other way than using an authenticated session.

Please check the documentation anonymous-vs-authenticated-session. I've tested both on AWS and Heroku and it seems to work, but mind that Linkedin rate limiting is much more aggressive with an authenticated session, so the scraper is much slower.
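
For reference, a minimal sketch of an authenticated run (assuming, per the linked docs, that the li_at session cookie is supplied through the LI_AT_COOKIE environment variable; on Heroku you would set that as a config var, and the query/option names below follow the library README):

// Minimal sketch: authenticated session mode, assuming LI_AT_COOKIE is
// already set in the environment (e.g. a Heroku config var holding the
// li_at cookie value copied from a logged-in browser session).
const { LinkedinScraper, events } = require("linkedin-jobs-scraper");

(async () => {
    const scraper = new LinkedinScraper({
        headless: true,
        slowMo: 100, // keep this high: rate limiting is harsher when authenticated
        args: [
            "--no-sandbox",
            "--disable-setuid-sandbox",
        ],
    });

    scraper.on(events.scraper.data, (data) => console.log(data.title, data.place));
    scraper.on(events.scraper.error, (err) => console.error(err));

    await scraper.run([{
        query: "Front-end Engineer",
        options: { locations: ["Miami"], limit: 10 },
    }]);

    await scraper.close();
})();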

Let me know if this solves your issue!

RenzoPederzoli commented 3 years ago

Ahhh okay thank you so much, appreciate it! I'll give it a try this weekend and let you know.

ppetruneac commented 3 years ago

Hi @spinlud - thanks for building the library, it's very helpful.

Adding this for reference / completeness...

I see the same anonymous-session issue on Google Cloud Platform. I had previously had connection issues outside the VM, but these were resolved with firewall rules. For this test, the firewall rules were enabled:

gcloud compute --project=$1 firewall-rules create default-allow-http \
  --direction=INGRESS --priority=1000 --network=default \
  --action=ALLOW --rules=tcp:80 \
  --source-ranges=0.0.0.0/0 --target-tags=http-server

and here is the error I get when running the scraper inside a Google Cloud VM:

  scraper:error Error: Protocol error (Target.setDiscoverTargets): Target closed.
  scraper:error     at /home/pp/linkedin-jobs-scraper/node_modules/puppeteer/lib/cjs/puppeteer/common/Connection.js:71:63
  scraper:error     at new Promise (<anonymous>)
  scraper:error     at Connection.send (/home/pp/linkedin-jobs-scraper/node_modules/puppeteer/lib/cjs/puppeteer/common/Connection.js:70:16)
  scraper:error     at Function.create (/home/pp/linkedin-jobs-scraper/node_modules/puppeteer/lib/cjs/puppeteer/common/Browser.js:95:26)
  scraper:error     at ChromeLauncher.launch (/home/pp/linkedin-jobs-scraper/node_modules/puppeteer/lib/cjs/puppeteer/node/Launcher.js:99:56)
  scraper:error     at async PuppeteerExtra.launch (/home/pp/linkedin-jobs-scraper/node_modules/puppeteer-extra/dist/index.cjs.js:129:25) +0ms
Error: Protocol error (Target.setDiscoverTargets): Target closed.
    at /home/pp/linkedin-jobs-scraper/node_modules/puppeteer/lib/cjs/puppeteer/common/Connection.js:71:63
    at new Promise (<anonymous>)
    at Connection.send (/home/pp/linkedin-jobs-scraper/node_modules/puppeteer/lib/cjs/puppeteer/common/Connection.js:70:16)
    at Function.create (/home/pp/linkedin-jobs-scraper/node_modules/puppeteer/lib/cjs/puppeteer/common/Browser.js:95:26)
    at ChromeLauncher.launch (/home/pp/linkedin-jobs-scraper/node_modules/puppeteer/lib/cjs/puppeteer/node/Launcher.js:99:56)
    at async PuppeteerExtra.launch (/home/pp/linkedin-jobs-scraper/node_modules/puppeteer-extra/dist/index.cjs.js:129:25)

P.S. I am using ChromeDriver with Python for other scrapers on GCP Compute and they work fine, but not for Linkedin.

spinlud commented 3 years ago

Hi @ppetruneac-da! Are you running the application in headless mode (headless: true)? Did you test Puppeteer with a sample application on GCP (e.g. just opening a website)?

Also make sure that you are passing the right Chrome options for your environment (GCP). The defaults are:

[
    "--start-maximized",
    "--window-size=1472,828",
    "--lang=en-GB",
    "--no-sandbox",
    "--disable-setuid-sandbox",
    "--disable-gpu",
    "--disable-dev-shm-usage",
    "--proxy-server='direct://'",
    "--proxy-bypass-list=*",
    "--disable-accelerated-2d-canvas",
    "--allow-running-insecure-content",
    "--disable-web-security",
    "--disable-client-side-phishing-detection",
    "--disable-notifications",
    "--mute-audio",
    "--enable-automation",
]

You can override them in the scraper constructor:

const scraper = new LinkedinScraper({
    headless: true,
    slowMo: 50,
    args: [
        "--lang=en-GB",
    ],
});
ppetruneac commented 3 years ago

I tested in headless mode. I only changed the slowMo value; all other values you referenced were the same.
I have not tested Puppeteer with a sample application on GCP (e.g. just opening a website) -- I will run a test and come back.

ppetruneac commented 3 years ago

Hi @spinlud - here is the test I ran:

Create a VM, allowing --tags=http-server,https-server:

gcloud beta compute \
    --project=ID instances create vm-test \
    --zone=us-central1-a --machine-type=e2-medium --subnet=default \
    --network-tier=PREMIUM --maintenance-policy=MIGRATE \
    --service-account=ID-compute@developer.gserviceaccount.com \
    --scopes=https://www.googleapis.com/auth/cloud-platform \
    --tags=http-server,https-server --image=ubuntu-1804-bionic-v20201111 \
    --image-project=ubuntu-os-cloud --boot-disk-size=10GB --boot-disk-type=pd-standard \
    --boot-disk-device-name=vm-test --no-shielded-secure-boot --shielded-vtpm \
    --shielded-integrity-monitoring --reservation-affinity=any

cat /etc/os-release:

NAME="Ubuntu"
VERSION="18.04.5 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.5 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

Install some requirements:

sudo -S curl -sL https://deb.nodesource.com/setup_12.x | sudo bash -
sudo apt-get install -y nodejs
npm install puppeteer

Copy this into example.js:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({path: 'example.png'});

  await browser.close();
})();

When I run it with node example.js, I get this error:

mac@vm-test:~$ node example.js
(node:4302) UnhandledPromiseRejectionWarning: Error: Failed to launch the browser process!
/home/mac/node_modules/puppeteer/.local-chromium/linux-809590/chrome-linux/chrome: error while loading shared libraries: libatk-1.0.so.0: cannot open shared object file: No such file or directory

TROUBLESHOOTING: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md

    at onClose (/home/mac/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:193:20)
    at Interface.<anonymous> (/home/mac/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:183:68)
    at Interface.emit (events.js:326:22)
    at Interface.close (readline.js:416:8)
    at Socket.onend (readline.js:194:10)
    at Socket.emit (events.js:326:22)
    at endReadableNT (_stream_readable.js:1223:12)
    at processTicksAndRejections (internal/process/task_queues.js:84:21)
(node:4302) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 1)
(node:4302) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

I'm not very familiar with JS, but the same <anonymous> comes up. I am using ChromeDriver with Python on GCP and it works fine; I'm just not familiar with Puppeteer.

Any thoughts on the above? There are many resources on using Puppeteer in serverless mode with Cloud Functions, but not much on VMs.

spinlud commented 3 years ago

@ppetruneac-da Probably some dependencies are missing? The libatk-1.0.so.0 error usually means Chrome's OS-level shared libraries are not installed on the VM: https://medium.com/@ssmak/how-to-fix-puppetteer-error-while-loading-shared-libraries-libx11-xcb-so-1-c1918b75acc3

Also, you may want to look at a Python alternative (Selenium + Chrome driver): https://github.com/spinlud/py-linkedin-jobs-scraper

ppetruneac commented 3 years ago

@spinlud I had not seen the Python one :D Thanks. I will give it a go and let you know how it goes.

ppetruneac commented 3 years ago

@spinlud I tested the Python repo and it's the same: you'd need to authenticate. This is what you get:

INFO:li:scraper:('Implementing strategy AnonymousStrategy',)
INFO:li:scraper:('Starting new query', "Query(query=Engineer options=QueryOptions(limit=10 locations=['Worldwide'] optimize=True))")
INFO:li:scraper:('[Engineer][Worldwide]', 'Opening https://www.linkedin.com/jobs/search?keywords=Engineer&location=Worldwide&redirect=false&position=1&pageNum=0')
WARNING:li:scraper:('[Engineer][Worldwide]', 'Error in response', 'request_id=00D8C9A98A721F9FCD6827A7C56F537D status=999 type=Document mime_type=text/html url=https://www.linkedin.com/jobs/search?keywords=Engineer&location=Worldwide&redirect=false&position=1&pageNum=0')
ERROR:li:scraper:('Scraper failed to run in anonymous mode, authentication may be necessary for this environment. Please check the documentation on how to use an authenticated session.',)
NoneType: None
[ON_END]
spinlud commented 3 years ago

@ppetruneac-da That's not an error. If Linkedin requires authentication for traffic generated from AWS, it is very likely that it asks the same for machines hosted on GCP. You probably need to run your application with an authenticated session, as the logs suggest.

ppetruneac commented 3 years ago

Oh yes! I just made the comment for completeness, as promised, so others will see this too :) Thanks again for your great work!

spinlud commented 3 years ago

> Oh yes! I just made the comment for completeness, as promised, so others will see this too :) Thanks again for your great work!

You are welcome! For any problems with the Python library, please report them in that repo's issues section.

AhmedBHameed commented 3 years ago

I'm using a dedicated server from the Hetzner company and am also getting the 999 error code, even with an authenticated user (I mean LI_AT_COOKIE, if I'm not wrong).

It seems something else is affecting the server, but I'm still not sure what it is or why.

On my local machine it works perfectly, even with anonymous search.

More context

I'm using Docker with the following configuration:

# A minimal Docker image with Node and Puppeteer
#
# Initially based upon:
# https://github.com/GoogleChrome/puppeteer/blob/master/docs/troubleshooting.md#running-puppeteer-in-docker

FROM node:12.22.0-buster-slim

RUN  apt-get update \
     && apt-get install -y wget gnupg gconf-service libasound2 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 \
     libfontconfig1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 \
     libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 \
     libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 ca-certificates fonts-liberation libappindicator1 libnss3 lsb-release xdg-utils \
     && wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add - \
     && sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list' \
     && apt-get update \
     # We install Chrome to get all the OS level dependencies, but Chrome itself
     # is not actually used as it's packaged in the node puppeteer library.
     # Alternatively, we could include the entire dep list ourselves
     # (https://github.com/puppeteer/puppeteer/blob/master/docs/troubleshooting.md#chrome-headless-doesnt-launch-on-unix)
     # but that seems too easy to get out of date.
     && apt-get install -y google-chrome-stable fonts-kacst fonts-freefont-ttf \
     --no-install-recommends \
     && rm -rf /var/lib/apt/lists/*
# && wget --quiet https://raw.githubusercontent.com/vishnubob/wait-for-it/master/wait-for-it.sh -O /usr/sbin/wait-for-it.sh \
# && chmod +x /usr/sbin/wait-for-it.sh

# Install Puppeteer under /node_modules so it's available system-wide
ADD package.json yarn.lock /

RUN yarn

WORKDIR /usr/jobs_scrap/

COPY ./jobs_scrap/package*.json ./

RUN yarn

COPY ./jobs_scrap/ .

Maybe there is a missing library, or something with the Puppeteer configuration!? For reference, a sketch of my launch configuration is below.
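
This is roughly how I construct the scraper inside the container (a minimal sketch; the sandbox flags are the ones from this thread, and --disable-dev-shm-usage is the commonly recommended flag for Docker, where /dev/shm defaults to only 64MB):

// Minimal sketch of the scraper setup used inside the Docker container.
// --disable-dev-shm-usage makes Chrome use /tmp instead of the tiny
// default /dev/shm; the sandbox flags are needed when Chrome runs as
// root inside the container.
const { LinkedinScraper } = require("linkedin-jobs-scraper");

const scraper = new LinkedinScraper({
    headless: true,
    slowMo: 100,
    args: [
        "--no-sandbox",
        "--disable-setuid-sandbox",
        "--disable-dev-shm-usage",
    ],
});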

I found some mentions of the user-agent being the problem, but the library already uses a random user-agent, so I don't believe that is the issue.

Will try to update you if I find something.