RenzoPederzoli closed this issue 1 year ago
Where should I put this code?
args: [
    '--no-sandbox',
    '--disable-setuid-sandbox',
]
const scraper = new LinkedinScraper({
    headless: true,
    slowMo: 10,
    args: [
        "--no-sandbox",
        "--disable-setuid-sandbox",
    ],
});
I am still getting this error in the Heroku logs:
TimeoutError: waiting for selector "form#JOBS" failed
Thanks for the prompt reply btw!
@RenzoPederzoli I am on vacation, will look into that when I am back 😉
Hi again! The problem seems to be related to Linkedin redirecting to an authwall for certain IPs. I suspect that any anonymous traffic coming from AWS (which includes also Heroku I think) will be most likely redirected to the authwall (thus authentication is required). I can't do very much about that (meaning allowing anonymous traffic on Linkedin from AWS IPs), but I've added an authenticated session mode which should overcome the issue (of course you need a valid Linkedin account). The initial goal of this library was to provide jobs scraping functionality without authentication, but I see that for running on environments such as AWS or Heroku there is no other way than using an authenticated session AFAIK.
Please check the documentation anonymous-vs-authenticated-session. I've tested both on AWS and Heroku and it seems to work, but mind that Linkedin rate limiting is much more aggressive when using an authenticated session and the scraper is much slower.
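For reference, a minimal sketch of switching to the authenticated session mode, assuming it is enabled through the `LI_AT_COOKIE` environment variable mentioned later in this thread (the cookie value below is a placeholder; the real `li_at` cookie comes from a logged-in browser session):

```shell
# Sketch (assumed configuration): expose the LinkedIn li_at session cookie
# through the LI_AT_COOKIE environment variable so the scraper runs in
# authenticated mode. "<your-li_at-cookie>" is a placeholder.

# Locally:
LI_AT_COOKIE="<your-li_at-cookie>" node index.js

# On Heroku, set it as a config var instead:
heroku config:set LI_AT_COOKIE="<your-li_at-cookie>"
```

Mind that, as noted above, rate limiting is much more aggressive for authenticated sessions, so keep the scraper slow.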
Let me know if this solves your issue!
Ahhh okay thank you so much, appreciate it! I'll give it a try this weekend and let you know.
hi @spinlud - thanks for building the library. it's very helpful.
Adding this for reference / completeness ...
I find the same anonymous issue on Google Cloud Platform. I had previously had connections issues outside the VM but these were resolved with firewall rules. For this test, firewall rules were enabled:
gcloud compute --project=$1 firewall-rules create default-allow-http \
--direction=INGRESS --priority=1000 --network=default \
--action=ALLOW --rules=tcp:80 \
--source-ranges=0.0.0.0/0 --target-tags=http-server
and here is the error I get on running the scraper inside Google Cloud VM / compute:
scraper:error Error: Protocol error (Target.setDiscoverTargets): Target closed.
scraper:error at /home/pp/linkedin-jobs-scraper/node_modules/puppeteer/lib/cjs/puppeteer/common/Connection.js:71:63
scraper:error at new Promise (<anonymous>)
scraper:error at Connection.send (/home/pp/linkedin-jobs-scraper/node_modules/puppeteer/lib/cjs/puppeteer/common/Connection.js:70:16)
scraper:error at Function.create (/home/pp/linkedin-jobs-scraper/node_modules/puppeteer/lib/cjs/puppeteer/common/Browser.js:95:26)
scraper:error at ChromeLauncher.launch (/home/pp/linkedin-jobs-scraper/node_modules/puppeteer/lib/cjs/puppeteer/node/Launcher.js:99:56)
scraper:error at async PuppeteerExtra.launch (/home/pp/linkedin-jobs-scraper/node_modules/puppeteer-extra/dist/index.cjs.js:129:25) +0ms
P.S. I am using chromedriver with Python for other scrapers on GCP compute and they work fine but not for Linkedin.
Hi @ppetruneac-da!
Are you running the application in headless mode (headless: true)? Did you test puppeteer with a sample application on GCP (e.g. just opening a website)?
Also make sure that you are passing the right Chrome options for your environment (GCP). Default are:
[
    "--start-maximized",
    "--window-size=1472,828",
    "--lang=en-GB",
    "--no-sandbox",
    "--disable-setuid-sandbox",
    "--disable-gpu",
    "--disable-dev-shm-usage",
    "--proxy-server='direct://'",
    "--proxy-bypass-list=*",
    "--disable-accelerated-2d-canvas",
    "--allow-running-insecure-content",
    "--disable-web-security",
    "--disable-client-side-phishing-detection",
    "--disable-notifications",
    "--mute-audio",
    "--enable-automation",
]
You can override them in the scraper constructor:
const scraper = new LinkedinScraper({
    headless: true,
    slowMo: 50,
    args: [
        "--lang=en-GB",
    ],
});
I tested in headless mode. Only changed the slowMo value; all other values you referenced were the same.
Have not tested puppeteer with a sample application on GCP (e.g. just opening a website) -- I will do a test and come back.
hi @spinlud - here is the test I ran:
Create a VM, allowing --tags=http-server,https-server:
gcloud beta compute \
--project=ID instances create vm-test \
--zone=us-central1-a --machine-type=e2-medium --subnet=default \
--network-tier=PREMIUM --maintenance-policy=MIGRATE \
--service-account=ID-compute@developer.gserviceaccount.com \
--scopes=https://www.googleapis.com/auth/cloud-platform \
--tags=http-server,https-server --image=ubuntu-1804-bionic-v20201111 \
--image-project=ubuntu-os-cloud --boot-disk-size=10GB --boot-disk-type=pd-standard \
--boot-disk-device-name=vm-test --no-shielded-secure-boot --shielded-vtpm \
--shielded-integrity-monitoring --reservation-affinity=any
cat /etc/os-release:
NAME="Ubuntu"
VERSION="18.04.5 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.5 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
Install some requirements:
sudo -S curl -sL https://deb.nodesource.com/setup_12.x | sudo bash -
sudo apt-get install -y nodejs
npm install puppeteer
Copy this into example.js
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://example.com');
    await page.screenshot({path: 'example.png'});
    await browser.close();
})();
When I run it with node example.js, I get this error:
mac@vm-test:~$ node example.js
(node:4302) UnhandledPromiseRejectionWarning: Error: Failed to launch the browser process!
/home/mac/node_modules/puppeteer/.local-chromium/linux-809590/chrome-linux/chrome: error while loading shared libraries: libatk-1.0.so.0: cannot open shared object file: No such file or directory
TROUBLESHOOTING: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md
at onClose (/home/mac/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:193:20)
at Interface.<anonymous> (/home/mac/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:183:68)
at Interface.emit (events.js:326:22)
at Interface.close (readline.js:416:8)
at Socket.onend (readline.js:194:10)
at Socket.emit (events.js:326:22)
at endReadableNT (_stream_readable.js:1223:12)
at processTicksAndRejections (internal/process/task_queues.js:84:21)
(node:4302) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 1)
(node:4302) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
I am not very familiar with JS, but the same <anonymous> comes up. I am using Chromedriver in Python on GCP and it works fine, but I am not familiar with puppeteer.
Any thoughts on the above? There are many resources on using puppeteer in serverless mode with cloud functions but not too much on VMs.
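As a side note on the UnhandledPromiseRejectionWarning in the log above: it appears because the rejected promise from the async IIFE is never handled. A minimal self-contained sketch of the pattern (the throw just simulates the failed browser launch, no puppeteer required):

```javascript
// Without a .catch() on the async IIFE, Node prints an
// UnhandledPromiseRejectionWarning; with it the error is reported cleanly.
let launchError = null;

(async () => {
    // puppeteer.launch(), page.goto(), etc. would normally go here;
    // the throw below simulates the launch failure from the log above.
    throw new Error("Failed to launch the browser process!");
})().catch((err) => {
    launchError = err;
    console.error("Launch failed:", err.message);
});
```

This doesn't fix the underlying missing-library problem, but it turns the warning into a readable error message.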
@ppetruneac-da probably some dependencies are missing? https://medium.com/@ssmak/how-to-fix-puppetteer-error-while-loading-shared-libraries-libx11-xcb-so-1-c1918b75acc3
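For reference, on Debian/Ubuntu the missing shared library in the error above (libatk-1.0.so.0) is provided by the libatk1.0-0 package; a sketch of installing it together with the other Chromium dependencies listed in the Dockerfile further down in this thread (exact package names may vary by distro release):

```shell
# Sketch: install the shared libraries headless Chromium typically needs
# on Ubuntu 18.04 (libatk-1.0.so.0 comes from libatk1.0-0).
sudo apt-get update
sudo apt-get install -y libatk1.0-0 libatk-bridge2.0-0 libcups2 libdbus-1-3 \
    libgtk-3-0 libnss3 libx11-xcb1 libxcomposite1 libxdamage1 libxrandr2 \
    libasound2 libpangocairo-1.0-0 libxss1 fonts-liberation
```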
Also you may want to look at a Python alternative (selenium + chrome driver): https://github.com/spinlud/py-linkedin-jobs-scraper
@spinlud I have not seen the python one :D Thanks. I will give it a go and let you know how it goes.
@spinlud I tested the Python repo and it's the same. You'd need to authenticate. This is what you get:
INFO:li:scraper:('Implementing strategy AnonymousStrategy',)
INFO:li:scraper:('Starting new query', "Query(query=Engineer options=QueryOptions(limit=10 locations=['Worldwide'] optimize=True))")
INFO:li:scraper:('[Engineer][Worldwide]', 'Opening https://www.linkedin.com/jobs/search?keywords=Engineer&location=Worldwide&redirect=false&position=1&pageNum=0')
WARNING:li:scraper:('[Engineer][Worldwide]', 'Error in response', 'request_id=00D8C9A98A721F9FCD6827A7C56F537D status=999 type=Document mime_type=text/html url=https://www.linkedin.com/jobs/search?keywords=Engineer&location=Worldwide&redirect=false&position=1&pageNum=0')
ERROR:li:scraper:('Scraper failed to run in anonymous mode, authentication may be necessary for this environment. Please check the documentation on how to use an authenticated session.',)
NoneType: None
[ON_END]
@ppetruneac-da That's not an error. If Linkedin requires authentication for traffic generated from AWS it is very likely it asks the same for machines hosted in GCP. You probably need to run your application using an authenticated session, as the logs suggest.
Oh yes! I just made the comment for completion as promised so others will see this too :) Thanks again for your great work!
You are welcome! For any problem with the Python library, please report it in that repo's issues section.
I'm using a dedicated server from the Hetzner company and I'm also getting a 999 error code, even with an authenticated user (i.e. LI_AT_COOKIE, if I'm not wrong).
Something else seems to be affecting the server, but I'm still not sure what it is or why.
On my local machine it works perfectly, even with anonymous search.
More context
I'm using docker with the following configuration:
# A minimal Docker image with Node and Puppeteer
#
# Initially based upon:
# https://github.com/GoogleChrome/puppeteer/blob/master/docs/troubleshooting.md#running-puppeteer-in-docker
FROM node:12.22.0-buster-slim
RUN apt-get update \
&& apt-get install -y wget gnupg gconf-service libasound2 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 \
libfontconfig1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 \
libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 \
libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 ca-certificates fonts-liberation libappindicator1 libnss3 lsb-release xdg-utils \
&& wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add - \
&& sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list' \
&& apt-get update \
# We install Chrome to get all the OS level dependencies, but Chrome itself
# is not actually used as it's packaged in the node puppeteer library.
# Alternatively, we could include the entire dep list ourselves
# (https://github.com/puppeteer/puppeteer/blob/master/docs/troubleshooting.md#chrome-headless-doesnt-launch-on-unix)
# but that seems too easy to get out of date.
&& apt-get install -y google-chrome-stable fonts-kacst fonts-freefont-ttf \
--no-install-recommends \
&& rm -rf /var/lib/apt/lists/*
# && wget --quiet https://raw.githubusercontent.com/vishnubob/wait-for-it/master/wait-for-it.sh -O /usr/sbin/wait-for-it.sh \
# && chmod +x /usr/sbin/wait-for-it.sh
# Install Puppeteer under /node_modules so it's available system-wide
ADD package.json yarn.lock /
RUN yarn
WORKDIR /usr/jobs_scrap/
COPY ./jobs_scrap/package*.json ./
RUN yarn
COPY ./jobs_scrap/ .
Maybe there is a missing library or something with the puppeteer configuration!
I found some mentions of the user agent, but the library uses a random user-agent, so I don't believe that's the issue.
I'll try to update you if I find something.
I added the buildpack to run on Heroku, but it is still failing on the live site. It works properly on localhost. The error log below is from the Heroku logs.