Open omri08 opened 3 years ago
Hi @omri08 !
This seems like a cool idea. I've just released a `@next` version with a `proxies` option in the constructor:
const scraper = new LinkedinScraper({
    headless: false,
    slowMo: 100,
    args: [
        "--lang=en-GB",
    ],
    proxies: [
        '<proxy1_address>:<proxy1_port>',
        '<proxy2_address>:<proxy2_port>',
        // ...
    ]
});
To be honest, I know very little about proxy usage. I made the following quick attempts for testing:
// Minimal local proxy used for testing (listens on port 8888).
const http = require('http');
const net = require('net');
const httpProxy = require('http-proxy');
const url = require('url');

const proxy = httpProxy.createServer();

// Plain HTTP requests: forward them to the host named in the request URL.
const server = http.createServer(function (req, res) {
    console.log('Receiving reverse proxy request for:' + req.url);
    const parsedUrl = url.parse(req.url);
    const target = parsedUrl.protocol + '//' + parsedUrl.hostname;
    proxy.web(req, res, {target: target, secure: false});
}).listen(8888);

// HTTPS requests arrive as CONNECT: open a raw TCP tunnel to the target.
server.on('connect', function (req, socket) {
    console.log('Receiving reverse proxy request for:' + req.url);
    const serverUrl = url.parse('https://' + req.url);
    const srvSocket = net.connect(serverUrl.port, serverUrl.hostname, function () {
        socket.write('HTTP/1.1 200 Connection Established\r\n' +
            'Proxy-agent: Node-Proxy\r\n' +
            '\r\n');
        srvSocket.pipe(socket);
        socket.pipe(srvSocket);
    });
});
* Using [docker-tinyproxy](https://github.com/monokal/docker-tinyproxy):
```sh
docker run -d --name='tinyproxy' -p 8888:8888 monokal/tinyproxy:latest 'ANY'
```
Both attempts failed. I see some requests being forwarded and the page loads, but at some point it stops working. Honestly, I don't know whether the problem is in the proxy (likely) or on the Puppeteer/puppeteer-page-proxy side.
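One way to narrow down whether the proxy or the Puppeteer side is at fault is to exercise the proxy on its own, outside the browser. The following is a minimal sketch using only Node's built-in `http` module. `checkProxyTunnel` is a hypothetical helper name introduced here for illustration, not part of linkedin-jobs-scraper or puppeteer-page-proxy.

```javascript
const http = require('http');

// Hypothetical helper: asks the proxy to open a CONNECT tunnel to `target`
// (a "host:port" string) and resolves with the HTTP status code the proxy
// answered with. Rejects on connection errors or when the timeout expires.
function checkProxyTunnel(proxyHost, proxyPort, target, timeoutMs = 5000) {
    return new Promise((resolve, reject) => {
        const req = http.request({
            host: proxyHost,
            port: proxyPort,
            method: 'CONNECT',
            path: target,
            headers: { Host: target },
        });

        // Abort if the proxy never answers.
        req.setTimeout(timeoutMs, () => {
            req.destroy(new Error('Proxy did not answer within ' + timeoutMs + ' ms'));
        });

        // 'connect' fires once the proxy has replied to the CONNECT request.
        req.on('connect', (res, socket) => {
            socket.destroy(); // only the status is needed, so close the tunnel
            resolve(res.statusCode);
        });

        req.on('error', reject);
        req.end();
    });
}
```

For example, `checkProxyTunnel('127.0.0.1', 8888, 'www.linkedin.com:443').then(console.log, console.error)` should print `200` if the proxy can open tunnels at all. That only covers the initial handshake, so it would not by itself explain a proxy that stops working later in a session.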
If you want to spend some time diving into it that's great! 😎
You can install this new version using `npm install linkedin-jobs-scraper@next`.
Let me know if you find something we have to fix to make it work properly 🙃
Great! I will try to test in the next few days and give you an update with the result 😄
Hi @spinlud,
So I tried to work on this today but unfortunately couldn't solve the problem.
I created a proxy server on AWS EC2 to make sure the problem is not with the local proxy.
I also tried changing the proxy library from `puppeteer-page-proxy` to `puppeteer-proxy`, but got the same result.
This is the log I'm always getting:
scraper:info Env variable LI_AT_COOKIE detected. Using LoggedInRunStrategy +0ms
scraper:info Setting chrome launch options {
headless: false,
args: [
'--enable-automation',
'--start-maximized',
'--window-size=1472,828',
'--lang=en-GB',
'--no-sandbox',
'--disable-setuid-sandbox',
'--disable-gpu',
'--disable-dev-shm-usage',
'--no-sandbox',
'--disable-setuid-sandbox',
'--disable-dev-shm-usage',
"--proxy-server='direct://",
'--proxy-bypass-list=*',
'--disable-accelerated-2d-canvas',
'--disable-gpu',
'--allow-running-insecure-content',
'--disable-web-security',
'--disable-client-side-phishing-detection',
'--disable-notifications',
'--mute-audio',
'--lang=en-GB'
],
defaultViewport: null,
pipe: true,
slowMo: 100,
proxies: [ 'http://18.193.119.8:9876' ]
} +3ms
scraper:info [ ][Israel] Starting new query: query=" " location="Israel" +802ms
scraper:info [ ][Israel] Query options { locations: [ 'Israel' ], limit: 10000, optimize: true } +0ms
scraper:info Setting authentication cookie +4s
scraper:info [ ][Israel] Opening https://www.linkedin.com/jobs/search?keywords=+&location=Israel&redirect=false&position=1&pageNum=0 +214ms
scraper:warn [ ][Israel] 400 Error for request https://www.linkedin.com/homepage-guest/api/ingraphs/gauge?csrfToken=ajax%3A7235276711670027529 +0ms
scraper:info [ ][Israel] Session is valid +10s
scraper:info [ ][Israel] Jobs fetched: 7 +111ms
scraper:error [ ][Israel][1] Timeout on loading job details +0ms
scraper:error [ ][Israel][1] Timeout on loading job details +5s
scraper:error [ ][Israel][1] Timeout on loading job details +5s
scraper:error [ ][Israel][1] Timeout on loading job details +5s
scraper:error [ ][Israel][1] Timeout on loading job details +5s
scraper:error [ ][Israel][1] Timeout on loading job details +5s
scraper:error [ ][Israel][1] Timeout on loading job details +5s
scraper:info [ ][Israel][1] Pagination requested (2) +32s
scraper:info [ ][Israel][1] Timeout on loading more jobs +2s
scraper:info [ ][Israel][1] There are no more jobs available for the current query +1ms
All done!
And these are the logs from the proxy server:
Anyway, thanks for the effort, and sorry for not being able to solve this issue 😅
Hey, thanks anyway for trying! It probably requires more time to investigate. I am very busy right now, but if there is any update on the matter I'll let you know! 😎
Hi @spinlud 😄 What do you think about adding an option to use a proxy? I saw this library: https://github.com/Cuadrix/puppeteer-page-proxy for setting a proxy per page/request.
I think the best use case for this option is when we get `429 Too Many Requests`. We could pick a proxy server from a list of proxy servers the user provides, and if that proxy server also gets a 429, switch to another one. We could maybe do something like this:
I don't have experience with Puppeteer, but if you think it's something that can be done, I don't mind diving into it 😄
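The rotation idea described above (move to the next proxy in the user's list whenever the current one gets a 429) could be sketched roughly as follows. `ProxyRotator` and its methods are hypothetical names used only for illustration. This is not the snippet from the original post and not how linkedin-jobs-scraper implements its `proxies` option.

```javascript
// Hypothetical helper: keeps track of which proxy from a user-supplied list
// is currently in use and advances to the next one on HTTP 429.
class ProxyRotator {
    constructor(proxies) {
        if (!Array.isArray(proxies) || proxies.length === 0) {
            throw new Error('At least one proxy is required');
        }
        this.proxies = proxies;
        this.index = 0;
    }

    // The proxy that should be used for the next request.
    current() {
        return this.proxies[this.index];
    }

    // Report the status of a response received through the current proxy.
    // On 429 the rotator moves to the next proxy, wrapping around at the end
    // of the list. Returns the proxy to use from now on.
    report(status) {
        if (status === 429) {
            this.index = (this.index + 1) % this.proxies.length;
        }
        return this.current();
    }
}
```

With Puppeteer, the status could be fed in from a `page.on('response', ...)` listener via `response.status()`, and `rotator.current()` passed to whatever mechanism applies the proxy per page or request. Wrapping around the list is an assumption. A real implementation might instead drop proxies that keep returning 429, or back off before retrying.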