spinlud / linkedin-jobs-scraper


Adding proxy options #17

Open · omri08 opened this issue 3 years ago

omri08 commented 3 years ago

Hi @spinlud 😄 What do you think about the option to add a proxy? I saw this library for adding a proxy per page/request: https://github.com/Cuadrix/puppeteer-page-proxy.

I think the best use case for this option would be when we get 429 Too Many Requests responses. We could pick a proxy server from a list the user provides, and if that proxy also gets a 429 we switch to the next one.

We could maybe do something like this:

    // Proposed usage (pseudocode): switch proxy at runtime when an error occurs
    const proxyListUserWantsToUse = [];
    scraper.on(events.scraper.error, (err) => {
        console.error(err);
        scraper.updateProxy(takeNextProxyFromList()); // or useProxy(), name to be decided
    });

I don't have experience with Puppeteer, but if you think it's something that can be done I don't mind diving into it 😄

spinlud commented 3 years ago

Hi @omri08! This seems like a cool idea; I've just released a @next version with a `proxies` option in the constructor:

```js
const scraper = new LinkedinScraper({
    headless: false,
    slowMo: 100,
    args: [
        "--lang=en-GB",
    ],
    proxies: [
        '<proxy1_address>:<proxy1_port>',
        '<proxy2_address>:<proxy2_port>',
        // ...
    ]
});
```
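
For anyone who wants to try it end to end, here's a minimal sketch of how this could be wired up. Hedged: the `run()` query/options shape is inferred from the debug log later in this thread and from the README at the time, so exact field names may differ.

```js
const { LinkedinScraper, events } = require("linkedin-jobs-scraper");

(async () => {
    const scraper = new LinkedinScraper({
        headless: false,
        slowMo: 100,
        args: ["--lang=en-GB"],
        proxies: [
            "<proxy1_address>:<proxy1_port>", // placeholder, as above
        ],
    });

    // Log scraped jobs, errors (e.g. 429 responses) and the end of the run
    scraper.on(events.scraper.data, (data) => console.log(data));
    scraper.on(events.scraper.error, (err) => console.error(err));
    scraper.on(events.scraper.end, () => console.log("All done!"));

    // Query shape inferred from the log below (locations, limit)
    await scraper.run([
        { query: "Engineer", options: { locations: ["Israel"], limit: 10 } },
    ]);

    await scraper.close();
})();
```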

To be honest I know very little about proxy usage; I've tried the following quick attempts for testing:

* Using a simple Node reverse proxy built with the `http-proxy` package:

```js
const http = require('http');
const net = require('net');
const url = require('url');
const httpProxy = require('http-proxy');

// Proxy instance used to forward plain HTTP requests to their original target
const proxy = httpProxy.createServer();

const server = http.createServer((req, res) => {
    console.log('Receiving reverse proxy request for:' + req.url);
    const parsedUrl = url.parse(req.url);
    const target = parsedUrl.protocol + '//' + parsedUrl.hostname;
    proxy.web(req, res, { target: target, secure: false });
}).listen(8888);

// Handle HTTPS tunnelling (CONNECT method) by piping the raw sockets
server.on('connect', (req, socket) => {
    console.log('Receiving reverse proxy request for:' + req.url);
    const serverUrl = url.parse('https://' + req.url);

    const srvSocket = net.connect(serverUrl.port, serverUrl.hostname, () => {
        socket.write('HTTP/1.1 200 Connection Established\r\n' +
            'Proxy-agent: Node-Proxy\r\n' +
            '\r\n');
        srvSocket.pipe(socket);
        socket.pipe(srvSocket);
    });
});
```

* Using [docker-tinyproxy](https://github.com/monokal/docker-tinyproxy):
```sh
docker run -d --name='tinyproxy' -p 8888:8888 monokal/tinyproxy:latest 'ANY'
```

Both failed. I see some requests forwarded and the page loading, but at some point it stops working. Honestly I don't know if this is a problem on the proxy side (likely) or on the Puppeteer/puppeteer-page-proxy side.
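
For reference, this is the basic per-page usage of puppeteer-page-proxy taken from its README (a standalone sketch, not the scraper's internal code), in case the problem turns out to be on that side:

```js
const puppeteer = require("puppeteer");
const useProxy = require("puppeteer-page-proxy");

(async () => {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();

    // Route all of this page's traffic through the given proxy
    await useProxy(page, "http://<proxy_address>:<proxy_port>");

    await page.goto("https://www.linkedin.com");
    await browser.close();
})();
```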

If you want to spend some time diving into it that's great! 😎

You can install this new version using `npm install linkedin-jobs-scraper@next`. Let me know if you find something we have to fix to make it work properly 🙃

omri08 commented 3 years ago

Great! I will try to test it in the next few days and give you an update with the results 😄

omri08 commented 3 years ago

Hi @spinlud, so I tried to work on this today but unfortunately couldn't solve the problem. I created a proxy server on AWS EC2 to make sure the problem wasn't with the local proxy. I also tried switching the proxy library from puppeteer-page-proxy to puppeteer-proxy, but got the same result. This is the log I always get:

```
  scraper:info Env variable LI_AT_COOKIE detected. Using LoggedInRunStrategy +0ms
  scraper:info Setting chrome launch options {
  headless: false,
  args: [
    '--enable-automation',
    '--start-maximized',
    '--window-size=1472,828',
    '--lang=en-GB',
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-gpu',
    '--disable-dev-shm-usage',
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-dev-shm-usage',
    "--proxy-server='direct://",
    '--proxy-bypass-list=*',
    '--disable-accelerated-2d-canvas',
    '--disable-gpu',
    '--allow-running-insecure-content',
    '--disable-web-security',
    '--disable-client-side-phishing-detection',
    '--disable-notifications',
    '--mute-audio',
    '--lang=en-GB'
  ],
  defaultViewport: null,
  pipe: true,
  slowMo: 100,
  proxies: [ 'http://18.193.119.8:9876' ]
} +3ms
  scraper:info [ ][Israel] Starting new query: query=" " location="Israel" +802ms
  scraper:info [ ][Israel] Query options { locations: [ 'Israel' ], limit: 10000, optimize: true } +0ms
  scraper:info Setting authentication cookie +4s
  scraper:info [ ][Israel] Opening https://www.linkedin.com/jobs/search?keywords=+&location=Israel&redirect=false&position=1&pageNum=0 +214ms
  scraper:warn [ ][Israel] 400 Error for request https://www.linkedin.com/homepage-guest/api/ingraphs/gauge?csrfToken=ajax%3A7235276711670027529 +0ms
  scraper:info [ ][Israel] Session is valid +10s
  scraper:info [ ][Israel] Jobs fetched: 7 +111ms
  scraper:error [ ][Israel][1] Timeout on loading job details +0ms
  scraper:error [ ][Israel][1] Timeout on loading job details +5s
  scraper:error [ ][Israel][1] Timeout on loading job details +5s
  scraper:error [ ][Israel][1] Timeout on loading job details +5s
  scraper:error [ ][Israel][1] Timeout on loading job details +5s
  scraper:error [ ][Israel][1] Timeout on loading job details +5s
  scraper:error [ ][Israel][1] Timeout on loading job details +5s
  scraper:info [ ][Israel][1] Pagination requested (2) +32s
  scraper:info [ ][Israel][1] Timeout on loading more jobs +2s
  scraper:info [ ][Israel][1] There are no more jobs available for the current query +1ms
All done!
```

And these are the logs from the proxy server (screenshot attached).

Anyway, thanks for the effort, and sorry for not being able to solve this issue 😅

spinlud commented 3 years ago

Hey, thanks anyway for trying! It probably requires more time to investigate. I'm very busy right now, but if there's any update on the matter I'll let you know! 😎