spinlud / linkedin-jobs-scraper


Duplicated data #15

Closed: leenash02 closed this issue 3 years ago

leenash02 commented 3 years ago

Hey spinlud, thanks for the scraper, it works perfectly! I passed an empty string as the query and managed to get 3000 results. However, it turns out roughly 2700 of those are duplicates; the actual data I was able to obtain was up to 300 unique job postings. It seems the scraper was not able to get past some paging mechanism in LinkedIn and looped over what it could reach. The data it obtained is excellent though, and I would love to use it to get more data. Any ideas? Cheers!

spinlud commented 3 years ago

Hi there! Can you share the code you are using?

tparvi commented 3 years ago

Not OP, but I can confirm this. The log shows the URL that is used to fetch the jobs. If I open that URL in an incognito window I get 33 results, but the scraper returns 54 results. Some jobs are included multiple times. In my case there is no next page: all 33 jobs are displayed, and the bottom of the page says "You've viewed all jobs for this search".

Qerying following skills c#
  scraper:info Implementing LoggedOutRunStrategy. +0ms
Running scraper
  scraper:info Setting chrome launch options { headless: true,
  args:
   [ '--enable-automation',
     '--start-maximized',
     '--window-size=1472,828',
     '--lang=en-GB',
     '--no-sandbox',
     '--disable-setuid-sandbox',
     '--disable-gpu',
     '--disable-dev-shm-usage',
     '--no-sandbox',
     '--disable-setuid-sandbox',
     '--disable-dev-shm-usage',
     '--proxy-server=\'direct://',
     '--proxy-bypass-list=*',
     '--disable-accelerated-2d-canvas',
     '--disable-gpu',
     '--allow-running-insecure-content',
     '--disable-web-security',
     '--disable-client-side-phishing-detection',
     '--disable-notifications',
     '--mute-audio' ],
  defaultViewport: null,
  pipe: true,
  slowMo: 50 } +1ms
  scraper:info [c#][Finland] Starting new query: query="c#" location="Finland" +235ms
  scraper:info [c#][Finland] Query options { locations: [ 'Finland' ],
  limit: 500,
  optimize: true,
  filters: { relevance: 'DD', time: '1,2' } } +1ms
  scraper:info [c#][Finland] Opening https://www.linkedin.com/jobs/search?keywords=c%23&location=Finland&sortBy=DD&f_TP=1%2C2&redirect=false&position=1&pageNum=0 +860ms
  scraper:info [c#][Finland] Jobs fetched: 24 +3s

tparvi commented 3 years ago

For testing purposes I am using the following code:

const {
  events,
  LinkedinScraper,
  relevanceFilter,
  timeFilter
} = require('linkedin-jobs-scraper');

let numberOfJobsScraped = 0;
let numberOfScrapingErrors = 0;

(async () => {
  const scraper = new LinkedinScraper({
    headless: true,
    slowMo: 50,
  });

  // Count every emitted job and log its ID so duplicates are visible.
  scraper.on(events.scraper.data, (data) => {
    numberOfJobsScraped++;
    console.log('Got job', data.jobId);
  });

  scraper.on(events.scraper.error, (err) => {
    console.log(err);
    numberOfScrapingErrors++;
  });

  scraper.on(events.scraper.end, () => {
    console.log('Scraping ended');
  });

  // Browser-level events are subscribed to but intentionally ignored.
  scraper.on(events.puppeteer.browser.targetcreated, () => {});
  scraper.on(events.puppeteer.browser.targetchanged, () => {});
  scraper.on(events.puppeteer.browser.targetdestroyed, () => {});
  scraper.on(events.puppeteer.browser.disconnected, () => {});

  console.log('Running scraper');

  // One query, with global options passed as the second argument.
  await scraper.run({
    query: 'c#',
    options: {
      locations: ['Finland'],
      limit: 500,
      filters: {
        relevance: relevanceFilter.RECENT,
        time: timeFilter.WEEK
      }
    }
  }, {
    optimize: true
  });

  console.log('Closing browser');
  await scraper.close();

  console.log(`Jobs scraped: ${numberOfJobsScraped} Scraping Errors: ${numberOfScrapingErrors}`);
  console.log(`Scraping tool ended: ${new Date().toISOString()}`);
})();

Below you can see the output of that run. I am logging the jobId, and from the output you can see that some jobIds are printed multiple times, e.g. 2313231277.

  scraper:info Implementing LoggedOutRunStrategy. +0ms
Running scraper
  scraper:info Setting chrome launch options { headless: true,
  args:
   [ '--enable-automation',
     '--start-maximized',
     '--window-size=1472,828',
     '--lang=en-GB',
     '--no-sandbox',
     '--disable-setuid-sandbox',
     '--disable-gpu',
     '--disable-dev-shm-usage',
     '--no-sandbox',
     '--disable-setuid-sandbox',
     '--disable-dev-shm-usage',
     '--proxy-server=\'direct://',
     '--proxy-bypass-list=*',
     '--disable-accelerated-2d-canvas',
     '--disable-gpu',
     '--allow-running-insecure-content',
     '--disable-web-security',
     '--disable-client-side-phishing-detection',
     '--disable-notifications',
     '--mute-audio' ],
  defaultViewport: null,
  pipe: true,
  slowMo: 50 } +3ms
  scraper:info [c#][Finland] Starting new query: query="c#" location="Finland" +231ms
  scraper:info [c#][Finland] Query options { locations: [ 'Finland' ],
  limit: 500,
  optimize: true,
  filters: { relevance: 'DD', time: '1,2' } } +0ms
  scraper:info [c#][Finland] Opening https://www.linkedin.com/jobs/search?keywords=c%23&location=Finland&sortBy=DD&f_TP=1%2C2&redirect=false&position=1&pageNum=0 +856ms
  scraper:info [c#][Finland] Jobs fetched: 25 +3s
Got job 2339560193
  scraper:info [c#][Finland][1] Processed +344ms
Got job 2313231277
  scraper:info [c#][Finland][2] Processed +913ms
Got job 2352140648
  scraper:info [c#][Finland][3] Processed +810ms
Got job 2332637692
  scraper:info [c#][Finland][4] Processed +682ms
Got job 2337486044
  scraper:info [c#][Finland][5] Processed +671ms
Got job 2331660739
  scraper:info [c#][Finland][6] Processed +918ms
Got job 2350231453
  scraper:info [c#][Finland][7] Processed +670ms
Got job 2298699594
  scraper:info [c#][Finland][8] Processed +781ms
Got job 2322976854
  scraper:info [c#][Finland][9] Processed +811ms
Got job 2312805498
  scraper:info [c#][Finland][10] Processed +796ms
Got job 2349858398
  scraper:info [c#][Finland][11] Processed +813ms
Got job 2324012574
  scraper:info [c#][Finland][12] Processed +1s
Got job 2348331907
  scraper:info [c#][Finland][13] Processed +811ms
Got job 2332200182
  scraper:info [c#][Finland][14] Processed +668ms
Got job 2329224561
  scraper:info [c#][Finland][15] Processed +1s
Got job 2346743599
  scraper:info [c#][Finland][16] Processed +779ms
  scraper:error [c#][Finland][17] Timeout on loading job details +0ms
[c#][Finland][17]       Timeout on loading job details
Got job 2330867077
  scraper:info [c#][Finland][17] Processed +6s
Got job 2345736108
  scraper:info [c#][Finland][18] Processed +681ms
Got job 2328465421
  scraper:info [c#][Finland][19] Processed +794ms
Got job 2345329129
  scraper:info [c#][Finland][20] Processed +684ms
Got job 2328438853
  scraper:info [c#][Finland][21] Processed +794ms
Got job 2328427743
  scraper:info [c#][Finland][22] Processed +806ms
Got job 2344618161
  scraper:info [c#][Finland][23] Processed +669ms
Got job 2326262469
  scraper:info [c#][Finland][24] Processed +935ms
  scraper:info [c#][Finland][24] Fecthing new jobs +0ms
  scraper:info [c#][Finland][24] Checking for new jobs to load... +62ms
  scraper:info [c#][Finland][24] Jobs fetched: 31 +1s
Got job 2339560193
  scraper:info [c#][Finland][25] Processed +342ms
Got job 2313231277
  scraper:info [c#][Finland][26] Processed +313ms
Got job 2352140648
  scraper:info [c#][Finland][27] Processed +326ms
Got job 2332637692
  scraper:info [c#][Finland][28] Processed +311ms
Got job 2337486044
  scraper:info [c#][Finland][29] Processed +314ms
Got job 2331660739
  scraper:info [c#][Finland][30] Processed +311ms
Got job 2350231453
  scraper:info [c#][Finland][31] Processed +313ms
Got job 2298699594
  scraper:info [c#][Finland][32] Processed +312ms
Got job 2322976854
  scraper:info [c#][Finland][33] Processed +313ms
Got job 2312805498
  scraper:info [c#][Finland][34] Processed +325ms
Got job 2349858398
  scraper:info [c#][Finland][35] Processed +311ms
Got job 2324012574
  scraper:info [c#][Finland][36] Processed +345ms
Got job 2348331907
  scraper:info [c#][Finland][37] Processed +328ms
Got job 2332200182
  scraper:info [c#][Finland][38] Processed +326ms
Got job 2329224561
  scraper:info [c#][Finland][39] Processed +343ms
Got job 2346743599
  scraper:info [c#][Finland][40] Processed +326ms
  scraper:error [c#][Finland][41] Timeout on loading job details +18s
[c#][Finland][41]       Timeout on loading job details
Got job 2330867077
  scraper:info [c#][Finland][41] Processed +5s
Got job 2345736108
  scraper:info [c#][Finland][42] Processed +313ms
Got job 2328465421
  scraper:info [c#][Finland][43] Processed +312ms
Got job 2345329129
  scraper:info [c#][Finland][44] Processed +312ms
Got job 2328438853
  scraper:info [c#][Finland][45] Processed +314ms
Got job 2328427743
  scraper:info [c#][Finland][46] Processed +325ms
Got job 2344618161
  scraper:info [c#][Finland][47] Processed +311ms
Got job 2326262469
  scraper:info [c#][Finland][48] Processed +312ms
Got job 2326254625
  scraper:info [c#][Finland][49] Processed +813ms
Got job 2326247738
  scraper:info [c#][Finland][50] Processed +792ms
Got job 2348698389
  scraper:info [c#][Finland][51] Processed +683ms
Got job 2344399784
  scraper:info [c#][Finland][52] Processed +684ms
Got job 2326237530
  scraper:info [c#][Finland][53] Processed +809ms
Got job 2344284662
  scraper:info [c#][Finland][54] Processed +806ms
  scraper:info [c#][Finland][54] Fecthing new jobs +0ms
  scraper:info [c#][Finland][54] Checking for new jobs to load... +63ms
  scraper:info [c#][Finland][54] There are no more jobs available for the current query +3s
Scraping ended
Closing browser
Jobs scraped: 54 Scraping Errors: 2
Scraping tool ended: 2020-12-21T10:29:50.139Z
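
For anyone stuck on an affected version, one possible stopgap is to filter repeats on the client side by jobId before counting or storing a job. The sketch below is only an illustration: `dedupeByJobId` is a hypothetical helper, not part of linkedin-jobs-scraper, and the sample array stands in for the scraper's data events using IDs from the log above.

```javascript
// Hypothetical helper (not part of linkedin-jobs-scraper): wraps a data
// handler so that each jobId reaches the wrapped handler at most once.
function dedupeByJobId(handler) {
  const seen = new Set();
  return (data) => {
    if (seen.has(data.jobId)) {
      return false; // duplicate, skipped
    }
    seen.add(data.jobId);
    handler(data);
    return true; // first occurrence, forwarded
  };
}

// Stand-in for the scraper's data events: two IDs repeat, as in the log.
const unique = [];
const onData = dedupeByJobId((data) => unique.push(data.jobId));

['2339560193', '2313231277', '2339560193', '2313231277', '2326254625']
  .forEach((jobId) => onData({ jobId }));

console.log(unique.length); // prints 3
```

In the test script above, the wrapped handler could presumably be registered in place of the plain one, e.g. `scraper.on(events.scraper.data, dedupeByJobId((data) => { ... }))`. This only hides the repeats; it does not make the scraper reach more postings.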

spinlud commented 3 years ago

Hi, thanks for sharing the code! I have found a bug in the jobs loop. Could you retry with the latest version and see if this solves your issue?

tparvi commented 3 years ago

The latest version works. I didn't get any duplicates. Thank you!