spinlud / linkedin-jobs-scraper


Getting a TypeError for some reason after scraping for a few results #8

Closed rafaelmbsouza closed 3 years ago

rafaelmbsouza commented 4 years ago

First of all, thank you for the well-written code you brought together. Very useful and robust.

When I run it from my computer, after scraping a few vacancies I get an exception that interrupts the crawler, and I can't seem to find the reason for it.

Error: Evaluation failed: TypeError: Cannot read property 'getAttribute' of undefined
    at __puppeteer_evaluation_script__:12:83
    at ExecutionContext._evaluateInternal (C:\projetos\analise_vagas\linkedin_scraper\node_modules\puppeteer\lib\ExecutionContext.js:102:19)
    at async ExecutionContext.evaluate (C:\projetos\analise_vagas\linkedin_scraper\node_modules\puppeteer\lib\ExecutionContext.js:33:16)
    at async _run (C:\projetos\analise_vagas\linkedin_scraper\node_modules\linkedin-jobs-scraper\scraper\LinkedinScraper.js:345:65)
    at async LinkedinScraper.run (C:\projetos\analise_vagas\linkedin_scraper\node_modules\linkedin-jobs-scraper\scraper\LinkedinScraper.js:533:13)
    at async Promise.all (index 0)
    at async C:\projetos\analise_vagas\linkedin_scraper\linkedin_run.js:79:5
  -- ASYNC --
    at ExecutionContext.<anonymous> (C:\projetos\analise_vagas\linkedin_scraper\node_modules\puppeteer\lib\helper.js:116:19)
    at DOMWorld.evaluate (C:\projetos\analise_vagas\linkedin_scraper\node_modules\puppeteer\lib\DOMWorld.js:108:24)
  -- ASYNC --
    at Frame.<anonymous> (C:\projetos\analise_vagas\linkedin_scraper\node_modules\puppeteer\lib\helper.js:116:19)
    at Page.evaluate (C:\projetos\analise_vagas\linkedin_scraper\node_modules\puppeteer\lib\Page.js:680:14)
    at Page.<anonymous> (C:\projetos\analise_vagas\linkedin_scraper\node_modules\puppeteer\lib\helper.js:117:27)
    at _run (C:\projetos\analise_vagas\linkedin_scraper\node_modules\linkedin-jobs-scraper\scraper\LinkedinScraper.js:345:76)
    at async LinkedinScraper.run (C:\projetos\analise_vagas\linkedin_scraper\node_modules\linkedin-jobs-scraper\scraper\LinkedinScraper.js:533:13)
    at async Promise.all (index 0)
    at async C:\projetos\analise_vagas\linkedin_scraper\linkedin_run.js:79:5

The parameters I am running it with are:

    const descriptionProcessor = () => document.querySelector(".description__text")
            .innerText
            .replace(/[\s\n\r]+/g, " ")
            .trim();

    // Run queries concurrently
    await Promise.all([
        scraper.run(
            ["Developer","Desenvolvedor","Software Engineer"],
            ["São Paulo e Região", "Rio de Janeiro e Região", "Curitiba e Região", "Florianópolis e Região"],
            {
                paginationMax: 8,
                descriptionProcessor
            }
        )
    ]);

The code is trying to call getAttribute on an element that is undefined, and for some reason the exception is not being caught by the library. Would you have any idea how to resolve this behavior?
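
For reference, the failure mode seems to be something like the snippet below; the selector is just a placeholder I made up for illustration, not the one the library actually uses.

    // Hypothetical illustration only: the selector is a guess, not the
    // library's actual code. When no matching element exists, the indexed
    // access returns undefined, and calling getAttribute on it throws
    // "Cannot read property 'getAttribute' of undefined".
    const cards = document.querySelectorAll("a.result-card__full-card-link");
    const link = cards[0].getAttribute("href"); // throws if cards is empty

    // A defensive variant that skips missing elements instead of throwing:
    const safeLink = cards[0] ? cards[0].getAttribute("href") : null;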

Thank you

spinlud commented 4 years ago

Hi @rafaelmbsouza, thanks! I am glad you found it useful. Usually that error happens when Puppeteer tries to access stale data from the Chromium driver, or when a LinkedIn job has corrupted data. I've released a possible fix for this in v1.4.2: it should catch the error and prevent the program from terminating abruptly. Let me know!
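
If it helps, here is a minimal sketch of how a consumer can observe those caught errors, assuming the events.custom.error event used elsewhere in this thread (the exact behaviour depends on the library version):

const { LinkedinScraper, events } = require("linkedin-jobs-scraper");

const scraper = new LinkedinScraper({ headless: true, slowMo: 10 });

// With the fix, a failure on a single job should be reported through the
// error event instead of crashing the whole run, so it can simply be logged.
scraper.on(events.custom.error, (err) => {
    console.error("Scraper error (run continues):", err);
});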

rafaelmbsouza commented 4 years ago

I made a few more tests pulling the modifications you committed recently, and here is what I found: apparently, if I use a list of job titles, the scraping works fine for a single location (I am able to pull 1000+ job postings). When I try different job titles combined with different locations, the uncaught exception shows up pretty quickly. My hypothesis is that the exception happens when the results for a given location are exhausted, or something like that. Also, I noticed that LinkedIn changes the "show more results" button after a few iterations (from scrolling to clicking). I am not sure if this is handled in your code, as I did not have the time or the expertise with Puppeteer to review it.

Anyway, your modifications significantly improved my experience with the module, and I will try to contribute a few pull requests later, once I understand the edge cases.

spinlud commented 4 years ago

The actual behaviour is to scroll down and wait for the "see more jobs" button to appear, click it, and wait until the count of job elements (<li>) increases. There is also a timeout to avoid waiting forever (no more jobs or some other unexpected condition). I found this to work in the majority of cases, but of course it could be improved. Mind also that LinkedIn does not much like scraping of its website (even for the public content): if you push concurrency too hard, it is very likely your IP will be blocked for some time and you will start experiencing failures in page loads. Anyway, PRs are more than welcome! 😎
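
For reference, a rough sketch of that loop in plain Puppeteer could look like the following; the function name and CSS selectors are placeholders of mine, not the ones actually used internally.

// Sketch only: "loadMoreJobs" and the selectors are illustrative assumptions.
async function loadMoreJobs(page, timeout = 5000) {
    // Count the job <li> elements currently rendered (selector is a guess)
    const before = await page.evaluate(
        () => document.querySelectorAll(".jobs-search__results-list li").length
    );

    // Scroll to the bottom so the "see more jobs" button shows up
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));

    // Click the button if it is present (selector is a guess)
    const button = await page.$("button.infinite-scroller__show-more-button");
    if (button) {
        await button.click();
    }

    try {
        // Wait until the <li> count increases, with a timeout to avoid
        // waiting forever (no more jobs or some other unexpected condition)
        await page.waitForFunction(
            (count) => document.querySelectorAll(".jobs-search__results-list li").length > count,
            { timeout },
            before
        );
        return true;
    } catch (e) {
        return false; // timeout on fetching more jobs
    }
}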

Jajafarov commented 4 years ago

It seems like this issue still persists in the newest version. Trying to scrape a larger number of jobs leads to the error occurring and no more jobs being scraped even though the program doesn't terminate. Are there ways to work around this?

spinlud commented 4 years ago

Hi! Could you provide the code for the query causing the issue?

Jajafarov commented 4 years ago

This is the simplest version that leads to this problem after about 40 jobs are found.

const { LinkedinScraper, events } = require("linkedin-jobs-scraper");

(async () => {
    // Programmatically disable the logger after 5 seconds
    setTimeout(() => LinkedinScraper.disableLogger(), 5000);

    const scraper = new LinkedinScraper({
        headless: true,
        slowMo: 10,
    });

    // Listen for custom events
    scraper.on(events.custom.data, ({
        query,
        location,
        link,
        title,
        company,
        place,
        date,
        description,
        senorityLevel,
        jobFunction,
        employmentType,
        industries 
    }) => {
        console.log(title);
    });

    scraper.on(events.custom.error, (err) => { console.error(err); });
    scraper.on(events.custom.end, () => {
        console.log('All done');
    });

    // Listen for puppeteer specific browser events
    scraper.on(events.puppeteer.browser.targetcreated, () => { });
    scraper.on(events.puppeteer.browser.targetchanged, () => { });
    scraper.on(events.puppeteer.browser.targetdestroyed, () => { });
    scraper.on(events.puppeteer.browser.disconnected, () => { });

    // Run queries concurrently
    await Promise.all([
        scraper.run(
            ['Sales'],
            ['United Kingdom'],
            {
                paginationMax: 1000,
            }
        ),
    ]);

    // Close browser
    await scraper.close();
})();

spinlud commented 4 years ago

I think I've found the problem. Could you try the latest version 1.6.0 and let me know?

Jajafarov commented 4 years ago

The new update seems to have fixed it. Thank you.

artjoms-formulevics commented 4 years ago

Hi. A big thank you for creating and maintaining this scraper! I am using the latest version, yet it still leads to errors when there are many jobs.

Example variables I am using (with other parameters as in standard basic example):

var queries = ["Java", "Python", "C#"]
var locations = ["Germany"]
var pag = 10

I am still getting a lot of errors:

 at __puppeteer_evaluation_script__:12:87
[Java][Germany][9]  Evaluation failed: TypeError: Cannot read property 'getAttribute' of undefined

and then:

at __puppeteer_evaluation_script__:12:87
[Java][Germany][9]  Timeout on fetching more jobs

I am getting 630 results out of an expected 750 (if I am not mistaken: 25 posts per page × 10 loads × 3 queries × 1 location).

spinlud commented 4 years ago

Hi @artjoms-formulevics! I tried to run your query but I was unable to reproduce the issue: I got 749/750 jobs processed. There was one timeout on a job, but it is expected to get one or more timeouts during a long execution (the root cause can be many things, e.g. too much load on the server, an unstable connection, etc.).

[screenshot of the scraper output showing 749/750 jobs processed]

This is the code I have used:

const { LinkedinScraper, events, } = require("linkedin-jobs-scraper");

(async () => {
    // Each scraper instance is associated with one browser.
    // Concurrent queries will be run on different pages within the same browser instance.
    const scraper = new LinkedinScraper({
        headless: true,
        slowMo: 10,
    });

    // Listen for custom events
    scraper.on(events.custom.data, ({ query, location, link, title, company, place, description, date }) => {
        console.log(
            description.length,
            // `Query='${query}'`,
            `Title='${title}'`,
            `Location='${location}'`,
            `Company='${company}'`,
            `Place='${place}'`,
            `Date='${date}'`,
            `Link='${link}'`,
        );
    });

    scraper.on(events.custom.error, (err) => { console.error(err); });
    scraper.on(events.custom.end, () => { });

    // Listen for puppeteer specific browser events
    scraper.on(events.puppeteer.browser.targetcreated, () => { });
    scraper.on(events.puppeteer.browser.targetchanged, () => { });
    scraper.on(events.puppeteer.browser.targetdestroyed, () => { });
    scraper.on(events.puppeteer.browser.disconnected, () => { });

    // Run queries concurrently
    await Promise.all([
        scraper.run(
            ["Java", "Python", "C#"],
            ["Germany"],
            {
                paginationMax: 10,
            }
        ),
    ]);

    // Close browser
    await scraper.close();
})();

You can also try to increase the value of slowMo: it slows down browser operations by the specified number of milliseconds, making the interaction more human-like and less likely to make the server angry (but of course it will also slow down processing). You can read more about it here.

const scraper = new LinkedinScraper({
    headless: true,
    slowMo: 20,
});

artjoms-formulevics commented 4 years ago

@spinlud,

Thanks for the prompt reply! You are right, it is probably something to do with the connection or the servers, not the script's fault anyway. And increasing the slowMo param absolutely helped. Thanks a lot & sorry for the false accusation! ;)