Closed rafaelmbsouza closed 3 years ago
Hi @rafaelmbsouza, thanks! I am glad you found it useful.
Usually that error happens when Puppeteer tries to access stale data from the Chromium driver, or when a LinkedIn job has corrupted data. I've released a possible fix for this in v1.4.2; it should catch the error and prevent the program from terminating abruptly.
Let me know!
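For context, the guard described above can be sketched roughly like this (illustrative only, not the actual library code; `safeExtract` and its shape are hypothetical):

```javascript
// Illustrative sketch (not the actual v1.4.2 code): wrap per-job field
// extraction so a job with corrupted or stale data is skipped instead of
// crashing the whole scraping run. `safeExtract` is a hypothetical helper.
function safeExtract(element, attribute) {
    try {
        // element may be undefined/null when LinkedIn returns corrupted markup
        return element.getAttribute(attribute);
    } catch (err) {
        // swallow the error and signal "no data" so the run can continue
        return null;
    }
}
```

A guard like this turns a fatal TypeError into a skipped field rather than an aborted run.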
I ran a few more tests after pulling the modifications you committed recently, and here is what I found: apparently, if I use a list of job titles, the scraping works fine for a single location (I am able to pull 1000+ job postings). When I try different job titles combined with different locations, an exception occurs pretty quickly and is not caught. My hypothesis is that the exception happens when the results for a given location run out, or something like that. Also, I noticed that LinkedIn changes the "show more results" behaviour after a few iterations (from scrolling to clicking). I am not sure whether this is handled in your code, as I did not have the time or the Puppeteer expertise to review it.
Anyway, your modifications significantly improved my experience with the module, and I will try to contribute a few pull requests later, once I understand the edge cases.
The current behaviour is to scroll down and wait for the "see more jobs" button to appear, click it, and wait until the count of job elements (<li>) increases. There is also a timeout to avoid waiting forever (no more jobs, or some other unexpected condition). I found this to work in the majority of cases, but of course it could be improved. Mind also that LinkedIn does not much like scraping of their website (even for public content): if you push concurrency too hard, it is very likely your IP will be blocked for some time and you will start experiencing page-load failures.
Anyway, PRs are more than welcome! 😎
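The click-and-wait step described above can be sketched as a small polling helper (a minimal sketch under stated assumptions: `getCount` stands in for a Puppeteer page query that counts the `<li>` job elements; the names and default values are illustrative, not the library's actual code):

```javascript
// Minimal sketch of the "wait until the job count increases" step.
// `getCount` is assumed to be an async function returning the current number
// of <li> job elements (e.g. a page.evaluate call in Puppeteer).
async function waitForCountIncrease(getCount, previousCount, timeoutMs = 5000, pollMs = 100) {
    const deadline = Date.now() + timeoutMs;
    while (Date.now() < deadline) {
        const current = await getCount();
        if (current > previousCount) {
            return current; // new jobs were appended to the list
        }
        await new Promise((resolve) => setTimeout(resolve, pollMs));
    }
    // Mirrors the timeout condition: either there are no more results
    // or something unexpected happened.
    throw new Error("Timeout on fetching more jobs");
}
```

The timeout is what keeps the loop from waiting forever when a location has no further results.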
It seems like this issue still persists in the newest version. Trying to scrape a larger number of jobs leads to the error occurring and no more jobs being scraped even though the program doesn't terminate. Are there ways to work around this?
Hi! Could you provide the code for the query causing the issue?
This is the simplest version that leads to the problem after about 40 jobs are found.
const { LinkedinScraper, events } = require("linkedin-jobs-scraper");

(async () => {
// Programmatically disable logger after 5 seconds
setTimeout(() => LinkedinScraper.disableLogger(), 5000);
const scraper = new LinkedinScraper({
headless: true,
slowMo: 10,
});
// Listen for custom events
scraper.on(events.custom.data, ({
query,
location,
link,
title,
company,
place,
date,
description,
senorityLevel,
jobFunction,
employmentType,
industries
}) => {
console.log(title)
});
scraper.on(events.custom.error, (err) => { console.error(err); });
scraper.on(events.custom.end, () => {
console.log('All done')
});
// Listen for puppeteer specific browser events
scraper.on(events.puppeteer.browser.targetcreated, () => { });
scraper.on(events.puppeteer.browser.targetchanged, () => { });
scraper.on(events.puppeteer.browser.targetdestroyed, () => { });
scraper.on(events.puppeteer.browser.disconnected, () => { });
// Run queries concurrently
await Promise.all([
scraper.run(
['Sales'],
['United Kingdom'],
{
paginationMax: 1000,
}
),
]);
// Close browser
await scraper.close();
})();
I think I've found the problem.
Could you try the latest version 1.6.0
and let me know?
The new update seems to have fixed it. Thank you.
Hi. A big thank you for creating and maintaining this scraper! I am using the latest version, yet it still leads to errors when there are many jobs.
Example variables I am using (with other parameters as in standard basic example):
var queries = ["Java", "Python", "C#"]
var locations = ["Germany"]
var pag = 10
I am still getting a lot of errors:
at __puppeteer_evaluation_script__:12:87
[Java][Germany][9] Evaluation failed: TypeError: Cannot read property 'getAttribute' of undefined
and then:
at __puppeteer_evaluation_script__:12:87
[Java][Germany][9] Timeout on fetching more jobs
I am getting 630 results of (expectedly) 750. (If I am not mistaken, 25 posts per page x 10 loads x 3 queries x 1 location)
Hi @artjoms-formulevics! I tried to run your query but I was unable to reproduce the issue: I got 749/750 jobs processed. There was one timeout on a job, but it is expected to get one or more timeouts during a long execution (the root cause can be many things, e.g. too much load on the server or an unstable connection).
This is the code I have used:
const { LinkedinScraper, events, } = require("linkedin-jobs-scraper");
(async () => {
// Each scraper instance is associated with one browser.
// Concurrent queries will be run on different pages within the same browser instance.
const scraper = new LinkedinScraper({
headless: true,
slowMo: 10,
});
// Listen for custom events
scraper.on(events.custom.data, ({ query, location, link, title, company, place, description, date }) => {
console.log(
description.length,
// `Query='${query}'`,
`Title='${title}'`,
`Location='${location}'`,
`Company='${company}'`,
`Place='${place}'`,
`Date='${date}'`,
`Link='${link}'`,
);
});
scraper.on(events.custom.error, (err) => { console.error(err); });
scraper.on(events.custom.end, () => { });
// Listen for puppeteer specific browser events
scraper.on(events.puppeteer.browser.targetcreated, () => { });
scraper.on(events.puppeteer.browser.targetchanged, () => { });
scraper.on(events.puppeteer.browser.targetdestroyed, () => { });
scraper.on(events.puppeteer.browser.disconnected, () => { });
// Run queries concurrently
await Promise.all([
scraper.run(
["Java", "Python", "C#"],
["Germany"],
{
paginationMax: 10,
}
),
]);
// Close browser
await scraper.close();
})();
You can also try to increase the value of slowMo: it will slow down browser operations by the specified number of milliseconds, making the interaction more human and less likely to make the server angry (but of course it will increase processing time). You can read more about it here.
const scraper = new LinkedinScraper({
headless: true,
slowMo: 20,
});
@spinlud,
Thanks for the prompt reply! You are right, it is probably something to do with the connection or the servers, not the script's fault anyway. And increasing the slowMo param absolutely helped. Thanks a lot, and sorry for the false accusation! ;)
First of all, thank you for the well-written code you brought together. Very useful and robust.
When running it from my computer for a few vacancies, I get an exception that interrupts the crawler, and I can't seem to find the reason for it.
The parameters I am running it with are:
The code tries to call getAttribute on an element that turns out to be undefined, and for some reason the exception is not captured by the software. Would you have any ideas on how to fix this behaviour?
Thank you
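In case it helps while debugging, the failing pattern can be reproduced and guarded like this (a hedged sketch: the selector and function name are illustrative, not the library's actual code; inside a Puppeteer page.evaluate the same null check applies):

```javascript
// Illustrative sketch: querySelector returns null when the element is
// missing, so calling getAttribute on it unguarded throws the
// "Cannot read property 'getAttribute' of undefined" error seen above.
// The selector "a.job-card__link" is hypothetical.
function readLink(doc) {
    const anchor = doc.querySelector("a.job-card__link");
    // guard the lookup instead of assuming the element exists
    return anchor ? anchor.getAttribute("href") : null;
}
```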