mwpenny / kijiji-scraper

A lightweight node.js module for retrieving and scraping ads from Kijiji
MIT License

Exception thrown if there is partner's ad between kijiji ads #18

Closed riojung closed 5 years ago

riojung commented 5 years ago

Hello,

I am having an issue scraping Kijiji job ads with kijiji-scraper. It works well with other types of ads, but if there are partner ads between the Kijiji ads, it throws an exception.

Here is the code I tried:

const kijiji = require("kijiji-scraper");

let options = {
    minResults: 40,
    maxResults: -1,
    keywords: "part time"
};

// https://www.kijiji.ca/b-part-time-student-jobs/calgary/c59l1700199
let params = {
    locationId: kijiji.locations.ALBERTA.CALGARY,
    categoryId: kijiji.categories.JOBS,
    sortByName: "dateAsc"
};

kijiji.search(params, options).then(function(ads) {
    // Use the ads array
    for (let i = 0; i < ads.length; ++i) {
        console.log('*Ad# ' + (i+1).toString());
        console.log(ads[i].toString());
        console.log('==========================================================================')
    }
}).catch(console.error);

And here is the error for the above code:

{ FetchError: request to https://www.kijiji.cahttps//www.ziprecruiter.com/clk/randstad-quebec-00000000-dual-ticket-millwright-electrician-7914_022695699?clk=J8r4wWhuP1tKESKoqRxo7xs8qGgqL7HPWR5y2wvYRChGoq_3bMoe83zQfitL_n7cts9PdPbKc2yKLleLEh4vRo1ggbhdJmtKbKYUKLpaKDlgUGLrCblfhk1UPvWJLKlw_h8rSEvPTTJIQ7VAMjpETPY_x0p-8H24nGUuGR7snq_GJcZ84Fhz6RiRxzbxItrlkeznWa4quL7aL9ZlB7myV2Q5DoUVtbf8JlTcTX-WWbSrfz1_d30bsPVwuPbhMY2_NS9aOq-4VodTlcWCF33VisQ1H8dRRDyryQSmZchrhxENiNpMHsmVtOrg8Ikb7-XutxioKDoyfiokIwirFnct9az8GOEAQwUCBe8X0VjXjX8GEB5RLcHVvSNJhzg7ZM4kj7FyRurgBUb6BZTCXzjmcUxU4EAw-Me_2eIpToOwO9GQRoMgZUcpFD8JQSxw2h9zKrpoT6I0r4AEA_43588plfsLwTcY453Rpyk5Is60pKkL_KKssmKDdrCYBhuzHz1naG4PfLrz-jRT-jXHBUAnrCV1czx5CsEWtoi3DuVybV4.9e5d32cbf3aaf685c73bb89c76eb0f30 failed, reason: getaddrinfo ENOTFOUND www.kijiji.cahttps www.kijiji.cahttps:443
    at ClientRequest.<anonymous> (/Users/wojung/partimer/node_modules/node-fetch/lib/index.js:1358:11)
    at emitOne (events.js:116:13)
    at ClientRequest.emit (events.js:211:7)
    at TLSSocket.socketErrorListener (_http_client.js:387:9)
    at emitOne (events.js:116:13)
    at TLSSocket.emit (events.js:211:7)
    at emitErrorNT (internal/streams/destroy.js:64:8)
    at _combinedTickCallback (internal/process/next_tick.js:138:11)
    at process._tickCallback (internal/process/next_tick.js:180:9)
  message: 'request to https://www.kijiji.cahttps//www.ziprecruiter.com/clk/randstad-quebec-00000000-dual-ticket-millwright-electrician-7914_022695699?clk=J8r4wWhuP1tKESKoqRxo7xs8qGgqL7HPWR5y2wvYRChGoq_3bMoe83zQfitL_n7cts9PdPbKc2yKLleLEh4vRo1ggbhdJmtKbKYUKLpaKDlgUGLrCblfhk1UPvWJLKlw_h8rSEvPTTJIQ7VAMjpETPY_x0p-8H24nGUuGR7snq_GJcZ84Fhz6RiRxzbxItrlkeznWa4quL7aL9ZlB7myV2Q5DoUVtbf8JlTcTX-WWbSrfz1_d30bsPVwuPbhMY2_NS9aOq-4VodTlcWCF33VisQ1H8dRRDyryQSmZchrhxENiNpMHsmVtOrg8Ikb7-XutxioKDoyfiokIwirFnct9az8GOEAQwUCBe8X0VjXjX8GEB5RLcHVvSNJhzg7ZM4kj7FyRurgBUb6BZTCXzjmcUxU4EAw-Me_2eIpToOwO9GQRoMgZUcpFD8JQSxw2h9zKrpoT6I0r4AEA_43588plfsLwTcY453Rpyk5Is60pKkL_KKssmKDdrCYBhuzHz1naG4PfLrz-jRT-jXHBUAnrCV1czx5CsEWtoi3DuVybV4.9e5d32cbf3aaf685c73bb89c76eb0f30 failed, reason: getaddrinfo ENOTFOUND www.kijiji.cahttps www.kijiji.cahttps:443',
  type: 'system',
  errno: 'ENOTFOUND',
  code: 'ENOTFOUND' }

riojung commented 5 years ago

Is there a way to skip partner ads when scraping, or to scrape a results page that contains partner ads without any issue?

mwpenny commented 5 years ago

Hm, interesting. That URL looks like two URLs concatenated together. I have never seen this error before. I'll take a look when I have some free time later this week. Thanks for the detailed description!

riojung commented 5 years ago

ok, thanks @mwpenny

mwpenny commented 5 years ago

The malformed URL occurs because the scraper expects URLs on the search results page to be relative to https://www.kijiji.ca (third-party ad URLs, of course, are not). The larger problem is that even if the external ad's HTML were fetched correctly, its markup would differ from a Kijiji ad's and the scraper would fail at that step.

Since third-party ad markup is unpredictable, these ads will not be supported by kijiji-scraper. They will be excluded from the results returned by search().
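
For anyone curious how the bad URL arises, here is a minimal sketch of the failure and of one way to filter third-party links. It is not the code from kijiji-scraper; the helper names (naiveAdUrl, resolveAdUrl, isKijijiAd) and the sample hrefs are made up for illustration, and it assumes ad links are plain href strings taken from the results page.

```javascript
// Illustrative only: helper names and sample hrefs are hypothetical,
// not taken from kijiji-scraper's source.
const KIJIJI_BASE = "https://www.kijiji.ca";

// Naive approach: assumes every href is relative to the Kijiji origin.
function naiveAdUrl(href) {
    return KIJIJI_BASE + href;
}

// WHATWG URL resolution: relative hrefs are resolved against the base,
// while absolute hrefs (such as third-party links) are kept as they are.
function resolveAdUrl(href) {
    return new URL(href, KIJIJI_BASE);
}

// Keep only links that resolve to the Kijiji host.
function isKijijiAd(href) {
    return resolveAdUrl(href).hostname === "www.kijiji.ca";
}

const hrefs = [
    "/v-part-time-student/calgary/example-ad/1234567890", // made-up Kijiji path
    "https://www.ziprecruiter.com/clk/example"            // made-up partner link
];

// Concatenating an absolute href yields a URL whose host is parsed as
// "www.kijiji.cahttps", which matches the ENOTFOUND host in the report.
const broken = naiveAdUrl(hrefs[1]);
console.log(broken);
console.log(new URL(broken).hostname);

// Only the relative (Kijiji) href survives the filter.
console.log(hrefs.filter(isKijijiAd));
```

Comparing hostnames after resolution, instead of checking whether the href starts with "http", also handles absolute links that point back to Kijiji itself.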

Please pull the master branch and try again.


Additionally, while looking into this I found that ad titles and post dates were not scraped properly when search() was used with scrapeResultDetails: false (Kijiji changed the markup of the search results page slightly). I have fixed that as well.

riojung commented 5 years ago

Awesome! Thanks @mwpenny. I will pull the latest code from master and re-test.

mwpenny commented 5 years ago

@riojung Has the issue been resolved for you? I'd like to close this.

riojung commented 5 years ago

@mwpenny Yes, this issue is fixed by your commit 3f71a9ab3f9b0b08eef417d932e0632c48e983bd. I think you can close this one. Thanks for the fix.

mwpenny commented 5 years ago

I have published the fixed version to NPM.