stevenvachon / broken-link-checker

Find broken links, missing images, etc within your HTML.
MIT License

Allow the user to prevent enqueueing a page in SiteChecker via a custom function #147

Closed. arthurvi closed this issue 5 years ago

arthurvi commented 5 years ago

Is your feature request related to a problem? Please describe.

Take, for example, the URL https://example.com/test.

I only want to crawl the pages in the test directory and check them for broken links.

So I do want to crawl:

I don't want to crawl:

Describe the solution you'd like

I want to be able to supply my own function that decides whether SiteChecker should enqueue a page for crawling.
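A minimal sketch of what such a decision function could look like for the test directory example. The name isUnderBase and the sample URLs in the usage note are hypothetical and not part of broken-link-checker; the sketch relies only on the standard URL class. It compares parsed paths rather than raw strings, so a page such as /testing is not mistaken for something inside /test.

```javascript
// Hypothetical helper (not part of broken-link-checker):
// returns true when pageUrl lies inside the directory named by baseUrl.
function isUnderBase(pageUrl, baseUrl) {
  const page = new URL(pageUrl);
  const base = new URL(baseUrl);

  // A different scheme, host, or port is never inside the base directory.
  if (page.origin !== base.origin) {
    return false;
  }

  // Append a trailing slash so "/test" and "/test/" are treated alike
  // and "/testing" does not match the "/test" prefix.
  const withSlash = (path) => (path.endsWith("/") ? path : path + "/");

  return withSlash(page.pathname).startsWith(withSlash(base.pathname));
}
```

With the base https://example.com/test, this accepts https://example.com/test/page and rejects https://example.com/testing and https://example.com/other. How the function is wired into SiteChecker depends on the hook that is eventually chosen.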

Describe alternatives you've considered

As far as I know, the excludes array is currently the only option offered for this.

Proposal

I have already implemented a solution that works for me. I'm willing to open a PR for it if there is demand, but I'm open to other proposals as well.

Working version here: https://github.com/arthurvi/broken-link-checker/blob/master/lib/public/SiteChecker.js#L106

I changed the link handler to:

link: function(result, customData) {
  const shouldEnqueueResult = maybeCallback(thisObj.handlers.link)(result, customData);
  if (shouldEnqueueResult !== false) {
    maybeEnqueuePage(thisObj, result, customData);
  }
},

Now the link handler can return false to keep SiteChecker from enqueueing a page; any other return value enqueues it as before:

var siteChecker = new blc.SiteChecker(options, {
    robots: function(robots, customData){},
    html: function(tree, robots, response, pageUrl, customData){},
    junk: function(result, customData){},
    link: function(result, customData){
        // Skip internal links that resolve outside the base URL.
        if (result.internal && !result.url.resolved.startsWith(customData.baseURL)) {
            return false;
        }

        return true;
    },
    page: function(error, pageUrl, customData){},
    site: function(error, siteUrl, customData){},
    end: function(){}
});

It works, but a separate handler might be better. Any thoughts?
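The gating behaviour of the modified link handler can be sketched in isolation. The snippet below is a toy stand-in, not broken-link-checker code: createToyCrawler, onLink, and the sample URLs are hypothetical, and only the `!== false` check mirrors the change shown earlier. It illustrates that a handler which returns nothing still enqueues every page, so existing handlers keep working.

```javascript
// Toy stand-in for the crawler queue; not broken-link-checker code.
function createToyCrawler(handlers) {
  const queue = [];

  return {
    queue,
    // Called once per discovered link, mirroring the modified link handler.
    onLink(result, customData) {
      const handler = typeof handlers.link === "function" ? handlers.link : () => {};
      const shouldEnqueue = handler(result, customData);

      // Only an explicit false blocks the page; undefined still enqueues.
      if (shouldEnqueue !== false) {
        queue.push(result.url.resolved);
      }
    }
  };
}

const customData = { baseURL: "https://example.com/test" };

// Handler that filters by base URL.
const filtering = createToyCrawler({
  link(result, data) {
    return result.url.resolved.startsWith(data.baseURL);
  }
});

// Pre-existing style of handler that returns nothing.
const legacy = createToyCrawler({ link() {} });

for (const crawler of [filtering, legacy]) {
  crawler.onLink({ url: { resolved: "https://example.com/test/a" } }, customData);
  crawler.onLink({ url: { resolved: "https://example.com/other" } }, customData);
}

console.log(filtering.queue); // [ 'https://example.com/test/a' ]
console.log(legacy.queue);    // [ 'https://example.com/test/a', 'https://example.com/other' ]
```

One trade-off of this design is that the link handler takes on two roles, reporting results and steering the crawl, which is the concern a separate handler or option would address.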

stevenvachon commented 5 years ago

Added to master branch (unreleased v0.8) as the includePage option.