stevenvachon / broken-link-checker

MIT License

broken-link-checker

Find broken links, missing images, etc within your HTML.

Other features:

Installation

Node.js >= 14 is required. There are two ways to use it:

Command Line Usage

To install, type this at the command line:

npm install broken-link-checker -g

After that, check out the help for available options:

blc --help

A typical site-wide check might look like:

blc http://yoursite.com -ro
# or
blc path/to/index.html -ro

Note: HTTP proxies are not directly supported. If your network is configured incorrectly with no resolution in sight, you could try using a container with proxy settings.

Programmatic API

To install, type this at the command line:

npm install broken-link-checker

The remainder of this document will assist you in using the API.

Classes

While all classes have been exposed for custom use, the one that you need will most likely be SiteChecker.

HtmlChecker

Scans an HTML document to find broken links. All methods from EventEmitter are available.

const {HtmlChecker} = require('broken-link-checker');

const htmlChecker = new HtmlChecker(options)
  .on('error', (error) => {})
  .on('html', (tree, robots) => {})
  .on('queue', () => {})
  .on('junk', (result) => {})
  .on('link', (result) => {})
  .on('complete', () => {});

htmlChecker.scan(html, baseURL);
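As a rough sketch of how these events might be consumed, the helper below gathers broken links from any emitter that fires 'link' followed by a completion event. `collectBroken` is a hypothetical name and not part of the library; a plain EventEmitter and Map objects stand in for a real checker and its results so that the snippet runs on its own.

```javascript
const {EventEmitter} = require('events');

// Hypothetical helper (not part of broken-link-checker): gathers every
// 'link' result whose isBroken value is truthy until `doneEvent` fires.
function collectBroken(checker, doneEvent, callback) {
  const broken = [];
  checker.on('link', (result) => {
    if (result.get('isBroken')) {
      broken.push(result);
    }
  });
  checker.on(doneEvent, () => callback(broken));
  return checker;
}

// Stand-ins for illustration only: a bare emitter instead of an HtmlChecker,
// and Map objects instead of real link results.
const fakeChecker = new EventEmitter();
let brokenReasons = null;

collectBroken(fakeChecker, 'complete', (broken) => {
  brokenReasons = broken.map((result) => result.get('brokenReason'));
});

fakeChecker.emit('link', new Map([['isBroken', false]]));
fakeChecker.emit('link', new Map([['isBroken', true], ['brokenReason', 'HTTP_404']]));
fakeChecker.emit('complete');

console.log(brokenReasons);
```

With a real instance, the same helper could presumably wrap `new HtmlChecker(options)` with 'complete' as the completion event before calling `scan(html, baseURL)`.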

Methods & Properties

Events

HtmlUrlChecker

Scans the HTML content at each queued URL to find broken links. All methods from EventEmitter are available.

const {HtmlUrlChecker} = require('broken-link-checker');

const htmlUrlChecker = new HtmlUrlChecker(options)
  .on('error', (error) => {})
  .on('html', (tree, robots, response, pageURL, customData) => {})
  .on('queue', () => {})
  .on('junk', (result, customData) => {})
  .on('link', (result, customData) => {})
  .on('page', (error, pageURL, customData) => {})
  .on('end', () => {});

htmlUrlChecker.enqueue(pageURL, customData);
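One possible use of customData is to tag each queued page so that link results can be grouped afterwards. The sketch below assumes customData is an object with an `id` property, which is a shape chosen purely for illustration; `tallyByPage` is a hypothetical helper, and a plain EventEmitter with Map objects stands in for a real checker and its results.

```javascript
const {EventEmitter} = require('events');

// Hypothetical helper: counts checked and broken links per customData.id.
function tallyByPage(checker) {
  const tallies = new Map();
  checker.on('link', (result, customData) => {
    const tally = tallies.get(customData.id) || {checked: 0, broken: 0};
    tally.checked += 1;
    if (result.get('isBroken')) {
      tally.broken += 1;
    }
    tallies.set(customData.id, tally);
  });
  return tallies;
}

// Stand-in emitter and results, used here instead of a real HtmlUrlChecker.
const fakeChecker = new EventEmitter();
const tallies = tallyByPage(fakeChecker);

fakeChecker.emit('link', new Map([['isBroken', true]]), {id: 'home'});
fakeChecker.emit('link', new Map([['isBroken', false]]), {id: 'home'});
fakeChecker.emit('link', new Map([['isBroken', false]]), {id: 'docs'});

console.log(tallies.get('home')); // { checked: 2, broken: 1 }
```

With a real checker, the matching call would be along the lines of `htmlUrlChecker.enqueue(pageURL, {id: 'home'})`.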

Methods & Properties

Events

SiteChecker

Recursively scans (crawls) the HTML content at each queued URL to find broken links. All methods from EventEmitter are available.

const {SiteChecker} = require('broken-link-checker');

const siteChecker = new SiteChecker(options)
  .on('error', (error) => {})
  .on('robots', (robots, customData) => {})
  .on('html', (tree, robots, response, pageURL, customData) => {})
  .on('queue', () => {})
  .on('junk', (result, customData) => {})
  .on('link', (result, customData) => {})
  .on('page', (error, pageURL, customData) => {})
  .on('site', (error, siteURL, customData) => {})
  .on('end', () => {});

siteChecker.enqueue(siteURL, customData);

Methods & Properties

Events

Note: the filterLevel option is used for determining which links are recursive.

UrlChecker

Requests each queued URL to determine whether it is broken. All methods from EventEmitter are available.

const {UrlChecker} = require('broken-link-checker');

const urlChecker = new UrlChecker(options)
  .on('error', (error) => {})
  .on('queue', () => {})
  .on('link', (result, customData) => {})
  .on('end', () => {});

urlChecker.enqueue(url, customData);

Methods & Properties

Events

Options

cacheMaxAge

Type: Number
Default value: 3_600_000 (1 hour)
The number of milliseconds in which a cached response should be considered valid. This is only relevant if the cacheResponses option is enabled.

cacheResponses

Type: Boolean
Default value: true
URL request results will be cached when true. This will ensure that each unique URL will only be checked once.
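Because cacheMaxAge is expressed in milliseconds, it can be easier to read when built from named units. A minimal options sketch follows; the 15-minute lifetime is an arbitrary example value, not a recommendation.

```javascript
// Illustrative options object; the 15-minute figure is arbitrary.
const MINUTE = 60 * 1000;

const options = {
  cacheResponses: true,     // check each unique URL only once
  cacheMaxAge: 15 * MINUTE  // treat cached results as valid for 15 minutes
};

console.log(options.cacheMaxAge); // 900000
```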

excludedKeywords

Type: Array<String>
Default value: []
Will not check links that match the keywords and glob patterns within this list. The only wildcards supported are * and !.

This option does not apply to UrlChecker.

excludeExternalLinks

Type: Boolean
Default value: false
Will not check external links (different protocol and/or host) when true. This includes relative links that resolve against a remote <base href>.

This option does not apply to UrlChecker.

excludeInternalLinks

Type: Boolean
Default value: false
Will not check internal links (same protocol and host) when true.

This option does not apply to UrlChecker or to SiteChecker's crawler.
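The internal/external distinction used by these two options (same protocol and host versus a different protocol and/or host) can be approximated with the WHATWG URL class. This is a sketch of the definition only, not the library's own logic: `isInternalLink` is a hypothetical name, and the effect of a remote <base href> is not modeled.

```javascript
// Hypothetical helper illustrating "same protocol and host".
// Relative links are resolved against the page URL before comparing.
function isInternalLink(linkURL, pageURL) {
  const link = new URL(linkURL, pageURL);
  const page = new URL(pageURL);
  return link.protocol === page.protocol && link.host === page.host;
}

console.log(isInternalLink('/about', 'https://example.com/'));                    // true
console.log(isInternalLink('http://example.com/', 'https://example.com/'));       // false
console.log(isInternalLink('https://cdn.example.com/x', 'https://example.com/')); // false
```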

excludeLinksToSamePage

Type: Boolean
Default value: false
Will not check links to the same page; relative and absolute fragments/hashes included. This is only relevant if the cacheResponses option is disabled.

This option does not apply to UrlChecker.

filterLevel

Type: Number
Default value: 1
The tags and attributes that are considered links for checking, split into the following levels:

Recursive links have a slightly different filter subset. To see the exact breakdown of both, check out the tag map. <base href> is not listed because it is not a link, though it is always parsed.

This option does not apply to UrlChecker.

honorRobotExclusions

Type: Boolean
Default value: true
Will not scan pages that search engine crawlers would not follow. Such will have been specified with any of the following:

This option does not apply to UrlChecker.

includedKeywords

Type: Array<String>
Default value: []
Will only check links that match the keywords and glob patterns within this list, if any. The only wildcard supported is *.

This option does not apply to UrlChecker.

includeLink

Type: Function
Default value: link => true
A synchronous callback that is called after all other filters have been performed. Return true to include link (a Link) in the list of links to be checked, or return false to have it skipped.

This option does not apply to UrlChecker.

includePage

Type: Function
Default value: url => true
A synchronous callback that is called after all other filters have been performed. Return true to include url (a URL) in the list of pages to be crawled, or return false to have it skipped.

This option does not apply to UrlChecker or HtmlUrlChecker.
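Since includePage receives a URL, its standard properties such as `pathname` are available to the callback. The predicate below is only an example: the '/archive/' prefix and '.pdf' suffix are arbitrary choices made for illustration.

```javascript
// Example predicate for the includePage option: skip an archive section
// and anything whose path ends in .pdf.
const includePage = (url) =>
  !url.pathname.startsWith('/archive/') && !url.pathname.endsWith('.pdf');

const options = {includePage};

console.log(options.includePage(new URL('https://example.com/blog/post')));     // true
console.log(options.includePage(new URL('https://example.com/archive/2019/'))); // false
```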

maxSockets

Type: Number
Default value: Infinity
The maximum number of links to check at any given time.

maxSocketsPerHost

Type: Number
Default value: 2
The maximum number of links per host/port to check at any given time. This avoids overloading a single target host with too many concurrent requests. This will not limit concurrent requests to other hosts.

rateLimit

Type: Number
Default value: 0
The number of milliseconds to wait before each request.
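The three throttling options above can be combined when checking a site that should not be hit hard. The values in this sketch are arbitrary examples, not recommended settings.

```javascript
// Illustrative "polite" configuration; all three values are arbitrary.
const options = {
  maxSockets: 10,        // at most 10 links checked at any given time
  maxSocketsPerHost: 1,  // at most 1 concurrent request per host/port
  rateLimit: 250         // wait 250 ms before each request
};

console.log(options);
```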

requestMethod

Type: String
Default value: 'head'
The HTTP request method used in checking links. If you experience problems, try using 'get'; however, the retryHeadFail option should have you covered.

retryHeadCodes

Type: Array<Number>
Default value: [405]
The list of HTTP status codes for the retryHeadFail option to reference.

retryHeadFail

Type: Boolean
Default value: true
Some servers do not respond correctly to a 'head' request method. When true, a link resulting in an HTTP status code listed within the retryHeadCodes option will be re-requested using a 'get' method before deciding that it is broken. This is only relevant if the requestMethod option is set to 'head'.
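The interaction between requestMethod, retryHeadFail and retryHeadCodes can be summarized as a single predicate. The function below is a sketch of the behavior as described here, using the documented defaults; it is not the library's actual code, and `shouldRetryWithGet` is a hypothetical name.

```javascript
// Sketch of the documented retry rule, NOT the library's implementation.
// Missing options fall back to the defaults listed in this document.
function shouldRetryWithGet(statusCode, options = {}) {
  const {
    requestMethod = 'head',
    retryHeadFail = true,
    retryHeadCodes = [405]
  } = options;

  return requestMethod === 'head' &&
    retryHeadFail &&
    retryHeadCodes.includes(statusCode);
}

console.log(shouldRetryWithGet(405));                                 // true
console.log(shouldRetryWithGet(404));                                 // false
console.log(shouldRetryWithGet(405, {requestMethod: 'get'}));         // false
console.log(shouldRetryWithGet(403, {retryHeadCodes: [403, 405]}));   // true
```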

userAgent

Type: String
Default value: 'broken-link-checker/0.8.0 Node.js/14.16.0 (OS X; x64)' (or similar)
The HTTP user-agent to use when checking links as well as retrieving pages and robot exclusions.

Handling Broken/Excluded Links

A broken link will have an isBroken value of true and a reason code defined in brokenReason. A link that was not checked (emitted as 'junk') will have a wasExcluded value of true, a reason code defined in excludedReason, and an isBroken value of null.

if (link.get('isBroken')) {
  console.log(link.get('brokenReason'));
  //-> HTTP_406
} else if (link.get('wasExcluded')) {
  console.log(link.get('excludedReason'));
  //-> BLC_ROBOTS
}

Additionally, more descriptive messages are available for each reason code:

const {reasons} = require('broken-link-checker');

console.log(reasons.BLC_ROBOTS);       //-> Robots exclusion
console.log(reasons.ERRNO_ECONNRESET); //-> connection reset by peer (ECONNRESET)
console.log(reasons.HTTP_404);         //-> Not Found (404)

// List all
console.log(reasons);

Putting it all together:

if (link.get('isBroken')) {
  console.log(reasons[link.get('brokenReason')]);
} else if (link.get('wasExcluded')) {
  console.log(reasons[link.get('excludedReason')]);
}

Finally, it is important to analyze links excluded with the BLC_UNSUPPORTED reason as it's possible for them to be broken.
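One way to surface such links is a small predicate applied in a 'junk' handler. `needsManualReview` is a hypothetical helper, and the Map objects below merely stand in for real link results, which expose the same `get()` accessor used in the examples above.

```javascript
// Hypothetical helper: flags excluded links whose reason is BLC_UNSUPPORTED,
// since they were never checked and may still be broken.
function needsManualReview(link) {
  return link.get('wasExcluded') === true &&
    link.get('excludedReason') === 'BLC_UNSUPPORTED';
}

// Map-based stand-ins for link results.
const unsupported = new Map([['wasExcluded', true], ['excludedReason', 'BLC_UNSUPPORTED']]);
const robots = new Map([['wasExcluded', true], ['excludedReason', 'BLC_ROBOTS']]);

console.log(needsManualReview(unsupported)); // true
console.log(needsManualReview(robots));      // false
```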

Roadmap Features