stevenvachon / broken-link-checker

Find broken links, missing images, etc within your HTML.
MIT License

Unable To Link Check Dockerized WordPress Site #40

Open bensternthal opened 8 years ago

bensternthal commented 8 years ago

I think this is an edge case, but since it happened to me, I would like to note it here. I'll try to dive in and, who knows, maybe submit a PR.

Steps To Reproduce

Run blc http://devpatch.com:3000 --filter-level 3 -ro

More Info

I am running a Dockerized version of a WordPress site. When testing both locally and against the dev instance hosted on devpatch, the broken link checker never fetches or checks a page. Looking at the logs, I see the request from BLC, but that is it. Below is a screenshot: the log is on the left, the console output is on the right.

I verified that I could run BLC against a static, non-Docker site hosted locally on the same port without issue.

[Screenshot: screen shot 2016-07-15 at 11 19 25 am]

bensternthal commented 8 years ago

I was able to get this to run by commenting out the following line:

link.html.location = node.__location.attrs[attrName];

https://github.com/stevenvachon/broken-link-checker/blob/master/lib/internal/scrapeHtml.js#L34

I did not find references to this attribute in the code, so I am not sure what it is used for. I am also unsure why this would cause an issue. If I leave this line in, return links; is never reached.

Still diagnosing what could be causing this.

bensternthal commented 8 years ago

Sooo, the above led me to find a mismatched <a href> tag in my HTML. Fixing that fixed this.

This might be a scenario where node.__location.attrs is undefined, so indexing it with attrName throws an error.

Let me know what you think.
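
To make the suspected failure mode concrete, here is a minimal sketch of a defensive lookup. It assumes the crash comes from indexing node.__location.attrs when either __location or its attrs map is missing, which this thread has not confirmed. The helper name getAttrLocation is hypothetical and not part of broken-link-checker, and the plain objects below only stand in for parse5 nodes.

```javascript
// Hypothetical helper, not part of broken-link-checker or parse5.
// Returns the location record for an attribute, or null when the
// node carries no location info for it.
function getAttrLocation(node, attrName) {
  var location = node.__location;

  // Assumption: some elements may have no __location at all,
  // or a __location without an attrs map.
  if (location === undefined || location === null || !location.attrs) {
    return null;
  }

  var attrLocation = location.attrs[attrName];
  return attrLocation === undefined ? null : attrLocation;
}

// Plain objects standing in for parsed nodes (illustrative data only).
var withLocation = { __location: { attrs: { href: { line: 1, col: 4 } } } };
var withoutAttrs = { __location: { line: 1, col: 1 } };
var withoutLocation = {};

console.log(getAttrLocation(withLocation, "href"));
console.log(getAttrLocation(withoutAttrs, "href"));
console.log(getAttrLocation(withoutLocation, "href"));
```

With a guard like this, the assignment in scrapeHtml.js could fall back to null instead of throwing. Whether null is an acceptable value for link.html.location is a design decision for the maintainer.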

stevenvachon commented 8 years ago

Can you provide the HTML that caused the issue?

bensternthal commented 8 years ago

Here is an example snippet that will reproduce the error:

https://gist.github.com/bensternthal/e186520f239909b0ba52e861d01bfaca

The </a> on line 11 is causing the issue.

stevenvachon commented 8 years ago

var result = require("parse5").parse("<a href=test><div>text</a></div>", {locationInfo:true});
// document > html > body > first child of body (the <a> element)
console.log(result.childNodes[0].childNodes[1].childNodes[0]);

produces:

{ nodeName: 'a',
  tagName: 'a',
  attrs: [ { name: 'href', value: 'test' } ],
  namespaceURI: 'http://www.w3.org/1999/xhtml',
  childNodes: [],
  parentNode: {…},
  __location: 
   { line: 1,
     col: 1,
     startOffset: 0,
     endOffset: 26,
     attrs: { href: [Object] },
     startTag: { line: 1, col: 1, startOffset: 0, endOffset: 13, attrs: [Object] },
     endTag: { line: 1, col: 23, startOffset: 22, endOffset: 26 } } }

So the problem must not be the HTML parser, since __location.attrs is defined here. I'll look into this more deeply when I find some time. Thank you for the snippet.

bensternthal commented 8 years ago

No prob, glad I can help. The module is very handy, many thanks for creating & maintaining it.