linkchecker / linkchecker

check links in web documents or full websites
https://linkchecker.github.io/linkchecker/
GNU General Public License v2.0

linkchecker 9.4 reports different numbers of links and errors to 9.3 #434

Open mdykierek opened 4 years ago

mdykierek commented 4 years ago

9.4.0 does not find all problems and scans fewer items

linkchecker -r -1 --output=HTML rls_notes.htm > linkchecker_ug.html

Actual result

That's it. 1903 links checked. 0 warnings found. 1 error found.

Expected result

That's it. 22858 links checked. 0 warnings found. 4 errors found.

anarcat commented 4 years ago

we can't reproduce this without rls_notes.htm

mdykierek commented 4 years ago

An archive is attached: ug.tar.gz

mgedmin commented 4 years ago

I can sort-of reproduce. In an ubuntu:bionic container with linkchecker 9.3 installed via apt-get, I get

That's it. 21406 links in 1886 URLs checked. 0 warnings found. 4 errors found.

On my primary system (ubuntu:focal) with git master I get

That's it. 1903 links in 1903 URLs checked. 0 warnings found. 1 error found.

The exact command I used was linkchecker -r -1 rls_notes.htm.

The 4 errors are about the same broken link (pfv_overview.htm) that is referenced from four local files.

mgedmin commented 4 years ago

Perhaps a shorter example is to run linkchecker -r 1 -v rls_notes.htm.

Note how the new linkchecker always reports the same number of links as URLs?

Downloaded: 29.84KB. Content types: 3 image, 3 text, 0 video, 0 audio, 3 application, 0 mail and 2 other. URL lengths: min=20, max=57, avg=36.

That's it. 11 links in 11 URLs checked. 0 warnings found. 0 errors found.

while 9.3 reports

Downloaded: 29.84KB. Content types: 3 image, 7 text, 0 video, 0 audio, 3 application, 0 mail and 2 other. URL lengths: min=20, max=51, avg=31.

That's it. 15 links in 9 URLs checked. 0 warnings found. 0 errors found.

The -v shows us what those extra links are:

URL        `#GUI_Preferences'
URL        `#Macro_Commands'
URL        `#Macro_Commands'
URL        `#Design_Profiling'

which are just different anchors in the same page!
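To illustrate how four links can collapse into a single URL, here is a minimal sketch using only the Python standard library. It is not linkchecker's code; the base URL is a hypothetical placeholder and `unique_urls` is a name made up for this example.

```python
from urllib.parse import urldefrag, urljoin

# Hypothetical base URL for the page being checked (an assumption).
BASE = "http://localhost/rls_notes.htm"

# The four extra links that 9.3 listed with -v.
hrefs = ["#GUI_Preferences", "#Macro_Commands", "#Macro_Commands", "#Design_Profiling"]


def unique_urls(base, hrefs):
    """Resolve each href against base, strip the fragment, and deduplicate."""
    return {urldefrag(urljoin(base, href)).url for href in hrefs}


urls = unique_urls(BASE, hrefs)
print(len(hrefs), "links,", len(urls), "unique URL(s)")  # 4 links, 1 unique URL(s)
```

If a checker deduplicates on the fragment-less URL, all four of these links point at a page it has already seen.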

This raises a question: did old linkchecker actually verify that the anchors existed on the page? Does the new one not do that?

I cannot figure out what "9 URLs" refers to in the old linkchecker output. Maybe the number of documents it actually fetched to check? Which would exclude external URLs.

Why does new linkchecker report a different number of "URLs"?

anarcat commented 4 years ago

> This raises a question: did old linkchecker actually verify that the anchors existed on the page? Does the new one not do that?

I seem to recall we had some breakage around that at some point. I think we did check anchors, and I think it's reasonable to do so. Now we might not be doing that since the BS4 change? Just a guess...

Seems like a valid regression to investigate.

cjmayo commented 4 years ago

This seems to be a question of what changed between 9.3 and 9.4. Scanning the commits, the likely candidate is eaa538c8 ("don't check one url multiple times", 2016-11-09).

Hosting the test page below as fragment.html and running with -r1, every version reports 0 errors found, but the link counts differ:

9.3: 7 links in 1 URL checked
9.4: 1 link in 1 URL checked
git: 1 link in 1 URL checked

I've created #459 to improve the debug logging.

However, the change does prevent the AnchorCheck plugin (released in 9.0) from working. N.B. the plugin reports warnings, not errors. That said, the plugin doesn't seem to work in 9.3 anyway. Raised #460.

<html>
<body>
<a href="fragment.html">a</a>
<a href="fragment.html#one">a1</a>
<a href="fragment.html#two">a2</a>
<a href="fragment.html#three">a3</a>
<a href="fragment.html#four">a4</a>
<a href="fragment.html#five">a5</a>
</body>
</html>

cjmayo commented 4 years ago

I'm wondering if eaa538c was a workaround for problems (threading?) with the existing code:

https://github.com/linkchecker/linkchecker/blob/a977e4d7129450ba9fda8389724c80c1bde66883/linkcheck/cache/urlqueue.py#L126-L129

cjmayo commented 2 years ago

The AnchorCheck plugin has been re-enabled with fixes thanks to work from Nathan Arthur.

For fragment.html above with AnchorCheck now I get: 7 links in 7 URLs checked. 5 warnings found. 0 errors found.
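As a rough illustration of what an anchor check does, here is a standard-library sketch that collects `id`/`name` targets and compares them with the fragments requested by `href`. It is not the AnchorCheck plugin's implementation; `AnchorCollector` and `missing_anchors` are hypothetical names, and the sketch assumes every link points back at the same page, as in fragment.html.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag

# The fragment.html test page from earlier in this thread.
PAGE = """<html>
<body>
<a href="fragment.html">a</a>
<a href="fragment.html#one">a1</a>
<a href="fragment.html#two">a2</a>
<a href="fragment.html#three">a3</a>
<a href="fragment.html#four">a4</a>
<a href="fragment.html#five">a5</a>
</body>
</html>"""


class AnchorCollector(HTMLParser):
    """Collect anchor targets (id/name) and the fragments used in hrefs."""

    def __init__(self):
        super().__init__()
        self.targets = set()
        self.fragments = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        for key in ("id", "name"):
            if attrs.get(key):
                self.targets.add(attrs[key])
        href = attrs.get("href")
        if tag == "a" and href:
            fragment = urldefrag(href).fragment
            if fragment:
                self.fragments.append(fragment)


def missing_anchors(html):
    """Return requested fragments that have no matching id or name."""
    collector = AnchorCollector()
    collector.feed(html)
    return [f for f in collector.fragments if f not in collector.targets]


print(missing_anchors(PAGE))  # ['one', 'two', 'three', 'four', 'five']
```

The test page defines no `id` or `name` attributes, so all five fragments come back as missing, which would be consistent with the 5 warnings reported above.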

Comparing back to v9 with Python 2 isn't possible any more.

If any problems are found please report them as new issues.

cjmayo commented 2 years ago

Alas, although AnchorCheck is fixed, I have confirmed that https://github.com/linkchecker/linkchecker/commit/eaa538c814f31ad86a84843cb1e7777c66370c2b has led to only the first instance of a broken link being reported. More on #663.
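The effect can be modelled with a toy example. This is only a sketch of the general idea (skip any URL that was already queued), not the actual urlqueue code; `report_broken` and the parent page names are made up, while pfv_overview.htm is the broken link mentioned earlier in this thread.

```python
def report_broken(links, broken, dedupe):
    """Return the (parent, url) pairs that would be reported as errors.

    links:  (parent, url) pairs in discovery order
    broken: set of URLs that fail when fetched
    dedupe: when True, skip any URL that was already queued once
    """
    seen = set()
    errors = []
    for parent, url in links:
        if dedupe:
            if url in seen:
                continue
            seen.add(url)
        if url in broken:
            errors.append((parent, url))
    return errors


# Four hypothetical local files that all reference the same broken link.
links = [(f"page{i}.htm", "pfv_overview.htm") for i in range(1, 5)]
broken = {"pfv_overview.htm"}

print(len(report_broken(links, broken, dedupe=False)))  # 4
print(len(report_broken(links, broken, dedupe=True)))   # 1
```

Without deduplication each referencing page gets its own error; with it, only the first parent is ever reported, matching the 4 errors versus 1 error seen in the original report.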