Open mdykierek opened 4 years ago
We can't reproduce this without rls_notes.htm.
I can sort of reproduce this. In an ubuntu:bionic container with linkchecker 9.3 installed via apt-get, I get:
That's it. 21406 links in 1886 URLs checked. 0 warnings found. 4 errors found.
On my primary system (ubuntu:focal) with git master I get:
That's it. 1903 links in 1903 URLs checked. 0 warnings found. 1 error found.
The exact command I used was: linkchecker -r -1 rls_notes.htm
The 4 errors are about the same broken link (pfv_overview.htm) that is referenced from four local files.
Perhaps a shorter example is to run: linkchecker -r 1 -v rls_notes.html
Note how the new linkchecker always reports the same number of links as URLs?
Downloaded: 29.84KB. Content types: 3 image, 3 text, 0 video, 0 audio, 3 application, 0 mail and 2 other. URL lengths: min=20, max=57, avg=36.
That's it. 11 links in 11 URLs checked. 0 warnings found. 0 errors found.
while 9.3 reports
Downloaded: 29.84KB. Content types: 3 image, 7 text, 0 video, 0 audio, 3 application, 0 mail and 2 other. URL lengths: min=20, max=51, avg=31.
That's it. 15 links in 9 URLs checked. 0 warnings found. 0 errors found.
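To compare the two summaries without reading them by eye, here is a minimal Python sketch that parses the "That's it." line into numbers. The regular expression is only inferred from the output lines quoted in this thread, and parse_summary is a hypothetical helper name, not part of linkchecker.

```python
import re

# Pattern inferred from the summary lines quoted in this thread (an assumption,
# not taken from linkchecker's source). The "in N URLs" part is optional because
# some quoted outputs omit it.
SUMMARY_RE = re.compile(
    r"That's it\. (?P<links>\d+) links?"
    r"(?: in (?P<urls>\d+) URLs?)? checked\. "
    r"(?P<warnings>\d+) warnings? found\. "
    r"(?P<errors>\d+) errors? found\."
)


def parse_summary(line):
    """Return the counts in a summary line as a dict of ints, or None if it does not match."""
    match = SUMMARY_RE.search(line)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items() if value is not None}


new = parse_summary("That's it. 11 links in 11 URLs checked. 0 warnings found. 0 errors found.")
old = parse_summary("That's it. 15 links in 9 URLs checked. 0 warnings found. 0 errors found.")
print(new)
print(old)
print("links not reported by the new version:", old["links"] - new["links"])
```

For the two runs above, the difference in the link count is 4.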
The -v shows us what those extra links are:
URL `#GUI_Preferences'
URL `#Macro_Commands'
URL `#Macro_Commands'
URL `#Design_Profiling'
which are just different anchors in the same page!
This raises a question: did old linkchecker actually verify that the anchors existed on the page? Does the new one not do that?
I cannot figure out what "9 URLs" refers to in the old linkchecker output. Maybe the number of documents it actually fetched to check? Which would exclude external URLs.
Why does new linkchecker report a different number of "URLs"?
This raises a question: did old linkchecker actually verify that the anchors existed on the page? Does the new one not do that?
I seem to recall we had some breakage around that at some point. I think we did check anchors, and I think it's reasonable to do so. Now we might not be doing that since the BS4 change? Just a guess...
Seems like a valid regression to investigate.
This seems to be a question of what changed between 9.3 and 9.4. Scanning the commits: eaa538c8 ("don't check one url multiple times", 2016-11-09)
Hosting the test page below as fragment.html and using -r1, all three versions report 0 errors found:
9.3: 7 links in 1 URL checked
9.4: 1 link in 1 URL checked
git: 1 link in 1 URL checked
I've created #459 to improve the debug logging.
However, the change does prevent the AnchorCheck plugin (released in 9.0) from working. N.B. the plugin reports warnings, not errors. That said, the plugin doesn't seem to be working in 9.3 anyway. Raised #460.
<html>
<body>
<a href="fragment.html">a</a>
<a href="fragment.html#one">a1</a>
<a href="fragment.html#two">a2</a>
<a href="fragment.html#three">a3</a>
<a href="fragment.html#four">a4</a>
<a href="fragment.html#five">a5</a>
</body>
</html>
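One possible reading of the 9.3 figure for this page (7 links in 1 URL) is that every href counts as a link while the URL count ignores the fragment. The sketch below only illustrates that arithmetic; it is not linkchecker's code, and the BASE address is a hypothetical location for the hosted page.

```python
from urllib.parse import urldefrag, urljoin

# Hypothetical address for the hosted test page.
BASE = "http://localhost/fragment.html"

# The six href values found in fragment.html.
hrefs = [
    "fragment.html",
    "fragment.html#one",
    "fragment.html#two",
    "fragment.html#three",
    "fragment.html#four",
    "fragment.html#five",
]

# The start URL itself plus each link it contains, resolved against the base.
links = [BASE] + [urljoin(BASE, href) for href in hrefs]

# Dropping the fragment leaves the set of distinct documents.
documents = {urldefrag(link).url for link in links}

print(len(links), "links in", len(documents), "URL(s)")
```

Under that assumption the counts come out as 7 links and 1 URL, which matches the 9.3 line but does not prove that this is how 9.3 counted.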
I'm wondering if eaa538c was a workaround for problems (threading?) with the existing code:
The AnchorCheck plugin has been re-enabled with fixes thanks to work from Nathan Arthur.
For fragment.html above with AnchorCheck now I get:
7 links in 7 URLs checked. 5 warnings found. 0 errors found.
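The 5 warnings are consistent with the test page defining none of the anchors it links to. As a rough illustration of what an anchor check involves (this is a standalone sketch using the standard library, not the AnchorCheck plugin's implementation), the following collects anchor targets and compares them with the fragments in each href. It assumes every link points back at the page itself, which holds for fragment.html; AnchorCollector is a hypothetical name.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag

PAGE = """<html>
<body>
<a href="fragment.html">a</a>
<a href="fragment.html#one">a1</a>
<a href="fragment.html#two">a2</a>
<a href="fragment.html#three">a3</a>
<a href="fragment.html#four">a4</a>
<a href="fragment.html#five">a5</a>
</body>
</html>"""


class AnchorCollector(HTMLParser):
    """Collect href values and the id/name values that can serve as anchor targets."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.targets = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.hrefs.append(attributes["href"])
        # Any element's id is a target; name is only honoured on <a> here.
        if attributes.get("id"):
            self.targets.add(attributes["id"])
        if tag == "a" and attributes.get("name"):
            self.targets.add(attributes["name"])


collector = AnchorCollector()
collector.feed(PAGE)

# Links whose fragment is non-empty and not defined anywhere on the page.
missing = [
    href
    for href in collector.hrefs
    if urldefrag(href).fragment and urldefrag(href).fragment not in collector.targets
]
print(len(missing), "links point at anchors that the page does not define")
```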
Comparing back to v9 with Python 2 isn't possible any more.
If any problems are found please report them as new issues.
Alas, although AnchorCheck is fixed, I have confirmed that https://github.com/linkchecker/linkchecker/commit/eaa538c814f31ad86a84843cb1e7777c66370c2b has led to only the first instance of a broken link being reported. More in #663.
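To make the effect concrete, here is a small sketch of how checking each URL only once can turn four reported errors into one. It is not the code from that commit; the parent page names are hypothetical (the thread only says that four local files reference the broken pfv_overview.htm), and both functions are illustrative.

```python
# Hypothetical parent pages; only pfv_overview.htm comes from the thread.
references = [
    ("page_a.htm", "pfv_overview.htm"),
    ("page_b.htm", "pfv_overview.htm"),
    ("page_c.htm", "pfv_overview.htm"),
    ("page_d.htm", "pfv_overview.htm"),
]
broken = {"pfv_overview.htm"}


def report_per_reference(references):
    """Report every (parent, url) pair whose target is broken."""
    return [(parent, url) for parent, url in references if url in broken]


def report_once_per_url(references):
    """Skip URLs that were already seen, so only the first parent is reported."""
    seen = set()
    errors = []
    for parent, url in references:
        if url in seen:
            continue
        seen.add(url)
        if url in broken:
            errors.append((parent, url))
    return errors


print(len(report_per_reference(references)), "errors when every reference is reported")
print(len(report_once_per_url(references)), "error when each URL is checked once")
```

A possible middle ground, again only as an idea, would be to check the URL once but remember every parent that referenced it, so the result can still be reported per parent.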
9.4.0 does not find all problems and scans fewer items
linkchecker -r -1 --output=HTML rls_notes.htm > linkchecker_ug.html
Actual result
That's it. 1903 links checked. 0 warnings found. 1 error found.
Expected result
That's it. 22858 links checked. 0 warnings found. 4 errors found.