Open jimklimov opened 6 months ago
One promising suggestion was https://github.com/gjtorikian/html-proofer (packaged in Debian as ruby-html-proofer). It does at least report a few hundred issues, both inside the site and outside it (in third-party pages we refer to), so definitely something for me to chew on :)
Ran an htmlproofer analysis of the currently generated site. It took several hours and produced a number of complaints to handle subsequently:
https://ci.networkupstools.org/view/InfraTasks/job/nut-website/6193/pipeline-console/?selected-node=13
Many complaints about internal anchored links in output/protocols/apcsmart.html
Several other documents have their own misnamed(?) anchor links, as well as bad original asciidoc tags pointing to nearby documents (can be good for text browsing, but apparently not for resulting HTML).
Selected example complaints (overall there are about 700, although many patterns repeat in different locations):
- /home/jim/nut-website/networkupstools.github.io/docs/FAQ.html
* internally linking to UPGRADING, which does not exist (line 2)
<a class="ulink" href="UPGRADING" target="_top">UPGRADING</a>
* internally linking to docbook-xsl.css, which does not exist (line 2)
<link rel="stylesheet" type="text/css" href="docbook-xsl.css">
* internally linking to docs/config-notes.txt, which does not exist (line 86)
<a class="ulink" href="docs/config-notes.txt" target="_top">docs/config-notes.txt</a>
* internally linking to https://www.networkupstools.org/cables/940-0024C.jpg, which does not exist (line 68)
<a class="ulink" href="https://www.networkupstools.org/cables/940-0024C.jpg" target="_top">https://www.networkupstools.org/cables/940-0024C.jpg</a>
* internally linking to scheduling.txt, which does not exist (line 245)
<a class="ulink" href="scheduling.txt" target="_top">scheduling.txt</a>
* internally linking to security.txt, which does not exist (line 236)
<a class="ulink" href="security.txt" target="_top">security.txt</a>
* internally linking to security.txt, which does not exist (line 240)
<a class="ulink" href="security.txt" target="_top">security.txt</a>
* internally linking to upssched.txt, which does not exist (line 257)
<a class="ulink" href="upssched.txt" target="_top">upssched.txt</a>
- /home/jim/nut-website/networkupstools.github.io/docs/developer-guide.chunked/ar01s02.html
* internally linking to protocol.txt, which does not exist (line 19)
<a class="ulink" href="protocol.txt" target="_top">protocol.txt</a>
* internally linking to sock-protocol.txt, which does not exist (line 15)
<a class="ulink" href="sock-protocol.txt" target="_top">sock-protocol.txt</a>
- /home/jim/nut-website/networkupstools.github.io/docs/developer-guide.chunked/ar01s03.html
* internally linking to NEWS, which does not exist (line 690)
<a class="ulink" href="NEWS" target="_top">NEWS</a>
* internally linking to UPGRADING, which does not exist (line 692)
<a class="ulink" href="UPGRADING" target="_top">UPGRADING</a>
* internally linking to ci-farm-lxc-setup.txt, which does not exist (line 208)
<a class="ulink" href="ci-farm-lxc-setup.txt" target="_top">ci-farm-lxc-setup.txt</a>
[2024-05-08T21:23:30.418Z] - output/nut-qa.html
[2024-05-08T21:23:30.418Z] * linking to internal hash #NUT_Security that does not exist (line 314)
[2024-05-08T21:23:30.418Z] <a href="user-manual.html#NUT_Security">security features</a>
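Since many of the roughly 700 complaints repeat the same pattern, a tally per missing target may help decide what to fix first. Below is a minimal sketch, assuming the report wording matches the excerpts quoted above (the html-proofer 3.x phrasing; other versions may word things differently). The function name `tally_complaints` and the two regular expressions are illustrative, not part of html-proofer.

```python
import re
from collections import Counter

# Assumed line shapes, copied from the report excerpts quoted in this thread.
MISSING_LINK = re.compile(
    r"\* internally linking to (\S+), which does not exist \(line \d+\)"
)
MISSING_HASH = re.compile(
    r"\* linking to internal hash (#\S*) that does not exist \(line \d+\)"
)


def tally_complaints(report_lines):
    """Count how often each missing link target or internal hash is reported."""
    counts = Counter()
    for line in report_lines:
        for kind, pattern in (("link", MISSING_LINK), ("hash", MISSING_HASH)):
            match = pattern.search(line)
            if match:
                counts[(kind, match.group(1))] += 1
                break
    return counts


if __name__ == "__main__":
    sample = [
        "* internally linking to security.txt, which does not exist (line 236)",
        "* internally linking to security.txt, which does not exist (line 240)",
        "[2024-05-08T21:23:30.418Z] * linking to internal hash #NUT_Security"
        " that does not exist (line 314)",
    ]
    # Print the most frequent complaint targets first.
    for (kind, target), count in tally_complaints(sample).most_common():
        print(f"{count:4d}  {kind}  {target}")
```

The patterns use `search` rather than `match`, so lines carrying a Jenkins timestamp prefix are counted the same way as bare ones.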
Note the links to txt, not html:
[2024-05-08T21:23:30.415Z] - output/docs/user-manual.chunked/_setting_up_the_multi_arch_linux_lxc_container_farm_for_nut_ci.html
[2024-05-08T21:23:30.415Z] * internally linking to config-prereqs.txt, which does not exist (line 338)
[2024-05-08T21:23:30.415Z] <a class="ulink" href="config-prereqs.txt" target="_top">config-prereqs.txt</a>
[2024-05-08T21:23:30.417Z] - output/documentation.html
[2024-05-08T21:23:30.417Z] * linking to internal hash #Developer_man that does not exist (line 143)
[2024-05-08T21:23:30.417Z] <a href="docs/man/index.html#Developer_man">Developer manual pages</a>
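The relative `.txt` links can also be listed directly from a generated page, without a full proofer run. This is a rough sketch using only the Python standard library. The names `TxtLinkFinder` and `find_txt_links` are made up for illustration, and the assumption is that only relative `<a href>` values ending in `.txt` are of interest.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class TxtLinkFinder(HTMLParser):
    """Collect relative <a href> targets whose path ends in .txt."""

    def __init__(self):
        super().__init__()
        self.txt_links = []  # list of (line number, href) pairs

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        parts = urlsplit(href)
        # Keep relative links only: no scheme and no host part.
        if not parts.scheme and not parts.netloc and parts.path.endswith(".txt"):
            self.txt_links.append((self.getpos()[0], href))


def find_txt_links(html_text):
    """Return (line, href) for every relative .txt link in html_text."""
    finder = TxtLinkFinder()
    finder.feed(html_text)
    finder.close()
    return finder.txt_links


if __name__ == "__main__":
    page = (
        "<p>See also:</p>\n"
        '<a class="ulink" href="config-prereqs.txt" target="_top">'
        "config-prereqs.txt</a>\n"
        '<a href="index.html">home</a>\n'
    )
    for line_no, href in find_txt_links(page):
        print(f"line {line_no}: {href}")
```

Absolute URLs are skipped on purpose, since those would need a network check rather than a look at the generated tree.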
Hordes of apcsmart protocol links in particular:
[2024-05-08T21:23:30.419Z] - output/protocols/apcsmart.html
[2024-05-08T21:23:30.419Z] * linking to internal hash #@ that does not exist (line 729)
[2024-05-08T21:23:30.419Z] <a href="#@"><strong></strong></a>
...
[2024-05-08T21:23:30.420Z] * linking to internal hash #B that does not exist (line 508)
[2024-05-08T21:23:30.420Z] <a href="#B">actual voltage</a>
...
[2024-05-08T21:23:30.421Z] <a href="#D">calibrated</a>
[2024-05-08T21:23:30.421Z] * linking to internal hash #D that does not exist (line 558)
...
At least some of the "internal hash" issues may be false positives of the tool; see https://github.com/gjtorikian/html-proofer/issues/819
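To cross-check the "internal hash" complaints independently of the tool, one option is to compare same-page `#fragment` links against the anchors a page actually defines. The sketch below is a simplified approximation, not how html-proofer does it: it treats any element `id` and any `<a name>` as a valid target and percent-decodes the fragment before comparing. `FragmentAudit` and `missing_fragments` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import unquote


class FragmentAudit(HTMLParser):
    """Record anchor targets and same-page fragment links of one page."""

    def __init__(self):
        super().__init__()
        self.targets = set()       # values of id="..." and <a name="...">
        self.same_page_refs = []   # fragments from href="#..."

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("id"):
            self.targets.add(attrs["id"])
        if tag == "a" and attrs.get("name"):
            self.targets.add(attrs["name"])
        href = attrs.get("href")
        # A bare "#" is skipped; only named fragments are checked.
        if tag == "a" and href and href.startswith("#") and len(href) > 1:
            self.same_page_refs.append(unquote(href[1:]))


def missing_fragments(html_text):
    """Return the sorted fragments that are linked but never defined."""
    audit = FragmentAudit()
    audit.feed(html_text)
    audit.close()
    return sorted(set(audit.same_page_refs) - audit.targets)


if __name__ == "__main__":
    page = (
        '<a href="#B">actual voltage</a>'
        '<a name="B"></a>'
        '<a href="#D">calibrated</a>'
    )
    print(missing_fragments(page))
```

If a fragment is reported by the proofer but not by a check like this, that would be a hint to look at it as a possible false positive first.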
Also of note: the 3.14.x and 3.19.x versions on the Debian 12 and Ubuntu 22 workers tried so far are well behind current development (5.0.9 at the moment), which "saved" us from some other false positives but generally constrains the available features.
Not sure if newer versions have anything about parallel processing performance, but with 3.1x.y ones here I can't get it to happen. FWIW, question posted at https://github.com/gjtorikian/html-proofer/issues/840
Custom-building the tool seems possible, but may require a newer ruby (>= 3.1 < 4.0) to run.
Ruby custom install per:
gem
tool from OS packages; 3.3.1 did not)
:; git clone -o upstream https://github.com/gjtorikian/html-proofer
:; cd html-proofer
:; gem build html-proofer.gemspec
:; gem install html-proofer-5.0.9.gem
This makes the built proofer (and its dependencies) available in the user's local shim environment:
:; which htmlproofer
/home/jim/.asdf/shims/htmlproofer
:; htmlproofer --version
5.0.9
Posted as a question at https://unix.stackexchange.com/questions/775994/how-to-check-consistency-of-a-generated-web-site-using-recursive-html-parsing