networkupstools / nut-website

Network UPS Tools website and protocol library

Want to sanity-check the generated nut-website #52

Open jimklimov opened 6 months ago

jimklimov commented 6 months ago

I have a FOSS project whose web site is generated by asciidoc and some custom scripts as a horde (thousands) of static files. These are built locally in the source files' repo, copied into another workspace, and uploaded to a github.io-style repository, which is eventually served over HTTP for browsers around the world to see.

Users occasionally report that some of the links between site pages end up broken (lead nowhere).

The website build platform is generally POSIX-ish, although most often the agent doing the regular work is a Debian/Linux one. Maybe the platform differences cause the "page outages"; maybe this bug is platform-independent.

I had a thought about crafting a check for the two local directories as well as the resulting site to crawl all relative links (and/or absolute ones starting with its domain name(s)), and report any broken pages so I could focus on finding why they fail and/or avoiding publication of "bad" iterations - same as with compilers, debuggers and warnings elsewhere.

The general train of thought is about using some wget spider mode, though any other command-line tool (curl, lynx...), python script, shell with sed, etc. would do as well. Surely this particular wheel has been invented too many times for me to even think about making my own? A quick and cursory googling session during my commute did not turn up a good fit, however.
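
For illustration, a minimal sketch of such a check in Python (standard library only) could look like the following. It only covers the local-directory case, not crawling the published site. The names `LinkCollector`, `parse_page` and `check_site` are made up for this example, and treating both `id=` attributes and `<a name=...>` as valid anchor targets is an assumption about how the generated pages mark their anchors.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urlsplit, unquote


class LinkCollector(HTMLParser):
    """Collect link targets and anchor names from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []       # (line number, raw href/src value)
        self.anchors = set()  # values of id= and <a name=...> attributes

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:
                continue
            if name in ("href", "src"):
                self.links.append((self.getpos()[0], value))
            elif name == "id" or (tag == "a" and name == "name"):
                self.anchors.add(value)


def parse_page(path):
    """Parse one HTML file and return its LinkCollector."""
    collector = LinkCollector()
    with open(path, encoding="utf-8", errors="replace") as f:
        collector.feed(f.read())
    collector.close()
    return collector


def check_site(root):
    """Return (page, line, href, reason) tuples for broken local links."""
    root = os.path.normpath(root)
    pages = {}
    for dirpath, _dirs, files in os.walk(root):
        for filename in files:
            if filename.endswith((".html", ".htm")):
                path = os.path.normpath(os.path.join(dirpath, filename))
                pages[path] = parse_page(path)

    problems = []
    for page, info in sorted(pages.items()):
        for line, href in info.links:
            try:
                parts = urlsplit(href)
            except ValueError:
                problems.append((page, line, href, "unparseable URL"))
                continue
            if parts.scheme or parts.netloc:
                continue  # external link (or mailto: etc.), not checked here
            rel = unquote(parts.path)
            if not rel:
                target = page  # pure "#fragment" link into the same page
            elif rel.startswith("/"):
                target = os.path.normpath(os.path.join(root, rel.lstrip("/")))
            else:
                target = os.path.normpath(
                    os.path.join(os.path.dirname(page), rel))
            if os.path.isdir(target):
                target = os.path.join(target, "index.html")
            if not os.path.exists(target):
                problems.append((page, line, href, "target file does not exist"))
                continue
            fragment = unquote(parts.fragment)
            if fragment and target in pages \
                    and fragment not in pages[target].anchors:
                problems.append(
                    (page, line, href, "anchor #%s not found" % fragment))
    return problems
```

Calling `check_site("output")` (the directory name is just an example) returns a list that a wrapper script could print and turn into a non-zero exit code, so a "bad" iteration is not published.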

So, suggestions are welcome :)

Posted as a question at https://unix.stackexchange.com/questions/775994/how-to-check-consistency-of-a-generated-web-site-using-recursive-html-parsing

jimklimov commented 6 months ago

One promising suggestion was https://github.com/gjtorikian/html-proofer (packaged in Debian in ruby-html-proofer) - it does at least report a few hundred issues, inside the site and outside it (with third-party pages we refer to), so definitely something for me to chew on :)

jimklimov commented 6 months ago

Ran an htmlproofer analysis of the currently generated site. It took several hours and produced a number of complaints to handle subsequently: https://ci.networkupstools.org/view/InfraTasks/job/nut-website/6193/pipeline-console/?selected-node=13

Many complaints about internal anchored links in output/protocols/apcsmart.html

Several other documents have their own misnamed(?) anchor links, as well as bad original asciidoc tags pointing to nearby documents (can be good for text browsing, but apparently not for resulting HTML).

Selected example complaints (overall there are about 700, although many patterns repeat in different locations):

- /home/jim/nut-website/networkupstools.github.io/docs/FAQ.html
  *  internally linking to UPGRADING, which does not exist (line 2)
     <a class="ulink" href="UPGRADING" target="_top">UPGRADING</a>
  *  internally linking to docbook-xsl.css, which does not exist (line 2)
     <link rel="stylesheet" type="text/css" href="docbook-xsl.css">

  *  internally linking to docs/config-notes.txt, which does not exist (line 86)
     <a class="ulink" href="docs/config-notes.txt" target="_top">docs/config-notes.txt</a>
  *  internally linking to https://www.networkupstools.org/cables/940-0024C.jpg, which does not exist (line 68)
     <a class="ulink" href="https://www.networkupstools.org/cables/940-0024C.jpg" target="_top">https://www.networkupstools.org/cables/940-0024C.jpg</a>
  *  internally linking to scheduling.txt, which does not exist (line 245)
     <a class="ulink" href="scheduling.txt" target="_top">scheduling.txt</a>
  *  internally linking to security.txt, which does not exist (line 236)
     <a class="ulink" href="security.txt" target="_top">security.txt</a>
  *  internally linking to security.txt, which does not exist (line 240)
     <a class="ulink" href="security.txt" target="_top">security.txt</a>
  *  internally linking to upssched.txt, which does not exist (line 257)
     <a class="ulink" href="upssched.txt" target="_top">upssched.txt</a>
- /home/jim/nut-website/networkupstools.github.io/docs/developer-guide.chunked/ar01s02.html
  *  internally linking to protocol.txt, which does not exist (line 19)
     <a class="ulink" href="protocol.txt" target="_top">protocol.txt</a>
  *  internally linking to sock-protocol.txt, which does not exist (line 15)
     <a class="ulink" href="sock-protocol.txt" target="_top">sock-protocol.txt</a>
- /home/jim/nut-website/networkupstools.github.io/docs/developer-guide.chunked/ar01s03.html
  *  internally linking to NEWS, which does not exist (line 690)
     <a class="ulink" href="NEWS" target="_top">NEWS</a>
  *  internally linking to UPGRADING, which does not exist (line 692)
     <a class="ulink" href="UPGRADING" target="_top">UPGRADING</a>
  *  internally linking to ci-farm-lxc-setup.txt, which does not exist (line 208)
     <a class="ulink" href="ci-farm-lxc-setup.txt" target="_top">ci-farm-lxc-setup.txt</a>
[2024-05-08T21:23:30.418Z] - output/nut-qa.html
[2024-05-08T21:23:30.418Z]   *  linking to internal hash #NUT_Security that does not exist (line 314)
[2024-05-08T21:23:30.418Z]      <a href="user-manual.html#NUT_Security">security features</a>

Note the links to .txt rather than .html files:

[2024-05-08T21:23:30.415Z] - output/docs/user-manual.chunked/_setting_up_the_multi_arch_linux_lxc_container_farm_for_nut_ci.html
[2024-05-08T21:23:30.415Z]   *  internally linking to config-prereqs.txt, which does not exist (line 338)
[2024-05-08T21:23:30.415Z]      <a class="ulink" href="config-prereqs.txt" target="_top">config-prereqs.txt</a>

[2024-05-08T21:23:30.417Z] - output/documentation.html
[2024-05-08T21:23:30.417Z]   *  linking to internal hash #Developer_man that does not exist (line 143)
[2024-05-08T21:23:30.417Z]      <a href="docs/man/index.html#Developer_man">Developer manual pages</a>

Hordes of apcsmart protocol links in particular:

[2024-05-08T21:23:30.419Z] - output/protocols/apcsmart.html
[2024-05-08T21:23:30.419Z]   *  linking to internal hash #@ that does not exist (line 729)
[2024-05-08T21:23:30.419Z]      <a href="#@"><strong></strong></a>
...
[2024-05-08T21:23:30.420Z]   *  linking to internal hash #B that does not exist (line 508)
[2024-05-08T21:23:30.420Z]      <a href="#B">actual voltage</a>
...
[2024-05-08T21:23:30.421Z]      <a href="#D">calibrated</a>
[2024-05-08T21:23:30.421Z]   *  linking to internal hash #D that does not exist (line 558)
...
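
Since many of the roughly 700 complaints repeat the same pattern, a small script can help group them before digging in. The following is only a sketch: it assumes the console log lines look like the excerpts above (optional `[timestamp]` prefix, a `- page` line, then `*  ...` complaint lines), and the names `summarize` and `print_summary` are made up for this example.

```python
import re
from collections import Counter

# Optional CI console timestamp prefix, e.g. "[2024-05-08T21:23:30.418Z] "
TIMESTAMP = re.compile(r"^\[\d{4}-\d{2}-\d{2}T[\d:.]+Z\]\s?")
# "- path/to/page.html" starts the complaints for one page
PAGE = re.compile(r"^- (?P<page>\S.*)$")
MISSING_FILE = re.compile(
    r"^\s*\*\s+internally linking to (?P<target>.+?), "
    r"which does not exist \(line (?P<line>\d+)\)")
MISSING_HASH = re.compile(
    r"^\s*\*\s+linking to internal hash (?P<target>#\S*) "
    r"that does not exist \(line (?P<line>\d+)\)")


def summarize(lines):
    """Count complaints per (kind, target) and per page."""
    by_target = Counter()
    by_page = Counter()
    page = None
    for raw in lines:
        line = TIMESTAMP.sub("", raw.rstrip("\n"))
        match = PAGE.match(line)
        if match:
            page = match.group("page")
            continue
        for kind, regex in (("file", MISSING_FILE), ("hash", MISSING_HASH)):
            match = regex.match(line)
            if match:
                by_target[(kind, match.group("target"))] += 1
                by_page[page] += 1
                break
    return by_target, by_page


def print_summary(lines, top=20):
    """Print the most frequent broken targets and the worst pages."""
    by_target, by_page = summarize(lines)
    for (kind, target), count in by_target.most_common(top):
        print("%5d  %-4s  %s" % (count, kind, target))
    for page, count in by_page.most_common(top):
        print("%5d  %s" % (count, page))
```

Feeding it the saved console log, e.g. `print_summary(open("console.log"))` with a hypothetical file name, would show which targets (such as the `.txt` links or the apcsmart hashes) account for most of the noise.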

jimklimov commented 6 months ago

At least some of the "internal hash" issues can be false-positives of the tool, see https://github.com/gjtorikian/html-proofer/issues/819

Also of note: the 3.14.x and 3.19.x versions on the Debian 12 and Ubuntu 22 workers tried so far are quite behind current development (5.0.9 at the moment). This "saved" us from some other false positives, but generally constrains the available features.

Not sure whether newer versions do anything about parallel processing performance, but with the 3.1x.y ones here I can't get it to work. FWIW, question posted at https://github.com/gjtorikian/html-proofer/issues/840

jimklimov commented 6 months ago

Custom-building the tool seems possible, but may require a newer ruby (>= 3.1 < 4.0) to run.

Ruby custom install per:

:; bundle install


This makes the built proofer (and its dependencies) available in the user's local shim environment:

:; which htmlproofer
/home/jim/.asdf/shims/htmlproofer

:; htmlproofer --version
5.0.9