gjtorikian / html-proofer

Test your rendered HTML files to make sure they're accurate.

Want documentation (or even a feature?) to parallelize local scans #840

Open jimklimov opened 4 months ago

jimklimov commented 4 months ago

I have a generated site with several thousand pages, and htmlproofer can take upwards of an hour of CPU churn to come up with verdicts, using one CPU core while the others sit idle.

From the documentation (README) I see mentions of other projects this one builds on, such as "nokogiri", "typhoeus" or "hydra" (names that say little to someone outside the Ruby ecosystem). Per https://github.com/gjtorikian/html-proofer#configuring-typhoeus-and-hydra (and https://github.com/typhoeus/typhoeus itself), these handle parallel queries to remote web servers, and "hydra" exposes a max_concurrency setting.
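For reference, the README configures these via the Ruby API roughly like this (a minimal sketch based on the linked README section; the option values are illustrative, not recommendations):

require "html-proofer"

# Typhoeus options tune individual HTTP requests; Hydra options tune how
# many of those requests run in parallel. Note both apply to checking
# *external* (remote) links, not to parsing local files.
HTMLProofer.check_directory(
  "./out",
  typhoeus: { ssl_verifypeer: false },  # per-request Typhoeus option
  hydra:    { max_concurrency: 6 }      # parallelism of remote checks
).run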

I can in fact pass the latter via the CLI, but (as of htmlproofer 3.19.2 in Debian 12) it seems to have no effect: the system has been busy for over half an hour on a single CPU core and has not even reported the number of pages it would parse:

:; time htmlproofer --disable-external --hydra-config='{"max_concurrency": 6}' ./networkupstools.github.io/
Running ["ImageCheck", "ScriptCheck", "LinkCheck"] on ["./networkupstools.github.io/"] on *.html...

### top:
%Cpu(s): 13.0 us,  0.5 sy,  0.0 ni, 71.6 id,  1.7 wa,  0.0 hi,  0.5 si, 12.7 st
...
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 940002 abuild    20   0  882620 785876  10360 R 100.0   9.7  37:33.86 htmlproofer

(Note that the JSON here has to be strictly formatted, with quoted strings for object keys, not the bare tokens used in the Ruby examples in the README.)
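To illustrate the difference (my own paraphrase, not README text): the CLI flag wants strict JSON, while the README's Ruby examples use Ruby hash syntax with bare symbol keys:

--hydra-config='{"max_concurrency": 6}'   # CLI: strict JSON, keys must be quoted
hydra: { max_concurrency: 6 }             # Ruby API: bare symbol keys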

I have only recently started exploring this tool, so I have no idea whether it actually supports parallelized local scans.

I'd expect it to read a page, push links to neighboring pages onto a queue (unless those are already "known": queued, in progress, or fully processed), carry on analyzing the current page, then pick another page from the queue; rinse, repeat.

That structure lends itself well to parallelization (with synchronized access to the shared queue): many single-threaded workers each process one page, push any still-"unknown" links onto the queue, and return to pick up the next page.
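A minimal Ruby sketch of that worker/queue design (my illustration of the proposal, not html-proofer code; analyze_page is a hypothetical stand-in for the per-page checks):

require "set"

queue = Queue.new   # thread-safe FIFO from Ruby's stdlib
known = Set.new     # pages already queued or processed
mutex = Mutex.new   # guards the "known" set

known << "index.html"
queue << "index.html"

workers = Array.new(6) do
  Thread.new do
    loop do
      page = begin
        queue.pop(true)   # non-blocking pop; raises ThreadError when empty
      rescue ThreadError
        break             # naive termination; see the caveat below
      end
      analyze_page(page).each do |link|  # hypothetical: returns local links found on page
        # Set#add? returns nil if the link was already known, so each
        # page is enqueued at most once.
        queue << link if mutex.synchronize { known.add?(link) }
      end
    end
  end
end
workers.each(&:join)

The naive termination here exits a worker the moment the queue is momentarily empty, even while another worker may still be about to push more links; a real implementation would need to track in-flight pages before shutting down.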

Again, I have no idea whether something of the sort already exists and merely needs documentation, or whether it would have to be designed and implemented as well. Either way, cutting a (multi-)hour scan down to minutes would be very welcome, making the tool practical for regular CI sanity checks rather than only for one-off developer trials.

jimklimov commented 4 months ago

UPDATE: Running a custom build of 5.0.9 (or rather the current GitHub master), the link and file counters appear much faster; kudos. Still, the hydra concurrency setting does not make use of more CPU cores. The new-version wording of the command is:

:; time htmlproofer --disable-external --hydra='{"max_concurrency": 6}' ./networkupstools.github.io/