Open mmuehlfeldRH opened 11 months ago
Hey! In short: for anyone who is not the site owner, https://access.redhat.com will be checked slowly and gently, no matter what your settings are.
TL;DR: It looks like you are hitting the built-in DDoS protection. The server must return a specific response header to allow linkchecker to run as fast as possible.
In your test environment, you can deploy your site behind a local NGINX with a configuration such as:
add_header LinkChecker "allow-concurrent-checks";
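If NGINX is not at hand, the same idea can be sketched with Python's standard library: a static file server that adds the header to every response, plus a small helper to check whether a URL returns it. The class name `LinkCheckerHeaderHandler`, the helper `has_linkchecker_header`, and the port are made up for illustration; this is only meant for a local test setup.

```python
import http.server
import urllib.request


class LinkCheckerHeaderHandler(http.server.SimpleHTTPRequestHandler):
    """Serve files from the current directory and add the LinkChecker header."""

    def end_headers(self):
        # Same effect as the NGINX add_header line above.
        self.send_header("LinkChecker", "allow-concurrent-checks")
        super().end_headers()


def has_linkchecker_header(url):
    """Return True if the response for url carries a LinkChecker header."""
    with urllib.request.urlopen(url) as response:
        return "LinkChecker" in response.headers


def serve(port=8000):
    """Run the test server until interrupted (the port is an arbitrary choice)."""
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), LinkCheckerHeaderHandler)
    server.serve_forever()
```

Calling `serve()` from the directory that holds the built site starts the server; `has_linkchecker_header("http://127.0.0.1:8000/")` can then confirm the header is present before you start a check.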
You also need to set a parameter in a configuration file, for example:
cat <<EOF | tee $THINGSBOARD_WEBSITE_DIR/linkcheckerrc
[checking]
# To use values greater than 10, the HTTP server must return a "LinkChecker" response header.
# https://linkchecker.github.io/linkchecker/man/linkcheckerrc.html#checking
maxrequestspersecond=1000
EOF
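The same file can also be generated from a script, for example in a CI job. The following is a minimal sketch using Python's `configparser`; the function name `write_linkcheckerrc` and its default value are only illustrative, and it assumes linkchecker accepts the `key = value` spacing that `configparser` writes.

```python
import configparser
from pathlib import Path


def write_linkcheckerrc(path, max_requests_per_second=1000):
    """Write a linkcheckerrc with only the [checking] rate setting."""
    config = configparser.ConfigParser()
    # Values above 10 only take effect if the server sends the LinkChecker header.
    config["checking"] = {"maxrequestspersecond": str(max_requests_per_second)}
    with open(path, "w", encoding="utf-8") as handle:
        config.write(handle)
    return Path(path)
```

For example, `write_linkcheckerrc("/tmp/linkcheckerrc")` produces a file with a `[checking]` section and the single `maxrequestspersecond` entry.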
And finally, pass the configuration file to linkchecker with -f, using the path you wrote it to:
-f /tmp/linkcheckerrc
See: https://linkchecker.github.io/linkchecker/man/linkcheckerrc.html#checking
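To see why the thread count stops mattering under such a cap, here is a deliberately simplified model, not linkchecker's actual scheduling: the run takes as long as the slower of two bounds, the one set by the worker threads and the one set by the per-host rate limit. The function name and all numbers in the usage note are hypothetical.

```python
def estimated_runtime(num_urls, threads, seconds_per_request, max_requests_per_second):
    """Rough lower bound on the runtime in seconds for checking one host."""
    # Time needed if only the number of worker threads limited throughput.
    thread_bound = num_urls * seconds_per_request / threads
    # Time needed if only the per-host rate cap limited throughput.
    rate_bound = num_urls / max_requests_per_second
    # Whichever bound is larger dominates the run.
    return max(thread_bound, rate_bound)
```

With 200 URLs, 0.2 seconds per request, and a cap of 10 requests per second, the model gives the same estimate for 10 threads and for 100 threads, because the rate bound dominates in both cases. Only when the cap is raised (which requires the header) does adding threads shorten the run.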
@smatvienko-tb thanks for this information. This was very helpful.
If slowing down the checking is an intentional feature, it should be highlighted better in the documentation and not hidden in a single sentence in a parameter description of a long man page. I'm surely not the only one who thought that linkchecker is very slow compared to other tools. :-)
Even if I understand the idea of this artificial slowdown, I vote for adding a command-line option to allow users to turn it off, because:
Users might be the owner of the content (like me) but have no administrative access to the web server.
The source code of linkchecker is available, and users can easily turn off the limitation:
In linkcheck/checker/httpurl.py, change:
if "LinkChecker" in self.headers:
    self.aggregate.set_maxrated_for_host(self.urlparts[1])
to
# if "LinkChecker" in self.headers:
self.aggregate.set_maxrated_for_host(self.urlparts[1])
After changing these two lines, re-compiling, and adding maxrequestspersecond with a high value to linkcheckerrc, linkchecker is almost 6x faster:
$ time linkchecker --threads=100 -f /tmp/linkcheckerrc --recursion-level=1 --quiet --file-output=csv/utf-8//tmp/output.csv --check-extern https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html-single/managing_networking_infrastructure_services/index
1 thread active, 0 links queued, 0 links in 1 URL checked, runtime 1 seconds
3 threads active, 0 links queued, 154 links in 203 URLs checked, runtime 6 seconds
real 0m8.254s
user 0m6.166s
sys 0m0.507s
My recommendation: Keep the slowdown turned on by default, but add a command-line option to turn it off.
Just to make a point about how stupid users can be: I did find the maxrequestspersecond option, I did read the description and saw the default value of 10, but I did NOT see the comment about the LinkChecker header.
RTFM, I know. Perhaps it should be RTFMSlowly...
I hope I'm making you laugh: I only saw the text when I wanted to create a PR to add it! Now it's a PR to highlight the description of the header: #796
I also added a flag for the configuration file: #797
Summary
It makes no difference how many threads I specify; linkchecker always needs the same amount of time to complete.
Steps to reproduce
Actual result
Even if the number of threads is significantly increased, the check needs a similar amount of time.
Expected result
If the number of threads is increased, linkchecker should finish faster.
Environment
Configuration