linkchecker / linkchecker

check links in web documents or full websites
https://linkchecker.github.io/linkchecker/
GNU General Public License v2.0

The number of threads has no impact #778

Open mmuehlfeldRH opened 11 months ago

mmuehlfeldRH commented 11 months ago

Summary

It makes no difference how many threads I specify: linkchecker always needs the same amount of time to complete.

Steps to reproduce

$ time linkchecker --threads=10 --recursion-level=1 --quiet --file-output=csv/utf-8//tmp/output.csv --check-extern https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html-single/managing_networking_infrastructure_services/index
10 threads active,   146 links queued,    1 link in 157 URLs checked, runtime 1 seconds
10 threads active,   130 links queued,   17 links in 157 URLs checked, runtime 6 seconds
10 threads active,   115 links queued,   32 links in 157 URLs checked, runtime 11 seconds
10 threads active,    97 links queued,   50 links in 164 URLs checked, runtime 16 seconds
10 threads active,    79 links queued,   68 links in 179 URLs checked, runtime 21 seconds
10 threads active,    53 links queued,   94 links in 191 URLs checked, runtime 26 seconds
10 threads active,    42 links queued,  105 links in 195 URLs checked, runtime 31 seconds
10 threads active,    24 links queued,  123 links in 197 URLs checked, runtime 36 seconds
 3 threads active,     0 links queued,  154 links in 204 URLs checked, runtime 41 seconds

real    0m46.912s
user    0m6.654s
sys     0m0.341s
$ time linkchecker --threads=30 --recursion-level=1 --quiet --file-output=csv/utf-8//tmp/output.csv --check-extern https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html-single/managing_networking_infrastructure_services/index
30 threads active,   124 links queued,    3 links in 157 URLs checked, runtime 1 seconds
30 threads active,   108 links queued,   19 links in 157 URLs checked, runtime 6 seconds
30 threads active,    95 links queued,   32 links in 157 URLs checked, runtime 11 seconds
30 threads active,    78 links queued,   49 links in 162 URLs checked, runtime 16 seconds
30 threads active,    59 links queued,   68 links in 176 URLs checked, runtime 21 seconds
30 threads active,    41 links queued,   86 links in 189 URLs checked, runtime 26 seconds
30 threads active,    25 links queued,  102 links in 193 URLs checked, runtime 31 seconds
30 threads active,    12 links queued,  115 links in 195 URLs checked, runtime 36 seconds
22 threads active,     0 links queued,  135 links in 199 URLs checked, runtime 41 seconds

real    0m46.020s
user    0m6.551s
sys     0m0.382s
$ time linkchecker --threads=100 --recursion-level=1 --quiet --file-output=csv/utf-8//tmp/output.csv --check-extern https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html-single/managing_networking_infrastructure_services/index
 1 thread active,     0 links queued,    0 links in   1 URL checked, runtime 1 seconds
100 threads active,    35 links queued,   22 links in 157 URLs checked, runtime 6 seconds
100 threads active,    20 links queued,   37 links in 157 URLs checked, runtime 11 seconds
100 threads active,     8 links queued,   49 links in 160 URLs checked, runtime 16 seconds
89 threads active,     0 links queued,   68 links in 176 URLs checked, runtime 21 seconds
76 threads active,     0 links queued,   81 links in 184 URLs checked, runtime 26 seconds
62 threads active,     0 links queued,   95 links in 188 URLs checked, runtime 31 seconds
43 threads active,     0 links queued,  114 links in 195 URLs checked, runtime 36 seconds
22 threads active,     0 links queued,  135 links in 199 URLs checked, runtime 41 seconds

real    0m45.242s
user    0m7.048s
sys     0m0.564s
$ time linkchecker --threads=200 --recursion-level=1 --quiet --file-output=csv/utf-8//tmp/output.csv --check-extern https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html-single/managing_networking_infrastructure_services/index
146 threads active,     0 links queued,   11 links in 157 URLs checked, runtime 1 seconds
131 threads active,     0 links queued,   26 links in 157 URLs checked, runtime 6 seconds
117 threads active,     0 links queued,   40 links in 157 URLs checked, runtime 11 seconds
102 threads active,     0 links queued,   55 links in 170 URLs checked, runtime 16 seconds
89 threads active,     0 links queued,   68 links in 176 URLs checked, runtime 21 seconds
76 threads active,     0 links queued,   81 links in 184 URLs checked, runtime 26 seconds
62 threads active,     0 links queued,   95 links in 188 URLs checked, runtime 31 seconds
46 threads active,     0 links queued,  111 links in 189 URLs checked, runtime 36 seconds
28 threads active,     0 links queued,  129 links in 191 URLs checked, runtime 41 seconds
15 threads active,     0 links queued,  142 links in 196 URLs checked, runtime 46 seconds

real    0m49.140s
user    0m7.716s
sys     0m0.679s

Actual result

Even if the number of threads is significantly increased, the check takes about the same amount of time (45 to 49 seconds in every run above).
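
The status lines above also give a rough throughput for each run. The following is a minimal sketch, assuming only the first and last status line of each run as quoted in this report; the helper name `links_per_second` and the `RUNS` table are made up for illustration and are not part of linkchecker.

```python
import re

# Matches the tail of a status line, e.g.
# "154 links in 204 URLs checked, runtime 41 seconds".
STATUS = re.compile(r"(\d+) links? in\s+\d+ URLs? checked, runtime (\d+) seconds")


def links_per_second(first_line, last_line):
    """Average number of links checked per second between two status lines."""
    links0, t0 = map(int, STATUS.search(first_line).groups())
    links1, t1 = map(int, STATUS.search(last_line).groups())
    return (links1 - links0) / (t1 - t0)


# Tail of the first and last status line of each run, copied from the report.
RUNS = {
    10: ("1 link in 157 URLs checked, runtime 1 seconds",
         "154 links in 204 URLs checked, runtime 41 seconds"),
    30: ("3 links in 157 URLs checked, runtime 1 seconds",
         "135 links in 199 URLs checked, runtime 41 seconds"),
    100: ("0 links in   1 URL checked, runtime 1 seconds",
          "135 links in 199 URLs checked, runtime 41 seconds"),
    200: ("11 links in 157 URLs checked, runtime 1 seconds",
          "142 links in 196 URLs checked, runtime 46 seconds"),
}

rates = {threads: links_per_second(*lines) for threads, lines in RUNS.items()}
for threads, rate in rates.items():
    print(f"--threads={threads}: {rate:.2f} links/s")
```

All four runs land between roughly 2.9 and 3.8 links per second, which suggests that something other than the thread count is limiting the check.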

Expected result

If the number of threads is increased, linkchecker should finish faster.

Environment

Configuration

linkchecker -Dcmdline --threads=100 --recursion-level=1 --quiet --file-output=csv/utf-8//tmp/output.csv --check-extern https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html-single/managing_networking_infrastructure_services/index
DEBUG linkcheck.cmdline 2023-12-07 10:52:17,674 MainThread Python 3.12.0 (main, Oct  2 2023, 00:00:00) [GCC 13.2.1 20230918 (Red Hat 13.2.1-3)] on linux
DEBUG linkcheck.cmdline 2023-12-07 10:52:17,681 MainThread configuration: [('aborttimeout', 300),
 ('allowedschemes', []),
 ('authentication', []),
 ('checkextern', True),
 ('cookiefile', None),
 ('csv', {'encoding': 'utf-8', 'filename': '/tmp/output.csv', 'fileoutput': 1}),
 ('debugmemory', False),
 ('dot', {}),
 ('enabledplugins', []),
 ('externlinks', []),
 ('failures', {}),
 ('fileoutput', ['CSVLogger']),
 ('gml', {}),
 ('gxml', {}),
 ('html', {}),
 ('ignoreerrors', []),
 ('ignorewarnings', []),
 ('internlinks', []),
 ('localwebroot', None),
 ('logger', 'NoneLogger'),
 ('loginextrafields', {}),
 ('loginpasswordfield', 'password'),
 ('loginurl', None),
 ('loginuserfield', 'login'),
 ('maxfilesizedownload', 5242880),
 ('maxfilesizeparse', 1048576),
 ('maxhttpredirects', 10),
 ('maxnumurls', None),
 ('maxrequestspersecond', 10),
 ('maxrunseconds', None),
 ('nntpserver', None),
 ('none', {}),
 ('output', 'text'),
 ('pluginfolders', []),
 ('quiet', False),
 ('recursionlevel', 1),
 ('resultcachesize', 100000),
 ('robotstxt', True),
 ('sitemap', {}),
 ('sql', {}),
 ('sslverify', '/etc/ssl/certs/ca-certificates.crt'),
 ('status', True),
 ('status_wait_seconds', 5),
 ('text', {}),
 ('threads', 100),
 ('timeout', 60),
 ('trace', False),
 ('useragent',
  'Mozilla/5.0 (compatible; LinkChecker/10.2.1; '
  '+https://linkchecker.github.io/linkchecker/)'),
 ('verbose', False),
 ('warnings', True),
 ('xml', {})]
100 threads active,    47 links queued,   10 links in 157 URLs checked, runtime 1 seconds
100 threads active,    32 links queued,   25 links in 157 URLs checked, runtime 6 seconds
100 threads active,    18 links queued,   39 links in 157 URLs checked, runtime 11 seconds
100 threads active,     4 links queued,   53 links in 169 URLs checked, runtime 16 seconds
88 threads active,     0 links queued,   69 links in 176 URLs checked, runtime 21 seconds
72 threads active,     0 links queued,   85 links in 184 URLs checked, runtime 26 seconds
54 threads active,     0 links queued,  103 links in 189 URLs checked, runtime 31 seconds
32 threads active,     0 links queued,  125 links in 196 URLs checked, runtime 36 seconds
14 threads active,     0 links queued,  143 links in 200 URLs checked, runtime 41 seconds

smatvienko-tb commented 11 months ago

Hey! In short: if you are not the site owner, https://access.redhat.com will be checked slowly and gently, no matter what your settings are.

TL;DR: It looks like you are running into linkchecker's built-in DDoS protection. The server must return a specific response header to allow linkchecker to run as fast as possible.

In your test environment, you can deploy your site behind a local NGINX with a configuration such as:

add_header LinkChecker "allow-concurrent-checks";
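
If NGINX is not at hand, the same idea can be sketched with Python's standard library. This is a swapped-in alternative for local testing, not something the linkchecker documentation prescribes; the class name `LinkCheckerHeaderHandler` and the helper `make_server` are hypothetical, and the header value is simply the one from the NGINX line above.

```python
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


class LinkCheckerHeaderHandler(SimpleHTTPRequestHandler):
    """Serve files from the current directory and add a LinkChecker header."""

    def end_headers(self):
        # Add the header to every response before the header block is closed.
        self.send_header("LinkChecker", "allow-concurrent-checks")
        super().end_headers()


def make_server(port=8000):
    """Create (but do not start) a local test server bound to 127.0.0.1."""
    return ThreadingHTTPServer(("127.0.0.1", port), LinkCheckerHeaderHandler)
```

Calling `make_server().serve_forever()` from the directory that holds the generated site would serve it on http://127.0.0.1:8000/ with the header set on every response.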

You also need to raise the limit in a configuration file, for example:

cat <<EOF | tee "$THINGSBOARD_WEBSITE_DIR/linkcheckerrc"
[checking]
# To use values greater than 10, the HTTP server must return a "LinkChecker" response header.
# https://linkchecker.github.io/linkchecker/man/linkcheckerrc.html#checking
maxrequestspersecond=1000
EOF

Finally, pass that file to linkchecker with the -f option (adjust the path to wherever you wrote the file):

-f /tmp/linkcheckerrc

See: https://linkchecker.github.io/linkchecker/man/linkcheckerrc.html#checking
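
The rule described above can be sketched as follows. This is only an illustration of the documented behavior (values above 10 take effect only when the server sends a "LinkChecker" response header), not linkchecker's actual implementation; the function name `effective_max_rps` and the constant `DEFAULT_CAP` are hypothetical.

```python
# Documented ceiling for maxrequestspersecond when the server does not
# send a "LinkChecker" response header.
DEFAULT_CAP = 10


def effective_max_rps(configured, response_headers):
    """Return the per-host request rate that would apply (hypothetical helper).

    response_headers is a plain dict here, so the lookup is case-sensitive;
    real HTTP libraries usually expose case-insensitive header mappings.
    """
    if "LinkChecker" in response_headers:
        # The server opted in: the configured value is used as-is.
        return configured
    # No opt-in: values above the ceiling are clamped.
    return min(configured, DEFAULT_CAP)


print(effective_max_rps(1000, {}))
print(effective_max_rps(1000, {"LinkChecker": "allow-concurrent-checks"}))
```

Under these assumptions, setting maxrequestspersecond=1000 against a server that does not send the header still leaves the effective limit at 10, which matches the flat timings reported at the top of this issue.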

mmuehlfeldRH commented 10 months ago

@smatvienko-tb thanks for this information. This was very helpful.

If slowing down the check is an intentional feature, it should be highlighted better in the documentation rather than hidden in a single sentence of a parameter description in a long man page. I'm surely not the only one who thought that linkchecker is very slow compared to other tools. :-)

Even if I understand the idea of this artificial slowdown, I vote for adding a command-line option to allow users to turn it off, because:

  1. Users might be the owner of the content (like me) but have no administrative access to the web server.

  2. The source code of linkchecker is available, and users can easily turn off the limitation:

    In linkcheck/checker/httpurl.py, change:

        if "LinkChecker" in self.headers:
            self.aggregate.set_maxrated_for_host(self.urlparts[1])

    to

        # if "LinkChecker" in self.headers:
        self.aggregate.set_maxrated_for_host(self.urlparts[1])

After changing these two lines, rebuilding, and setting maxrequestspersecond to a high value in linkcheckerrc, linkchecker is almost 6x faster:

$ time linkchecker --threads=100 -f /tmp/linkcheckerrc --recursion-level=1 --quiet --file-output=csv/utf-8//tmp/output.csv --check-extern https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html-single/managing_networking_infrastructure_services/index
 1 thread active,     0 links queued,    0 links in   1 URL checked, runtime 1 seconds
 3 threads active,     0 links queued,  154 links in 203 URLs checked, runtime 6 seconds

real    0m8.254s
user    0m6.166s
sys     0m0.507s
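
For reference, the speedup can be worked out from the "real" times quoted in this thread. A small sketch, assuming the unpatched --threads=100 run from the original report as the baseline; the variable names are illustrative only.

```python
# "real" wall-clock times in seconds, copied from the runs in this thread.
baseline_real = 45.242  # --threads=100, unpatched
patched_real = 8.254    # --threads=100, patched, high maxrequestspersecond

speedup = baseline_real / patched_real
print(f"speedup: {speedup:.1f}x")
```

This comes out at about 5.5x; using the 10-thread run (46.912 s) as the baseline instead gives about 5.7x.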

My recommendation: Keep the slowdown turned on by default, but add a command-line option to turn it off.

ineiti commented 9 months ago

Just to make a point about how stupid users can be: I did find the maxrequestspersecond option, I did read the description and saw the default value of 10, but I did NOT see the comment about the LinkChecker header.

RTFM, I know. Perhaps it should be RTFMSlowly...

I hope I'm making you laugh: I only saw the text when I wanted to create a PR to add it! Now it's a PR to highlight the description of the header: #796

I also added a flag for the configuration file: #797