sypets / brofix

Check for broken links, forked from TYPO3 system extension linkvalidator

Fix problem with false positives (external URLs) #13

Open sypets opened 3 years ago

sypets commented 3 years ago

todos

summary

So far, the following reasons for false positives could be verified:

  1. certificate chain issue: this is actually an error on the side of the web server being checked, but it is a minor one and the page loads in the browser without a warning, so the user perceives (!) the link as not broken. This should be distinguished from other TLS security issues, such as an outdated certificate.

    • error is usually curl 60 (can be verified by running curl -I "url" on the server):

      curl: (60) Peer's Certificate issuer is not recognized. More details here: http://curl.haxx.se/docs/sslcerts.html

    • SSLLabs shows "chain issues: incomplete" and "extra download"

    • to fix on the server side: put the complete certificate chain in the certificate (including intermediate certificates)

    • to fix on the client side (the server where brofix is running): download the intermediate certificates

  2. cloudflare

    • error 503 is given
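A checker could report the Cloudflare case as "uncertain" instead of broken. The following is only a minimal sketch in Python, not brofix code: the function name is hypothetical, and the marker strings are the CSS class names from the challenge page quoted further down in this issue. Cloudflare may change them at any time.

```python
# Markers assumed from the Cloudflare challenge page quoted in this issue.
CLOUDFLARE_MARKERS = ("cf-browser-verification", "cf-im-under-attack")


def looks_like_cloudflare_challenge(status: int, body: str) -> bool:
    """Return True if a 503 response body carries a Cloudflare challenge marker."""
    if status != 503:
        return False
    return any(marker in body for marker in CLOUDFLARE_MARKERS)
```

Note that this needs the response body, so it would not work with a HEAD request alone.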

problem description

some URLs are reported as errors even though they work (in browser)

Examples:

The 999 HTTP error is a LinkedIn error. It happens when LinkedIn blocks the User-Agent that tries to access a link. I'm afraid this is an issue on LinkedIn's side: it treats your site as a fake User-Agent. Since the link is not broken, please feel free to mark it as a "Not Broken" link.



other


Apart from this, all 401, 403 (access restricted URLs) will fail. In that case, it is not really an error, but expected. For these cases, they could either be added as exclude link target entry, or we could make external link type errors configurable (e.g. have an exclude list for that as well, where you could exclude for example 401, 403, maybe also "too many redirects").
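The idea of a configurable exclude list for status codes could look roughly like this. It is an illustrative Python sketch only; the names and the default set are assumptions and do not correspond to any existing brofix option.

```python
# Hypothetical default: status codes that are reported as "excluded", not broken.
DEFAULT_EXCLUDED_STATUS = frozenset({401, 403})


def classify_status(status: int, excluded=DEFAULT_EXCLUDED_STATUS) -> str:
    """Map an HTTP status code to 'excluded', 'ok' or 'broken'."""
    if status in excluded:
        return "excluded"
    if 200 <= status < 400:
        return "ok"
    return "broken"
```

A site that links to LinkedIn a lot could then pass `excluded={401, 403, 999}`, while everyone else keeps the stricter default.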


see also: https://notes.typo3.org/linkvalidator_problem_external_urls

Related:

sypets commented 3 years ago

Analysis of some URLs which are causing problems.

Currently brofix sends the following HTTP headers (see TSconfig):

User-Agent: configurable
Accept: */*
Accept-Language: *
Accept-Encoding: *

It looks like the Accept-Language / Accept-Encoding may be causing problems in some cases.

It is possible to simulate this with curl:

curl -IL -H "Accept-Language: *" -H "Accept-Encoding: *" URL


curl sends these headers (by default):

curl -ILv URL

HEAD /pages/de/news411455 HTTP/2
Host: idw-online.de
user-agent: curl/7.68.0
accept: */*

be sure to add the -L to follow redirects ....
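To compare both header sets outside of TYPO3, something like the following Python sketch could be used. It relies only on the standard library; the function names and the default User-Agent value are made up for this example, and the header sets are copied from the lists above.

```python
import urllib.error
import urllib.request

# Header sets taken from this comment: what brofix sends vs. what curl sends.
BROFIX_HEADERS = {"Accept": "*/*", "Accept-Language": "*", "Accept-Encoding": "*"}
CURL_HEADERS = {"Accept": "*/*"}


def build_head_request(url, headers, user_agent="curl/7.68.0"):
    """Create a HEAD request carrying the given headers (nothing is sent yet)."""
    request = urllib.request.Request(url, method="HEAD")
    request.add_header("User-Agent", user_agent)
    for name, value in headers.items():
        request.add_header(name, value)
    return request


def head_status(request, timeout=10.0):
    """Send the request and return the final HTTP status code."""
    # urlopen follows redirects by default (similar to curl -L); as far as I
    # know, urllib re-issues the redirected request as GET rather than HEAD.
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code
```

Calling `head_status(build_head_request(url, BROFIX_HEADERS))` and the same with `CURL_HEADERS` for a problematic URL should show whether the extra headers make a difference.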

sypets commented 2 years ago

Reason: Incomplete certificate chain

Example:

curl -I "https://www.ylook.de/search.php?&linklist_idx=11116563"
curl: (60) SSL certificate problem: unable to get local issuer certificate

Solutions

  1. implicit exclude (consider URLs as not broken for now) (this is not a good solution, as these servers often have other issues as well)
  2. Add intermediate certs on server.

This could be done with an extra tool but should not be implemented in brofix.

  1. extend client somehow (brofix:ExternalLinkType)
  2. extend guzzle somehow, see

That is a good start, but instead of extending the client, I suggest creating an event subscriber that can work for both synchronous and asynchronous requests.

Use custom CA bundle
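For illustration, this is how a custom CA bundle can be added on the client side in Python. It is not the Guzzle solution, just a sketch of the same idea with the standard library; the function name and the file path in the usage note are hypothetical.

```python
import ssl


def context_with_extra_ca(extra_ca_file=None):
    """Build a verifying TLS context, optionally trusting extra certificates."""
    # Starts from the system trust store with verification switched on.
    context = ssl.create_default_context()
    if extra_ca_file is not None:
        # PEM file with the downloaded intermediate certificate(s).
        context.load_verify_locations(cafile=extra_ca_file)
    return context
```

The context can then be passed on, e.g. `urllib.request.urlopen(url, context=context_with_extra_ca("/path/to/intermediates.pem"))`. Certificate verification stays enabled, so expired or mismatching certificates should still fail.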

Side note

The same error code (curl(60)) may also be the result for more severe TLS / certificate issues.

The error code is the same, but the error message (in command line curl) is different, for example:

certificate has expired
self-signed certificate

  1. "Certificate name mismatch". Unfortunately, it is not possible to determine which is the case just from the error message.

Example:

curl -I "https://klimakongressoldenburg.de"
curl: (60) SSL certificate problem: self-signed certificate

The certificate has a "Certificate name mismatch", see Qualys SSL Labs

  1. Certificate expired, e.g.
curl -I https://openjournal.uni-oldenburg.de/
curl: (60) SSL certificate problem: certificate has expired
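A rough way to separate the curl 60 cases by message text is sketched below in Python. The mapping is an assumption based only on the messages quoted in this issue; the wording depends on the curl / TLS library version, and as noted above the message is not always reliable (the name mismatch example is reported as self-signed).

```python
# (substring, label) pairs, derived from the curl messages quoted in this issue.
CURL60_HINTS = (
    ("unable to get local issuer certificate", "possibly incomplete chain"),
    ("issuer is not recognized", "possibly incomplete chain"),
    ("certificate has expired", "expired"),
    ("self signed certificate", "self-signed (as reported)"),
)


def classify_curl60(message: str) -> str:
    """Guess the kind of certificate problem from a curl (60) error message."""
    # Normalise case and hyphens so "self-signed" and "self signed" both match.
    normalised = message.lower().replace("-", " ")
    for needle, label in CURL60_HINTS:
        if needle in normalised:
            return label
    return "unknown"
```

Only the "possibly incomplete chain" group is a candidate for being treated as a false positive; the others are real certificate problems.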

see resources:

guzzle

other


sypets commented 2 years ago

Reason: probably Cloudflare DDoS protection ("I'm Under Attack" mode)

<tr>
      <td align="center" valign="middle">
          <div class="cf-browser-verification cf-im-under-attack">
  <noscript>
    <h1 data-translate="turn_on_js" style="color:#bd2426;">Please turn JavaScript on and reload the page.</h1>