textlint-rule / textlint-rule-no-dead-link

textlint rule to check if all links are alive.

Error `ECONNRESET` for validating some of the external links #111

Closed amimas closed 5 years ago

amimas commented 5 years ago

I have the following snippet in my markdown file:

Navigate to [MySQL distribution](https://dev.mysql.com/downloads/mysql/) to install MySQL `5.7`.

This rule keeps failing with the following error message:

  60:34  error  https://dev.mysql.com/downloads/mysql/ is dead. (request to https://dev.mysql.com/downloads/mysql/ failed, reason: connect ECONNRESET 137.254.60.11:443)  no-dead-link

The link, https://dev.mysql.com/downloads/mysql/, is valid, but I'm not sure why the rule is unable to verify it. When I run curl -v https://dev.mysql.com/downloads/mysql/ from my terminal, I get the following headers plus the site's HTML response:

*   Trying 137.254.60.11...
* TCP_NODELAY set
* Connected to dev.mysql.com (137.254.60.11) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/cert.pem
  CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: C=US; ST=California; L=Redwood City; O=Oracle Corporation; OU=Production Engineering and Operation; CN=www.mysql.com
*  start date: Jan 25 00:00:00 2019 GMT
*  expire date: Mar 25 12:00:00 2020 GMT
*  subjectAltName: host "dev.mysql.com" matched cert's "dev.mysql.com"
*  issuer: C=US; O=DigiCert Inc; CN=DigiCert SHA2 Secure Server CA
*  SSL certificate verify ok.
> GET /downloads/mysql/ HTTP/1.1
> Host: dev.mysql.com
> User-Agent: curl/7.54.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Date: Fri, 21 Jun 2019 14:17:44 GMT
< Server: Apache
< X-Frame-Options: SAMEORIGIN
< Strict-Transport-Security: max-age=15768000
< Set-Cookie: MySQL_S=4rb6luvlhcf1lteuqv1es0l54c9n0c9n; path=/; domain=mysql.com; HttpOnly
< Expires: Thu, 19 Nov 1981 08:52:00 GMT
< Cache-Control: no-store, no-cache, must-revalidate
< Pragma: no-cache
< Cache-Control: no-cache, private
< Vary: Accept-Encoding
< X-XSS-Protection: 1; mode=block
< X-Content-Type-Options: nosniff
< Transfer-Encoding: chunked
< Content-Type: text/html; charset=UTF-8
< 
<

I get a 200 OK response from curl, and I can obviously access the same link from my browser, but I can't ping that domain: ping dev.mysql.com comes back with 0 successes. I think the server is configured not to respond to pings, because the external ping utility at https://www.ipaddressguide.com/ping also reports 100% failures.

Not sure if this is a bug within the linter or if there's room for improvements. Right now my only option is to add that domain to the ignore option of this rule.
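For reference, the workaround mentioned above might look like the following. This is a minimal sketch written as a JavaScript object (the same keys would go in a JSON config file); it assumes the rule's ignore option accepts the exact URL as a string, and the variable name `ignoreMysqlConfig` is only illustrative.

```javascript
// Sketch of a textlint configuration that skips the flaky MySQL link.
// Assumption: the `ignore` option of no-dead-link takes a list of URLs.
const ignoreMysqlConfig = {
  rules: {
    "no-dead-link": {
      // Links listed here are not checked at all, so a genuinely dead
      // link at this address would go unnoticed.
      ignore: ["https://dev.mysql.com/downloads/mysql/"],
    },
  },
};
```

The trade-off is that ignored links are never verified, which is why this is only a stopgap.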

amimas commented 5 years ago

I'm running into a similar ECONNRESET error for some other external links too.

34:28   error  https://jcp.org/en/jsr/detail?id=330 is dead. (request to https://jcp.org/en/jsr/detail?id=330 failed, reason: read ECONNRESET)                      no-dead-link
  83:198  error  https://www.osgi.org/developer/specifications/ is dead. (request to https://www.osgi.org/developer/specifications/ failed, reason: read ECONNRESET)  no-dead-link
  26:42  error  https://jax-rs-spec.java.net/ is dead. (request to https://jax-rs-spec.java.net/ failed, reason: read ECONNRESET)  no-dead-link

Sometimes the above errors only appear on my local machine and not in the CI pipeline. Sometimes they appear in both.

amimas commented 5 years ago

I just realized the same link gets reported as invalid with a different reason. For example:

  34:28  error  https://jcp.org/en/jsr/detail?id=330 is dead. (request to https://jcp.org/en/jsr/detail?id=330 failed, reason: connect ECONNRESET 137.254.60.38:443)  no-dead-link

This shows it's failing with reason: connect ECONNRESET, while the exact same link mentioned in the previous comment had reason: read ECONNRESET.
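Since the same link fails intermittently and with varying reasons, one way to reason about it is to treat these codes as possibly transient and retry before declaring a link dead. The following is only a sketch of that idea, not the rule's actual behaviour: `isTransientError` and `checkWithRetry` are hypothetical helpers, and it assumes the thrown error exposes a Node-style `code` property.

```javascript
// Error codes reported in this thread that may indicate a temporary
// network problem rather than a dead link (an assumption, not a rule).
const TRANSIENT_CODES = new Set(["ECONNRESET", "ECONNREFUSED"]);

// Returns true when the error carries one of the codes above.
function isTransientError(error) {
  return Boolean(error) && TRANSIENT_CODES.has(error.code);
}

// Calls `check` (any function returning a promise) up to `attempts`
// times, retrying only when the failure looks transient.
async function checkWithRetry(check, attempts = 3, delayMs = 500) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await check();
    } catch (error) {
      lastError = error;
      if (!isTransientError(error)) break; // other failures: stop early
      if (i < attempts - 1) {
        // Wait a little before the next attempt.
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

A caller would pass a function that performs the actual HTTP request, for example `checkWithRetry(() => fetch(url, { method: "HEAD" }))` in an environment that provides `fetch`.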

amimas commented 5 years ago

I'm also getting an ECONNREFUSED error from time to time:

   7:22   error  http://tools.ietf.org/html/rfc6749 is dead. (request to http://tools.ietf.org/html/rfc6749 failed, reason: connect ECONNREFUSED 64.170.98.42:80)                  no-dead-link
  16:140  error  http://tools.ietf.org/html/rfc6749#page-10 is dead. (request to http://tools.ietf.org/html/rfc6749#page-10 failed, reason: connect ECONNREFUSED 64.170.98.42:80)  no-dead-link

azu commented 5 years ago

Thanks for the report.

curl -H 'User-Agent:' -H 'Accept:' -H 'Host:' -v https://dev.mysql.com/downloads/mysql/
*   Trying 137.254.60.11...
* TCP_NODELAY set
* Connected to dev.mysql.com (137.254.60.11) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/cert.pem
  CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: C=US; ST=California; L=Redwood City; O=Oracle Corporation; OU=Production Engineering and Operation; CN=www.mysql.com
*  start date: Jan 25 00:00:00 2019 GMT
*  expire date: Mar 25 12:00:00 2020 GMT
*  subjectAltName: host "dev.mysql.com" matched cert's "dev.mysql.com"
*  issuer: C=US; O=DigiCert Inc; CN=DigiCert SHA2 Secure Server CA
*  SSL certificate verify ok.
> GET /downloads/mysql/ HTTP/1.1
>
< HTTP/1.1 400 Bad Request
< Date: Sun, 23 Jun 2019 03:45:28 GMT
< Server: Apache
< X-Frame-Options: SAMEORIGIN
< Content-Length: 226
< Connection: close
< Content-Type: text/html; charset=iso-8859-1
<
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>400 Bad Request</title>
</head><body>
<h1>Bad Request</h1>
<p>Your browser sent a request that this server could not understand.<br />
</p>
</body></html>
* Closing connection 0
* TLSv1.2 (OUT), TLS alert, Client hello (1):

mysql.com probably refuses requests without a User-Agent header.

Should we add default User-Agent and Accept headers?

https://github.com/textlint-rule/textlint-rule-no-dead-link/blob/3f9508c2de463c99ac257b9e68b7d20e75e3d8dc/src/no-dead-link.js#L66-L76
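As a rough illustration of the proposal, default headers could be merged into every request's options. This is a hedged sketch rather than the code at the link above: `buildRequestOptions` is a hypothetical helper, and the header values are the ones quoted further down in this thread.

```javascript
// Default headers; the values match the snippet quoted later in this thread.
const DEFAULT_HEADERS = {
  "User-Agent": "textlint-rule-no-dead-link/1.0",
  Accept: "*/*",
};

// Builds options for a fetch-style request. Caller-supplied headers
// override the defaults when they use the same key spelling.
function buildRequestOptions(method, extraHeaders = {}) {
  return {
    method,
    headers: { ...DEFAULT_HEADERS, ...extraHeaders },
  };
}
```

For example, `buildRequestOptions("HEAD")` yields a HEAD request carrying both default headers, while `buildRequestOptions("GET", { Accept: "text/html" })` replaces only the Accept value.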

azu commented 5 years ago

I've added default User-Agent and Accept headers in https://github.com/textlint-rule/textlint-rule-no-dead-link/pull/116. However, the test also passes without the User-Agent and Accept headers, so there may be another reason for this issue.

azu commented 5 years ago

Released: https://github.com/textlint-rule/textlint-rule-no-dead-link/releases/tag/4.4.2

@amimas Can you try it again?

amimas commented 5 years ago

Thanks @azu. I will try it out. You're right that there could be other reasons as well. I have been getting invalid-link reports for some of the links inconsistently.

amimas commented 5 years ago

I just got the chance to try it out. It's probably working for the ECONNRESET related errors. The following error disappeared after I updated to 4.4.2

  83:198  error  https://www.osgi.org/developer/specifications/ is dead. (request to https://www.osgi.org/developer/specifications/ failed, reason: socket hang up)  no-dead-link

Update: The above error was probably a one-time issue. I didn't get that error before. So far I haven't seen the ECONNRESET error again, but I will continue to run the tests.

However, I'm still getting an ECONNREFUSED error from these two links:

   7:22   error  https://tools.ietf.org/html/rfc6749 is dead. (request to https://tools.ietf.org/html/rfc6749 failed, reason: connect ECONNREFUSED 64.170.98.42:443)                  no-dead-link
  16:140  error  https://tools.ietf.org/html/rfc6749#page-10 is dead. (request to https://tools.ietf.org/html/rfc6749#page-10 failed, reason: connect ECONNREFUSED 64.170.98.42:443)  no-dead-link

azu commented 5 years ago

I figured out that ietf.org requires the Host: header.

curl -I -H 'User-Agent: a' -H 'Host:' -H 'Accept:' -v https://tools.ietf.org/html/rfc6749

> HEAD /html/rfc6749 HTTP/1.1
> User-Agent: a
>
< HTTP/1.1 400 Bad Request
HTTP/1.1 400 Bad Request

With Host set to the same host:

curl -I -H 'User-Agent: a' -H 'Host: tools.ietf.org' -H 'Accept:' -v https://tools.ietf.org/html/rfc6749

> HEAD /html/rfc6749 HTTP/1.1
> Host: tools.ietf.org
> User-Agent: a
>
< HTTP/1.1 200 OK
HTTP/1.1 200 OK

amimas commented 5 years ago

That's a good finding @azu

I did some further tests from your findings and as expected if I set the User-Agent to be what is set by Chrome browser, I get 200 OK response from this host. For example:

$ curl -I -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36' -H 'Accept:' -v https://tools.ietf.org/html/rfc6749

*   Trying 64.170.98.42...
* TCP_NODELAY set
* Connected to tools.ietf.org (64.170.98.42) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/cert.pem
  CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: OU=Domain Control Validated; CN=*.tools.ietf.org
*  start date: Oct  1 17:24:13 2018 GMT
*  expire date: Nov 30 23:34:19 2019 GMT
*  subjectAltName: host "tools.ietf.org" matched cert's "tools.ietf.org"
*  issuer: C=US; ST=Arizona; L=Scottsdale; O=Starfield Technologies, Inc.; OU=http://certs.starfieldtech.com/repository/; CN=Starfield Secure Certificate Authority - G2
*  SSL certificate verify ok.
> HEAD /html/rfc6749 HTTP/1.1
> Host: tools.ietf.org
> User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36
> 
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
...

These also worked:

curl -I -H 'User-Agent: Chrome' -H 'Accept:' -v https://tools.ietf.org/html/rfc6749
curl -I -H 'User-Agent: Firefox' -H 'Accept:' -v https://tools.ietf.org/html/rfc6749

I think we need the User-Agent value to be a configurable option, with a more sensible default.

Right now, after the latest update in 4.4.2, the rule sets the User-Agent to the value below, but it seems some web servers don't like it.

    headers: {
      'User-Agent': 'textlint-rule-no-dead-link/1.0',
      'Accept': '*/*'
    },

I am trying to decide if we need a Global configuration of User-Agent that applies to all URLs or if we need per domain/url specific configuration of User-Agent.

What are your thoughts?
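To make the two alternatives concrete, here is a sketch of how a global default could coexist with per-host overrides. Everything in it is hypothetical: the option shape (`default`, `perHost`) and the helper `resolveUserAgent` are made up for illustration and are not existing options of the rule.

```javascript
// Hypothetical option shape: one global default plus per-host overrides.
const userAgentOptions = {
  default: "textlint-rule-no-dead-link/1.0",
  perHost: {
    "tools.ietf.org": "Chrome",
  },
};

// Picks the User-Agent for a URL: a per-host override if one exists,
// otherwise the global default.
function resolveUserAgent(uri, options) {
  const { hostname } = new URL(uri);
  const overrides = options.perHost || {};
  return Object.prototype.hasOwnProperty.call(overrides, hostname)
    ? overrides[hostname]
    : options.default;
}
```

With this shape, a global-only configuration is simply one with an empty or missing `perHost` map, so both approaches fit in a single option.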

azu commented 5 years ago

curl -I -H 'Host:' -H 'User-Agent: Chrome' -H 'Accept:' -v https://tools.ietf.org/html/rfc6749

tools.ietf.org returns 400 if Host is empty. The User-Agent is not related to that 400.
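If a client sets the Host header explicitly, its value can be derived from the URL being requested. The sketch below shows one way to do that with the standard `URL` class; `hostHeaderFor` is a hypothetical helper name, not something taken from the rule's source.

```javascript
// Returns the Host header value for a URL: the hostname, plus the port
// when it is not the default for the scheme (the URL `host` property).
function hostHeaderFor(uri) {
  return new URL(uri).host;
}

// Lists the Host value that would be sent for each URL in a chain of
// requests. `hops` is a plain array of absolute URLs.
function hostHeadersForHops(hops) {
  return hops.map(hostHeaderFor);
}
```

Because the value depends on the URL, it would need to be recomputed whenever a request is redirected to a different host rather than carried over from the first request.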

amimas commented 5 years ago

@azu - Is it possible to release the latest fix? I can try it out and see if it fixes all of those scenarios discussed above.

azu commented 5 years ago

It should be fixed. Please tell me if you find a broken case.

Thanks

amimas commented 5 years ago

@azu - You haven't released your latest PR #117 yet. If you can release that as version 4.4.3, I can test it out. Or please let me know if there's another way I can test it before you release it.

azu commented 5 years ago

Oh, sorry. https://github.com/textlint-rule/textlint-rule-no-dead-link/releases/tag/4.4.3

amimas commented 5 years ago

@azu - Thanks for releasing that but unfortunately it didn't seem to help. I'm still getting ECONNREFUSED or ECONNRESET errors from the same links as before:

  61:34  error  https://dev.mysql.com/downloads/mysql/ is dead. (request to https://dev.mysql.com/downloads/mysql/ failed, reason: read ECONNRESET)  no-dead-link
   7:22   error  https://tools.ietf.org/html/rfc6749 is dead. (request to https://tools.ietf.org/html/rfc6749 failed, reason: connect ECONNREFUSED 64.170.98.42:443)                  no-dead-link
  16:140  error  https://tools.ietf.org/html/rfc6749#page-10 is dead. (request to https://tools.ietf.org/html/rfc6749#page-10 failed, reason: connect ECONNREFUSED 64.170.98.42:443)  no-dead-link

On top of that, the latest change to the Host value is causing validation of many other external sites to fail with a Hostname/IP does not match certificate's altnames error. Below are a couple of examples:

  53:545  error    http://bugs.sun.com/view_bug.do?bug_id=6570259 is dead. (request to https://bugs.java.com/view_bug.do?bug_id=6570259 failed, reason: Hostname/IP does not match certificate's altnames: Host: bugs.sun.com. is not in the cert's altnames: DNS:bugs.java.com)  no-dead-link
  26:34   error    http://wiki.osgi.org/wiki/Blueprint is dead. (request to https://www.osgi.org/community/wiki/wiki/Blueprint failed, reason: Hostname/IP does not match certificate's altnames: Host: wiki.osgi.org. is not in the cert's altnames: DNS:*.wpengine.com, DNS:wpengine.com)  no-dead-link

In addition, I'm getting a lot of errors reported as maximum redirect reached. Here are some examples:

    7:51  error    http://quartz-scheduler.org/ is dead. (maximum redirect reached at: http://www.quartz-scheduler.org/)  no-dead-link
  128:76   error    http://static.springsource.org/spring/docs/2.0.x/api/org/springframework/scheduling/concurrent/ThreadPoolTaskExecutor.html is dead. (maximum redirect reached at: http://docs.spring.io/spring/docs/2.0.x/api/org/springframework/scheduling/concurrent/ThreadPoolTaskExecutor.html)  no-dead-link

The latest change to Host is definitely not fixing the original issue, and it is causing a new issue with certificate validation. I'm not sure yet why the maximum redirect issues are being reported now. I suggest you revert the changes in the last two releases, as 4.4.1 is still the more stable release.

In the meantime, I think we need to continue to investigate this issue. Please re-open this ticket.

amimas commented 5 years ago

I have been looking into this in more detail. I think the latest release (4.4.3) is trying to do the right thing and the Hostname/IP does not match certificate's altnames error is valid. It is caused by really old links that should be replaced with appropriate valid links.

Unfortunately, I can't yet see why the following errors are appearing even though the ignoreRedirect option is set to true in the rule's configuration:

  503:78   error    http://java.sun.com/javaee/5/docs/api/index.html?javax/persistence/EntityManager.html is dead. (301 Moved Permanently)  no-dead-link
   30:40   error    http://static.springsource.org/spring/docs/3.0.0.RC3/reference/html/ch05s07.html is dead. (maximum redirect reached at: http://docs.spring.io/spring/docs/3.0.0.RC3/reference/html/ch05s07.html)  no-dead-link
   36:273  error    http://www.eaipatterns.com/StoreInLibrary.html is dead. (302 Found)  no-dead-link

All of these links are automatically being redirected to a newer link but the ignoreRedirect seems to not work in these cases.

I will open a separate issue regarding the ECONNREFUSED error.

azu commented 5 years ago

ignoreRedirect**s** ?

amimas commented 5 years ago

You're right, the configuration option is plural (ignoreRedirects). I just double-checked my .textlintrc.json and verified that I am using the correct option name. Those redirect-related errors are still appearing.
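For clarity, the configuration being described would look roughly like this. It is a sketch written as a JavaScript object that mirrors what the JSON file would contain; the variable name `redirectTolerantConfig` is illustrative, and only the ignoreRedirects key comes from this thread.

```javascript
// Sketch of the rule configuration discussed above, with the plural
// option name. The equivalent keys would appear in .textlintrc.json.
const redirectTolerantConfig = {
  rules: {
    "no-dead-link": {
      // Intended effect: redirected links are not reported as dead.
      ignoreRedirects: true,
    },
  },
};
```

A misspelled key such as ignoreRedirect would be a different property, so comparing the key name character by character is a quick first check.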