httprb / http

HTTP (The Gem! a.k.a. http.rb) - a fast Ruby HTTP client with a chainable API, streaming support, and timeouts
MIT License

403 and 404 responses for valid URLs #750

Closed MothOnMars closed 1 year ago

MothOnMars commented 1 year ago

Steps:

Expected Results

Actual Results

Notes

Detailed logs:

> RUBY_VERSION
=> "3.0.6"
> logger = Logger.new(STDOUT)
> http = HTTP.use(logging: {logger: logger})
> http.get('https://www.mhpcc.hpc.mil/').status
I, [2023-05-17T13:35:32.067557 #21189]  INFO -- : > GET https://www.mhpcc.hpc.mil/
D, [2023-05-17T13:35:32.067629 #21189] DEBUG -- : Connection: close
Host: www.mhpcc.hpc.mil
User-Agent: http.rb/5.1.1

I, [2023-05-17T13:35:35.722418 #21189]  INFO -- : < 404 Not Found
D, [2023-05-17T13:35:35.722726 #21189] DEBUG -- : Date: Wed, 17 May 2023 13:35:35 GMT
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains
Set-Cookie: session=expiry=1684331135660082;Max-Age=600;path=/private;httponly;secure;;HttpOnly;secure
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Content-Security-Policy: script-src 'self' 'unsafe-inline' 'unsafe-eval' puka.mhpcc.hpc.mil; object-src 'self'
X-Frame-Options: SAMEORIGIN
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
Content-Length: 438
Content-Type: text/html; charset=UTF-8
Set-Cookie: httponly=expiry=1684331135659793;Max-Age=600;secure;HttpOnly;secure
Set-Cookie: httponly=expiry=1684331135659793;Max-Age=600;secure
Set-Cookie: session=expiry=1684331135660082;Max-Age=600;path=/private;httponly;secure;
Connection: close

<html>

<head>
<title>MHPCC: 404 Error, File Not Found</title>
</head>

<body>

<h1>Sorry, but the page you requested is not located on our server.</h1>

<p>Perhaps you can navigate to the desired content from our <a
href="/">homepage</a>.</p>

<p>If you feel this is an error, please let us know so that we may fix the broken link.
Send your comments to the Webmaster <a href="/comments.php?address=web">here</a>.</p>

</body>
</html>
=> 404
> http.get('https://www.mhpcc.hpc.mil/hardware/index.html').status
I, [2023-05-17T13:43:35.358813 #21189]  INFO -- : > GET https://www.mhpcc.hpc.mil/hardware/index.html
D, [2023-05-17T13:43:35.358880 #21189] DEBUG -- : Connection: close
Host: www.mhpcc.hpc.mil
User-Agent: http.rb/5.1.1

I, [2023-05-17T13:43:36.191372 #21189]  INFO -- : < 403 Forbidden
D, [2023-05-17T13:43:36.191704 #21189] DEBUG -- : Date: Wed, 17 May 2023 13:43:36 GMT
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains
Set-Cookie: session=expiry=1684331616129150;Max-Age=600;path=/private;httponly;secure;;HttpOnly;secure
Cache-Control: no-cache, private
Set-Cookie: session=expiry=1684331616129150;Max-Age=600;path=/private;httponly;secure;
Content-Length: 221
Connection: close
Content-Type: text/html; charset=iso-8859-1

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>403 Forbidden</title>
</head><body>
<h1>Forbidden</h1>
<p>You don't have permission to access /hardware/index.html
on this server.</p>
</body></html>

=> 403

Possibly related to https://github.com/httprb/http/issues/612.

tarcieri commented 1 year ago

FWIW I get an SSL verification error from http.rb. It loads in Chrome, though.

403 in particular is pretty strange, since that's a server-side access control error. Is it possible the server is introspecting the request headers?

MothOnMars commented 1 year ago

Anything is possible, as that is a US military domain.

I also get the SSL error on a different machine for both httprb and URI, even after updating my certs:

> HTTP.get('https://www.mhpcc.hpc.mil/').status
OpenSSL::SSL::SSLError: SSL_connect returned=1 errno=0 state=error: certificate verify failed (unable to get local issuer certificate)
from /Users/marthacthompson/.rvm/gems/ruby-3.0.6@searchgov-rails42/gems/http-5.1.1/lib/http/timeout/null.rb:27:in `connect'

Could a cert issue on my original test machine result in 404/403 responses? I can't figure out why httprb and URI would get different statuses on the same machine. FWIW, curling from that machine also succeeds:

$ curl -I https://www.mhpcc.hpc.mil/
HTTP/1.1 200 OK
Date: Wed, 17 May 2023 14:13:15 GMT
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains
Set-Cookie: session=expiry=1684333395524847;Max-Age=600;path=/private;httponly;secure;;HttpOnly;secure
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Cache-Control: no-cache, private
Content-Security-Policy: script-src 'self' 'unsafe-inline' 'unsafe-eval' puka.mhpcc.hpc.mil; object-src 'self'
X-Frame-Options: SAMEORIGIN
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
Content-Type: text/html; charset=UTF-8
Set-Cookie: httponly=expiry=1684333395524190;Max-Age=600;secure;HttpOnly;secure
Set-Cookie: httponly=expiry=1684333395524190;Max-Age=600;secure
Set-Cookie: session=expiry=1684333395524847;Max-Age=600;path=/private;httponly;secure;

$ curl -I https://www.mhpcc.hpc.mil/hardware/index.html
HTTP/1.1 200 OK
Date: Wed, 17 May 2023 14:13:40 GMT
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains
Set-Cookie: session=expiry=1684333420678228;Max-Age=600;path=/private;httponly;secure;;HttpOnly;secure
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Cache-Control: no-cache, private
Content-Security-Policy: script-src 'self' 'unsafe-inline' 'unsafe-eval' puka.mhpcc.hpc.mil; object-src 'self'
X-Frame-Options: SAMEORIGIN
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
Content-Type: text/html; charset=UTF-8
Set-Cookie: httponly=expiry=1684333420677668;Max-Age=600;secure;HttpOnly;secure
Set-Cookie: httponly=expiry=1684333420677668;Max-Age=600;secure
Set-Cookie: session=expiry=1684333420678228;Max-Age=600;path=/private;httponly;secure;

tarcieri commented 1 year ago

If it were just the SSL error you wouldn't get any status code at all. Is your other machine a Mac by any chance? That's what I was testing on.
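
The point above can be checked locally without touching the real server. The following is a minimal sketch, assuming Ruby's standard-library Net::HTTP as a stand-in for http.rb and a throwaway self-signed certificate on 127.0.0.1; the helper names `untrusted_server_context` and `tls_request_outcome` are hypothetical. The certificate failure here is not the same one reported in this thread (an untrusted self-signed certificate rather than a missing local issuer), but it raises the same exception class, and it is raised during the handshake, before any HTTP status line exists.

```ruby
require "net/http"
require "openssl"
require "socket"

# TLS context with a throwaway self-signed certificate that no trust store contains.
def untrusted_server_context
  key = OpenSSL::PKey::RSA.new(2048)
  cert = OpenSSL::X509::Certificate.new
  cert.version = 2
  cert.serial = 1
  cert.subject = OpenSSL::X509::Name.parse("/CN=localhost")
  cert.issuer = cert.subject
  cert.public_key = key.public_key
  cert.not_before = Time.now - 60
  cert.not_after = Time.now + 3600
  cert.sign(key, OpenSSL::Digest.new("SHA256"))

  context = OpenSSL::SSL::SSLContext.new
  context.cert = cert
  context.key = key
  context
end

# Returns a description of what the client saw: a status code or an exception.
def tls_request_outcome
  tcp_server = TCPServer.new("127.0.0.1", 0)
  tls_server = OpenSSL::SSL::SSLServer.new(tcp_server, untrusted_server_context)

  # The server-side handshake is expected to fail once the client rejects the cert.
  server_thread = Thread.new do
    tls_server.accept.close
  rescue StandardError
    nil
  end

  client = Net::HTTP.new("127.0.0.1", tcp_server.addr[1], nil) # nil: ignore proxy settings
  client.use_ssl = true # peer verification stays at its default (enabled)

  begin
    "status #{client.get('/').code}"
  rescue OpenSSL::SSL::SSLError => e
    "no status (#{e.class})"
  ensure
    server_thread.join
    tcp_server.close
  end
end

puts tls_request_outcome
```

A 403 or 404, by contrast, can only come back after the TLS handshake has succeeded and the server has read the request.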

MothOnMars commented 1 year ago

Thanks, that's what I figured.

The other machine that is returning the 4xx responses is Ubuntu Linux.

ixti commented 1 year ago

If URI.open works and http.get does not, the server is most likely reacting to some request headers. Some that come to mind:

Try using HTTP.use(:auto_inflate).get(...)
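
One way to see which headers differ is to compare what each client sends by default. The sketch below builds, but never sends, the request object that Net::HTTP (which URI.open uses under the hood) would issue, and lists its default header names next to the three header names visible in the http.rb 5.1.1 debug log earlier in this thread. The helper `net_http_default_header_names` and the constant `HTTPRB_LOGGED_HEADER_NAMES` are hypothetical names for illustration, and the comparison only covers headers set when the request object is constructed.

```ruby
require "net/http"
require "uri"

# Header names Net::HTTP sets on a GET request before anything is sent.
def net_http_default_header_names(url)
  Net::HTTP::Get.new(URI(url)).to_hash.keys.map(&:downcase).sort
end

# Header names copied from the http.rb 5.1.1 debug log in this thread.
HTTPRB_LOGGED_HEADER_NAMES = %w[connection host user-agent].freeze

net_http_names = net_http_default_header_names("https://www.mhpcc.hpc.mil/")

# Headers Net::HTTP would send that the logged http.rb request did not.
only_in_net_http = net_http_names - HTTPRB_LOGGED_HEADER_NAMES

puts "Net::HTTP defaults:  #{net_http_names.join(', ')}"
puts "Absent from http.rb: #{only_in_net_http.join(', ')}"
```

Any header that shows up only on the Net::HTTP side is a candidate for what the server is checking.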

MothOnMars commented 1 year ago

Thanks, but the result is the same:

> HTTP.use(:auto_inflate).get('https://www.mhpcc.hpc.mil/').status
=> 404
> HTTP.use(:auto_inflate).get('https://www.mhpcc.hpc.mil/hardware/index.html').status
=> 403

ixti commented 1 year ago

As I said earlier, it reacts to some headers. From some quick poking in Firefox, I was able to make it fail with a 404 by removing the Accept header. So I would assume that adding that header should help:

HTTP
  .headers(accept: "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8")
  .use(:auto_inflate)
  .get('https://www.mhpcc.hpc.mil/')
  .status
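
The behaviour described above can be reproduced offline with a toy server. This is a minimal sketch, not a model of the real site: the local server below is a hypothetical stand-in that answers 404 when a request has no Accept header and 200 otherwise, and the client is standard-library Net::HTTP rather than http.rb so the example needs no gems. The helper names `start_accept_sensitive_server` and `status_for` are made up for illustration.

```ruby
require "net/http"
require "socket"

# Hypothetical stand-in for the remote server: replies 404 when the request
# carries no Accept header and 200 otherwise. The real server's rules are unknown.
def start_accept_sensitive_server(requests:)
  server = TCPServer.new("127.0.0.1", 0)
  port = server.addr[1]
  thread = Thread.new do
    requests.times do
      client = server.accept
      head = []
      # Collect lines until the blank line that ends the request head.
      while (line = client.gets) && line != "\r\n"
        head << line
      end
      has_accept = head.any? { |l| l.downcase.start_with?("accept:") }
      status = has_accept ? "200 OK" : "404 Not Found"
      client.write("HTTP/1.1 #{status}\r\nContent-Length: 0\r\nConnection: close\r\n\r\n")
      client.close
    end
    server.close
  end
  [port, thread]
end

# Send one GET to the local server, optionally stripping Net::HTTP's default Accept.
def status_for(port, send_accept:)
  request = Net::HTTP::Get.new("/")
  request.delete("Accept") unless send_accept
  Net::HTTP.new("127.0.0.1", port, nil).request(request).code # nil: ignore proxy settings
end

port, server_thread = start_accept_sensitive_server(requests: 2)
with_accept = status_for(port, send_accept: true)
without_accept = status_for(port, send_accept: false)
server_thread.join

puts "with Accept:    #{with_accept}"
puts "without Accept: #{without_accept}"
```

Against this toy server the same URL yields 200 or 404 depending only on whether Accept is present, which matches the pattern seen between curl or URI.open and the default http.rb request.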