gjtorikian / html-proofer

Test your rendered HTML files to make sure they're accurate.
MIT License

HTTP 302 (to the same URL?) reported as failures #803

Closed: stevecheckoway closed this issue 1 year ago

stevecheckoway commented 1 year ago

The following document, which links to a recent CNN tweet, shows the problem.

Here's the input file.

<!DOCTYPE html>
<a href='https://twitter.com/CNN/status/1688986037488398337'>X</a>

Here's the output.

$ htmlproofer /tmp/a.html
Running 3 checks (Images, Links, Scripts) in /tmp/a.html on *.html files ...

Checking 1 external link
Checking 0 internal links
Checking internal link hashes in 0 files
Ran on 1 file!

For the Links > External check, the following failures were found:

* At /tmp/a.html:2:

  External link https://twitter.com/CNN/status/1688986037488398337 failed (status code 302)

HTML-Proofer found 1 failure!

Here's the curl output.

$  curl -i https://twitter.com/CNN/status/1688986037488398337
HTTP/2 302
date: Tue, 08 Aug 2023 19:01:40 GMT
perf: 7626143928
vary: Accept
server: tsa_p
location: /CNN/status/1688986037488398337
set-cookie: guest_id_marketing=v1%3A169152130039495206; Max-Age=63072000; Expires=Thu, 07 Aug 2025 19:01:40 GMT; Path=/; Domain=.twitter.com; Secure; SameSite=None
set-cookie: guest_id_ads=v1%3A169152130039495206; Max-Age=63072000; Expires=Thu, 07 Aug 2025 19:01:40 GMT; Path=/; Domain=.twitter.com; Secure; SameSite=None
set-cookie: personalization_id="v1_EkXMRmMQFQuZSli6TwF04A=="; Max-Age=63072000; Expires=Thu, 07 Aug 2025 19:01:40 GMT; Path=/; Domain=.twitter.com; Secure; SameSite=None
set-cookie: guest_id=v1%3A169152130039495206; Max-Age=63072000; Expires=Thu, 07 Aug 2025 19:01:40 GMT; Path=/; Domain=.twitter.com; Secure; SameSite=None
content-type: text/plain; charset=utf-8
x-powered-by: Express
cache-control: no-cache, no-store, max-age=0
content-length: 53
x-transaction-id: b969b434337adbe8
strict-transport-security: max-age=631138519
x-response-time: 14
x-connection-hash: 975bcbcbef786c98f324095488fd265dc5d224dec81a059544a8563b0b9c334f

Found. Redirecting to /CNN/status/1688986037488398337

If I had to guess, I'd say it's redirecting to the same location but setting cookies and the website is probably checking if the cookies are set. Indeed, with a little testing, this seems to be exactly what's happening.

If I configure curl to follow redirects (via -L), I get an infinite loop. If I tell curl to use a cookie jar and follow the redirects, it succeeds.

$ curl -iL -b cookiejar -c cookiejar https://twitter.com/CNN/status/1688986037488398337
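
The difference between the two curl runs can be modeled offline. The following is only a minimal sketch, assuming the server behaves as guessed above; `fake_server` and `follow` are hypothetical names, and treating `guest_id` as the cookie the server checks is an assumption (the real check may involve any of the cookies in the response).

```ruby
# Simulated server (an assumption, not Twitter's real logic): it redirects
# to the same path and sets a cookie until the request sends that cookie back.
def fake_server(path, cookies)
  if cookies.key?('guest_id')
    { status: 200, headers: {} }
  else
    { status: 302,
      headers: { 'location' => path, 'set-cookie' => { 'guest_id' => 'v1' } } }
  end
end

# Follow redirects up to max_redirects, optionally keeping cookies in a jar.
# Returns the final status, or :too_many_redirects if the limit is reached.
def follow(path, use_cookie_jar:, max_redirects: 5)
  jar = {}
  (max_redirects + 1).times do
    response = fake_server(path, use_cookie_jar ? jar : {})
    return response[:status] unless response[:status] == 302

    # Store any cookies from the redirect so the next request can send them.
    jar.merge!(response[:headers].fetch('set-cookie', {})) if use_cookie_jar
    path = response[:headers]['location']
  end
  :too_many_redirects
end

p follow('/CNN/status/1688986037488398337', use_cookie_jar: false) # => :too_many_redirects
p follow('/CNN/status/1688986037488398337', use_cookie_jar: true)  # => 200
```

Without the jar every request looks like a first visit, so the redirect never resolves; with the jar the second request carries the cookie and gets a 200.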

There seem to be two approaches to dealing with this:

  1. If you get a 302 whose Location header points back to the requested URL, treat the page as existing (although that won't work with hashes).
  2. Configure the HTTP client to use a cookie jar. Since a site is likely to have multiple links to the same pages, it seems reasonable to use the same cookie jar for all requests.
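
The first approach could look roughly like the sketch below. This is not html-proofer's code; `self_redirect?` is a hypothetical helper, and ignoring fragments when comparing is an assumption made here, which is also why this approach cannot validate hashes.

```ruby
require 'uri'

REDIRECT_STATUSES = [301, 302, 303, 307, 308].freeze

# Returns true when a redirect response points back at the URL that was
# requested. A relative Location header is resolved against the request URL.
def self_redirect?(request_url, status, location)
  return false unless REDIRECT_STATUSES.include?(status)
  return false if location.nil? || location.empty?

  original = URI.parse(request_url)
  target = URI.join(request_url, location)
  # Fragments are never sent to the server, so drop them before comparing.
  original.fragment = nil
  target.fragment = nil
  original.normalize == target.normalize
rescue URI::InvalidURIError
  false
end

url = 'https://twitter.com/CNN/status/1688986037488398337'
p self_redirect?(url, 302, '/CNN/status/1688986037488398337') # => true
p self_redirect?(url, 302, '/i/flow/login')                   # => false
```

A checker could treat a `true` result as "page exists" instead of reporting the 302 as a failure.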

stevecheckoway commented 1 year ago

I'm going to close this issue because I figured out that I, as the user, can already configure htmlproofer to use a cookie jar.

From the command line, it is

$ htmlproofer --typhoeus '{ "followlocation": true, "cookiefile": "cookiejar.txt", "cookiejar": "cookiejar.txt" }' /tmp/a.html

From Ruby, the configuration is something like

{
  typhoeus: {
    followlocation: true,
    cookiefile: 'cookiejar.txt',
    cookiejar: 'cookiejar.txt'
  }
}
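
To show how the Ruby hash and the command-line form relate, here is a small sketch that builds the CLI invocation from the same options. `PROOFER_OPTIONS` and `htmlproofer_command` are hypothetical names; the commented-out `HTMLProofer.check_file` call assumes the html-proofer gem is installed.

```ruby
require 'json'
require 'shellwords'

# The same options as above, as they would be passed from Ruby.
PROOFER_OPTIONS = {
  typhoeus: {
    followlocation: true,
    cookiefile: 'cookiejar.txt',
    cookiejar: 'cookiejar.txt'
  }
}

# Build the equivalent shell-escaped htmlproofer command line: the
# :typhoeus hash is serialized to JSON for the --typhoeus flag.
def htmlproofer_command(path, options)
  ['htmlproofer', '--typhoeus', JSON.generate(options.fetch(:typhoeus)), path].shelljoin
end

puts htmlproofer_command('/tmp/a.html', PROOFER_OPTIONS)

# With the gem installed, the hash can be passed directly instead:
#   require 'html-proofer'
#   HTMLProofer.check_file('/tmp/a.html', PROOFER_OPTIONS).run
```

Pointing `cookiefile` and `cookiejar` at the same file means cookies are read from and written back to one place, which matches the curl `-b cookiejar -c cookiejar` invocation.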

It may be worth adding this information to the configuration section of the README.

gjtorikian commented 1 year ago

> It may be worth adding this information to the configuration section of the README.

PRs accepted. 😀