spatie / http-status-check

CLI tool to crawl a website and check HTTP status codes
https://freek.dev/308-building-a-crawler-in-php
MIT License

Not picking up some 404's #62

Closed bobemoe closed 4 years ago

bobemoe commented 4 years ago

I can't figure this one out! I noticed the results were missing some known 404's, so I crawled one directly. I thought it was because it's behind a 301 that it isn't picked up:

$ ~/vendor/bin/http-status-check scan https://hudevad.com/hudevad-1890

Start scanning https://hudevad.com/hudevad-1890

[2020-02-17 11:59:55] 301 Moved Permanently - https://hudevad.com/hudevad-1890

Crawling summary
----------------
Crawled 1 url(s) with statuscode 301

However when I tried to recreate this on another domain it seems to work as expected:

$ ~/vendor/bin/http-status-check scan https://jhodges.co.uk/test301-to-404.html

Start scanning https://jhodges.co.uk/test301-to-404.html

[2020-02-17 11:59:44] 301 Moved Permanently - https://jhodges.co.uk/test301-to-404.html
[2020-02-17 11:59:44] 404: Not Found - https://jhodges.co.uk/test404.html (found on https://jhodges.co.uk/test301-to-404.html)

Crawling summary
----------------
Crawled 1 url(s) with statuscode 301
Crawled 1 url(s) with statuscode 404

I checked that the server is returning all headers correctly; curl follows it just fine.

bobemoe commented 4 years ago

What's the verdict on this?

The test in #66 against the localhost test server shows the issue:

./http-status-check scan http://localhost:8080/redirectToNotFound

Start scanning http://localhost:8080/redirectToNotFound

[2020-02-18 21:52:28] 302 Found - http://localhost:8080/redirectToNotFound

Crawling summary
----------------
Crawled 1 url(s) with statuscode 302

That should end in a 404.

So assuming that's expected behaviour, why isn't it happening against my jhodges.co.uk test above? That's what I'd really expect to see: a 302 and a 404, not just a 302.

Could it be Apache vs. nginx/Node Express?

Before I found http-status-check I was using spatie/crawler and writing my own CrawlObserver, very similar to this one. To get around the issue I passed RequestOptions::ALLOW_REDIRECTS => ['track_redirects' => true] to the crawler's constructor.

That showed the link as a 404, but against the old URL rather than the final redirect address. Also, 302's just showed up as 200's. Somewhat useful, but a bit weird.

Inside the observer I pulled the data I wanted from the redirect history headers:

    // Retrieve both redirect history headers
    $redirectUriHistory = $response->getHeader('X-Guzzle-Redirect-History');          // redirect URI history
    $redirectCodeHistory = $response->getHeader('X-Guzzle-Redirect-Status-History');  // redirect HTTP status history
    $fullRedirectReport = [$redirectUriHistory, $redirectCodeHistory];

I managed to get the data I wanted, but it all seemed a bit unnecessarily weird :/
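For what it's worth, pairing those two headers into something readable is simple enough. This is a hypothetical helper of my own for illustration, not anything from spatie/crawler; it just zips the URI history with the status history:

```php
<?php

/**
 * Pair the two Guzzle redirect-history headers into one readable report.
 * Hypothetical helper for illustration only; the two arrays come from
 * $response->getHeader() as in the snippet above.
 *
 * @param string[] $uris  X-Guzzle-Redirect-History values (one per hop)
 * @param string[] $codes X-Guzzle-Redirect-Status-History values (one per hop)
 * @return string[] lines like "301 -> https://example.com/next"
 */
function redirectReport(array $uris, array $codes): array
{
    $report = [];
    foreach ($uris as $i => $uri) {
        $report[] = ($codes[$i] ?? '?') . ' -> ' . $uri;
    }
    return $report;
}

// Example with made-up hop data:
print_r(redirectReport(
    ['https://jhodges.co.uk/test404.html'],
    ['301']
));
```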

Any thoughts?

bobemoe commented 4 years ago

Getting closer. It's because there is actually a link in the 301/302 body that the crawler is following.

$ curl https://jhodges.co.uk/test301-to-404.html
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="https://jhodges.co.uk/test404.html">here</a>.</p>
</body></html>

vs

$ curl https://hudevad.com/hudevad-1890
<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>
<hr><center>nginx/1.12.2</center>
</body>
</html>

vs

$ curl http://localhost:8080/redirectToNotFound
Found. Redirecting to /notExists2
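So the difference is in the bodies: the crawler isn't following the Location header, it's parsing the response body for links. Apache's 301 page happens to contain an a href to the redirect target, while the nginx and Express bodies contain no links at all, so the crawl dead-ends there. A minimal sketch of that kind of body parsing (plain PHP DOMDocument, not spatie/crawler's actual link extractor):

```php
<?php

// Extract href attributes from an HTML body, roughly what a crawler's
// link extractor does. Illustration only, not the crawler's real code.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings for sloppy markup
    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $links[] = $a->getAttribute('href');
        }
    }
    return $links;
}

// Apache-style redirect body: contains a link, so the crawler finds the target.
$apacheBody = '<p>The document has moved <a href="https://jhodges.co.uk/test404.html">here</a>.</p>';
print_r(extractLinks($apacheBody)); // one href found

// nginx-style redirect body: no links, so the crawl stops here.
$nginxBody = '<center><h1>301 Moved Permanently</h1></center>';
print_r(extractLinks($nginxBody)); // nothing found
```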

bobemoe commented 4 years ago

So it's not just 404's that the issue affects: entire parts of the site may go uncrawled if they're only linked to via something that redirects... Here's a test on /link4, which should redirect to /link1 and expose example.com:

$ ./http-status-check scan http://localhost:8080/link4

Start scanning http://localhost:8080/link4

[2020-02-19 10:27:19] 302 Found - http://localhost:8080/link4

Crawling summary
----------------
Crawled 1 url(s) with statuscode 302

It wasn't followed into the rest of the site.

If I enable RequestOptions::ALLOW_REDIRECTS I get:

$ ./http-status-check scan http://localhost:8080/link4

Start scanning http://localhost:8080/link4

[2020-02-19 10:30:26] 200 OK - http://localhost:8080/link4
[2020-02-19 10:30:26] 200 OK - http://example.com/

Crawling summary
----------------
Crawled 2 url(s) with statuscode 200

All pages are found now, but we lose the info about there being redirects, and /link1 isn't shown as a valid page.

The output I would expect to see is:

302 Found - http://localhost:8080/link4
200 OK - http://localhost:8080/link1
200 OK - http://example.com/
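In other words: log every hop of the redirect chain plus the final page, deduplicated against URLs already crawled. A sketch of that logic in plain PHP (made-up data shape, not the crawler's real API):

```php
<?php

/**
 * Sketch of the expected logging: one line per redirect hop plus the
 * final response, skipping URLs already seen. Illustration only.
 *
 * @param array $hops [[statusCode, url], ...] in chain order
 * @param array $seen set of already-logged URLs, updated in place
 * @return string[] new log lines
 */
function logChain(array $hops, array &$seen): array
{
    $lines = [];
    foreach ($hops as [$status, $url]) {
        if (isset($seen[$url])) {
            continue; // already crawled via another path
        }
        $seen[$url] = true;
        $lines[] = "$status - $url";
    }
    return $lines;
}

$seen = [];
print_r(logChain([
    [302, 'http://localhost:8080/link4'],
    [200, 'http://localhost:8080/link1'],
], $seen));

// A later chain that re-visits /link1 logs nothing new for it:
print_r(logChain([[200, 'http://localhost:8080/link1']], $seen));
```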

freekmurze commented 4 years ago

Those PRs you created are marked as WIP. Remove that prefix and let me know when they're ready, and I'll review them.

bobemoe commented 4 years ago

Yeah, I'm having problems, haha... To achieve the results I'm expecting I'm having to recode some of the core logic quite heavily. I'm following https://github.com/guzzle/guzzle/blob/master/docs/faq.rst#how-can-i-track-redirected-requests and adding the history to the crawl log, but that risks adding duplicates unless I pass them through the queue.

I'm not sure where it'll go yet, so I'm working on my own link report generator. Once that's working I'll look at porting it here; some of it may be beyond the scope of this project, and other parts may belong in spatie/crawler itself.

bobemoe commented 4 years ago

@freekmurze Right, I've managed to get it working per my requirements in my own project https://github.com/bobemoe/sitemap. I've documented the changes there and written lots of tests. Maybe you'd like to take a look and let me know how much of the functionality you'd like me to port here.

bobemoe commented 4 years ago

@freekmurze the PR to fix this is ready. Please review :)

freekmurze commented 4 years ago

We'll continue the conversation in the review.