Closed bobemoe closed 4 years ago
What's the verdict on this?
The test in #66 against the localhost test server shows the issue:
./http-status-check scan http://localhost:8080/redirectToNotFound
Start scanning http://localhost:8080/redirectToNotFound
[2020-02-18 21:52:28] 302 Found - http://localhost:8080/redirectToNotFound
Crawling summary
----------------
Crawled 1 url(s) with statuscode 302
That should 404 in the end.
So assuming thats expected behaviour, then why isn't it happening against my jhodges.co.uk test above. That's what I'd really expect to see, a 302 and a 404. Not just a 302.
Apache instead of nginx/node.express?
Before I found http-status-check I was just using spatie/crawler and coding my own CrawlObserver, very similar to this, and to get round the issue I added RequestOptions::ALLOW_REDIRECTS => track_redirects => true
to the constructor of the crawler.
This now showed the link as a 404 but still had the old URL and not the final redirect address. Also 302's just showed as 200's. This was somewhat useful but a bit weird.
Inside the observer I updated the URL from what I wanted from:
// Retrieve both Redirect History headers
$redirectUriHistory = $response->getHeader('X-Guzzle-Redirect-History'); // retrieve Redirect URI history
$redirectCodeHistory = $response->getHeader('X-Guzzle-Redirect-Status-History'); // retrieve Redirect HTTP Status history
$fullRedirectReport=[$redirectUriHistory,$redirectCodeHistory];
I managed to get the data I wanted but it all seemed a bit unnecessary weird :/
Any thoughts?
Getting closer. Its because there is actually a link on the 301/302 body that the crawler is following.
$ curl https://jhodges.co.uk/test301-to-404.html
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="https://jhodges.co.uk/test404.html">here</a>.</p>
</body></html>
vs
$ curl https://hudevad.com/hudevad-1890
<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>
<hr><center>nginx/1.12.2</center>
</body>
</html>
vs
$ curl http://localhost:8080/redirectToNotFound
Found. Redirecting to /notExists2
So its not just 404's that the issue is with, entire parts of the site may not be crawled if its only linked to via something redirecting... test on /link4 which should redirect to /link1 and expose example.com
$ ./http-status-check scan http://localhost:8080/link4
Start scanning http://localhost:8080/link4
[2020-02-19 10:27:19] 302 Found - http://localhost:8080/link4
Crawling summary
----------------
Crawled 1 url(s) with statuscode 302
It wasn't followed into the rest of the site.
If I enable RequestOptions::ALLOW_REDIRECTS
I get:
$ ./http-status-check scan http://localhost:8080/link4
Start scanning http://localhost:8080/link4
[2020-02-19 10:30:26] 200 OK - http://localhost:8080/link4
[2020-02-19 10:30:26] 200 OK - http://example.com/
Crawling summary
----------------
Crawled 2 url(s) with statuscode 200
All pages are found now, but we loose the info about there being redirects, and /link1 isn't shown as a valid page.
The output I would expect to see is:
302 Found - http://localhost:8080/link4
200 OK - http://localhost:8080/link1
200 OK - http://example.com/
Those PRs you created are marked as WIP
remove that prefix and let me know, when those are ready and I'll review them.
Yeah, I'm having problems haha... to achieve the results I am expecting I'm having to quite heavily recode some of the core logic. I'm following this https://github.com/guzzle/guzzle/blob/master/docs/faq.rst#how-can-i-track-redirected-requests and adding the history to the crawl log, but that has potential to add duplicates unless I pass them through the queue.
I'm not sure where it'll go yet, so I'm working on my own link report generator, and once that's working I will look at porting it here, some of it may be beyond the scope of this project and other may belong in spatie/crawler itself.
@freekmurze right, I've managed to get it working as per my requirements in my own project https://github.com/bobemoe/sitemap I've documented the changes there and made lots of tests. Maybe you'd like to take a look and let me know how much of the functionality you'd like me to port here.
@freekmurze the PR to fix this is ready. Please review :)
We'll continue the conversation in the review.
I cant figure this one out! I noticed the results were missing some known 404's so crawled one directly. I thought it was because its behind a 301 it is not picked up:
However when I tried to recreate this on another domain it seems to work as expected:
I checked the server is returning all headers correctly, curl follows it just fine.