Closed JimKillock closed 5 years ago
@dantheta Is this something we can address ourselves, or do they just need to whitelist us in their robots.txt?
It could be either: if they tell us we can ignore their robots.txt, we could perhaps make an exception in the software rules; otherwise they would whitelist us. It would be handy to know what the name of our user agent is.
We can override it in the database manually - we probably don't want to start writing exceptions into the robots.txt checker, since the system is supposed to follow the policy.
The URL record for web.archive.org in the system may revert to disallowed-by-robots-txt in the future, when the robots.txt is rechecked at a later date (unless the upstream robots.txt has been changed to allow us).
Our current probe useragent is:
OrgProbe/2.0.0 (+http://www.blocked.org.uk)
News article - https://boingboing.net/2017/04/22/internet-archive-to-ignore-rob.html
based on this blogpost setting out the Archive's position - https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/
Seems like a precedent for asking them to make an exception for a public-good project like Blocked.org.uk?
Why aren't we treating the successful retrieval of /robots.txt as an "OK"?
Why aren't we treating the successful retrieval of /robots.txt as an "OK"?
The server providing the robots.txt file doesn't necessarily do its own comparison of the client's user-agent string and refuse to serve the resource; the robot is expected to download, parse, process and apply the contained policy to itself.
We're retrieving robots.txt, parsing the content and comparing it against our useragent. It contains:
User-agent: *
Disallow: /
If robots.txt was missing, empty, or didn't list our user-agent, that would be an OK.
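For what it's worth, the comparison described above can be reproduced with Python's standard urllib.robotparser (a sketch, not the actual checker code; "OrgProbe" is the useragent token from earlier in the thread):

```python
from urllib.robotparser import RobotFileParser

UA = "OrgProbe"  # our probe's useragent token, per the thread

# Case 1: web.archive.org's current policy disallows everything.
blocked = RobotFileParser()
blocked.parse(["User-agent: *", "Disallow: /"])
print(blocked.can_fetch(UA, "https://web.archive.org/"))      # False

# Case 2: a missing or empty robots.txt permits everything.
empty = RobotFileParser()
empty.parse([])
print(empty.can_fetch(UA, "https://web.archive.org/"))        # True

# Case 3: the whitelist option - an explicit exception for our agent.
whitelisted = RobotFileParser()
whitelisted.parse([
    "User-agent: OrgProbe",
    "Allow: /",
    "User-agent: *",
    "Disallow: /",
])
print(whitelisted.can_fetch(UA, "https://web.archive.org/"))  # True
```

Case 3 is what asking them to whitelist us would look like: a specific User-agent record for OrgProbe takes precedence over the catch-all rule.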
I think @gwire means that logically the site is accessible and unblocked if /robots.txt is found.
The answer, I think, is that we considered it could be perceived as rude, or a breach of policy, to request /robots.txt across each network and on multiple regular occasions. Perhaps this is wrong, though?
I mean, from the point of view of a successful test of whether the site is blocked or not - surely a retrieval of /robots.txt means we can say the site is available?
On 2019-02-18 13:00, gwire wrote:
I mean, from the point of view of a successful test of whether the site is blocked or not - surely a retrieval of /robots.txt means we can say the site is available?
Sorry - I see what you mean.
Jim's comment is correct - we check robots.txt from the server just once, before sending the test URL to the probes.
The answer, I think, is that we considered it could be perceived as rude, or a breach of policy, to request /robots.txt across each network and on multiple regular occasions. Perhaps this is wrong, though?
I'd be surprised if they noticed it in their logs, given the sheer traffic they must deal with. Maybe write a short blogpost, make them aware of it, and then carry on? Ask for forgiveness if they raise it?
Archive.org have confirmed by email that they are ok with us ignoring their robots.txt - @dantheta up to you how you approach this at a technical level.
It's already doable on an individual site basis - the URL admin screen can be used to change the URL status from "disallowed-by-robots-txt" to "ok", which will allow checks to run.
web.archive.org is testable. It's very likely to stay that way for several months (we cache negative robots.txt results for quite a while).
I've set up an extra tag on the URL, "override-robots-txt", to allow us to later write a check script for URLs which have had their robots.txt setting overridden, to make sure the override is still in place.
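A minimal sketch of what such a check script might look for, assuming hypothetical record fields "status" and "tags" (the real database schema isn't shown in this thread):

```python
# Hypothetical URL records; actual field names in the Blocked.org.uk DB may differ.
urls = [
    {"url": "http://web.archive.org", "status": "ok",
     "tags": ["override-robots-txt"]},
    {"url": "http://example.com", "status": "disallowed-by-robots-txt",
     "tags": []},
    {"url": "http://web.archive.org/web/", "status": "disallowed-by-robots-txt",
     "tags": ["override-robots-txt"]},
]

# Find tagged URLs whose status has reverted after a later robots.txt recheck;
# these would need their status reset to "ok" again.
reverted = [r["url"] for r in urls
            if "override-robots-txt" in r["tags"]
            and r["status"] == "disallowed-by-robots-txt"]
print(reverted)  # only the third record has reverted
```

The tag lets the script distinguish deliberate overrides from URLs that are legitimately disallowed, so only the former get reset.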
Thanks @dantheta - I just gave that a go. However, the page still won't test when I do a 'force check'.
Oh carp, it needs to be overridden on every single page URL in the system ... Thinking cap on!
web.archive.org is now allowed through the robots checker. I'll need to find and update the URLs that have already been added. This is a bit slow but will run overnight. In the meantime new URLs should be able to go through.
They're updated.
I've tried the force check url and requested a check via the front end but it is still claiming robots.txt is preventing us from making checks.
Which URL did you request? Only web.archive.org has been overridden.
OK, I reset the status (again), then force checked, and it has refreshed. However, there is a very old request for an unblock; I am not sure if this is now being sent as a result of the check or if it is several years old. I think the latter, as I don't see auto-replies from supplier emails to a blocked.org.uk email.
Assuming that is the case, we should probably allow a new request to be made (I think that issue has been raised elsewhere by @alexhaydock).
@JimKillock You are probably thinking of the issue I raised in #186
Old requests should probably be set to time out and then be available to be deleted or re-sent by admins.
The old unblock request was probably this one from November 2017: https://www.blocked.org.uk/control/ispreports/EE/http://archive.org
Yes, that's right. In the meantime I'd like to resend an unblock request via the system if we can.
Is there anything left here?
Is there anything left here?
Nope, I don't think so.
We apparently cannot test this site because of a robots.txt policy:
https://www.blocked.org.uk/site/http://web.archive.org
That's a problem, as the site owners want to make a complaint via our site, I believe. There is a campaign to get it unblocked.
I'm raising this as a ticket, although the solution might be to ask them to change their robots.txt.