openrightsgroup / cmp-issues

Centralised issue-tracking for the Blocked backend

Web archive #194

Closed JimKillock closed 5 years ago

JimKillock commented 5 years ago

We apparently cannot test this site because of its robots.txt policy:

https://www.blocked.org.uk/site/http://web.archive.org

That's a problem, as I believe the site owners want to make a complaint via our site. There is a campaign to get it unblocked.

I'm raising this as a ticket, although the solution might be to ask them to change their robots.txt.

edjw commented 5 years ago

@dantheta Is this something we can address ourselves or do they just need to whitelist us in their robots.txt?

JimKillock commented 5 years ago

It could be either: if they tell us we can ignore their robots.txt, we could perhaps make an exception in the software rules; otherwise they would need to whitelist us. It would be handy to know the name of our user agent.

dantheta commented 5 years ago

We can override it in the database manually - we probably don't want to start writing exceptions into the robots.txt checker, since the system is supposed to follow the policy.

The URL record for web.archive.org in the system may revert to disallowed-by-robots-txt when the robots.txt is rechecked at a later date (unless the upstream robots.txt has been changed to allow us).
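
For concreteness, a manual override of that kind amounts to something like the sketch below. The table and column names are hypothetical, not the real schema; only the "disallowed-by-robots-txt" and "ok" status values come from this thread.

```python
import sqlite3  # stand-in for whatever database the backend actually uses

def override_robots_status(conn, url):
    """Flip one URL from disallowed-by-robots-txt back to ok so checks can run.

    Hypothetical schema: a urls table with url and status columns.
    """
    conn.execute(
        "UPDATE urls SET status = 'ok' "
        "WHERE url = ? AND status = 'disallowed-by-robots-txt'",
        (url,),
    )
    conn.commit()
```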

Our current probe useragent is:

OrgProbe/2.0.0 (+http://www.blocked.org.uk)
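
For illustration, a request identifying itself with that user agent might look like the following minimal sketch (using the Python requests library; fetch_as_probe and the surrounding code are illustrative names, not OrgProbe's actual implementation):

```python
import requests

# The user agent string quoted above; the rest of this snippet is illustrative.
ORG_PROBE_UA = "OrgProbe/2.0.0 (+http://www.blocked.org.uk)"

def fetch_as_probe(url, timeout=10):
    """Fetch a URL while identifying as the Blocked.org.uk probe."""
    return requests.get(url, headers={"User-Agent": ORG_PROBE_UA}, timeout=timeout)
```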

ei8fdb commented 5 years ago

News article - https://boingboing.net/2017/04/22/internet-archive-to-ignore-rob.html

based on this position blogpost by the Archive - https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/

Seems like a precedent for asking them to make an exception for a public-good project like Blocked.org.uk?

gwire commented 5 years ago

Why aren't we treating the successful retrieval of /robots.txt as an "OK"?

dantheta commented 5 years ago

Why aren't we treating the successful retrieval of /robots.txt as an "OK"?

The server providing the robots.txt file doesn't necessarily do its own comparison of the client's user-agent string and refuse to serve the resource; the robot is expected to download, parse, process and apply the contained policy to itself.

We're retrieving robots.txt, parsing the content and comparing it against our useragent. It contains:

User-agent: *
Disallow: /

If robots.txt was missing, empty, or didn't list our user-agent, that would be an OK.
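
As a rough sketch of that logic, Python's standard urllib.robotparser applies the same rules. This is not the backend's actual checker, just an illustration of why a wildcard "Disallow: /" blocks our probe:

```python
from urllib import robotparser

# The policy web.archive.org was serving at the time.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

USER_AGENT = "OrgProbe/2.0.0 (+http://www.blocked.org.uk)"

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The wildcard rule applies to any agent not listed explicitly,
# so our probe is disallowed from fetching anything on the site.
print(parser.can_fetch(USER_AGENT, "http://web.archive.org/"))  # False
```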

JimKillock commented 5 years ago

I think @gwire means that logically the site is accessible and unblocked if /robots.txt is found.

I think the answer is that we considered it could be perceived as rude, or a breach of policy, to request /robots.txt across each network and on multiple regular occasions. Perhaps this is wrong, though?

gwire commented 5 years ago

I mean, from the point of view of a successful test of whether the site is blocked or not - surely a retrieval of /robots.txt means we can say the site is available?

dantheta commented 5 years ago

On 2019-02-18 13:00, gwire wrote:

I mean, from the point of view of a successful test of whether the site is blocked or not - surely a retrieval of /robots.txt means we can say the site is available?

Sorry - I see what you mean.

Jim's comment is correct - we check robots.txt from the server just once, before sending the test URL to the probes.
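
To make that flow concrete, here is a hedged sketch of the single pre-check described above. The send_to_probes and mark_disallowed callables are hypothetical stand-ins for the real backend, not actual API calls:

```python
from urllib import robotparser
from urllib.parse import urlparse

USER_AGENT = "OrgProbe/2.0.0 (+http://www.blocked.org.uk)"

def precheck_and_dispatch(url, send_to_probes, mark_disallowed):
    """Check robots.txt once, then either dispatch the URL to the probes
    or record it as disallowed-by-robots-txt.

    Illustrative sketch: the real backend checks once per URL and caches
    the (negative) result, as described in this thread.
    """
    origin = urlparse(url)
    parser = robotparser.RobotFileParser()
    parser.set_url(f"{origin.scheme}://{origin.netloc}/robots.txt")
    parser.read()  # one request to the origin server, not one per ISP network

    if parser.can_fetch(USER_AGENT, url):
        send_to_probes(url)    # probes on each ISP network test the URL
    else:
        mark_disallowed(url)   # URL status becomes disallowed-by-robots-txt
```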

ei8fdb commented 5 years ago

I think the answer is that we considered it could be perceived as rude, or a breach of policy, to request /robots.txt across each network and on multiple regular occasions. Perhaps this is wrong, though?

I'd be surprised if they noticed it in their logs, given the sheer traffic they must deal with. Maybe write a short blogpost, make them aware of it, and then carry on? Ask for forgiveness if they raise it?

JimKillock commented 5 years ago

Archive.org have confirmed by email that they are OK with us ignoring their robots.txt - @dantheta, it's up to you how you approach this at a technical level.

dantheta commented 5 years ago

It's already doable on an individual site basis - the URL admin screen can be used to change the URL status from "disallowed-by-robots-txt" to "ok", which will allow checks to run.

dantheta commented 5 years ago

web.archive.org is testable. It's very likely to stay that way for several months (we cache negative robots results for quite a while).

I've set up an extra tag on the URL, "override-robots-txt", so that we can later write a check script for URLs that have had their robots-txt setting overridden, to make sure it stays unset.
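
A check script for those tagged URLs might eventually look something like the sketch below. The urls and tags tables and their columns are hypothetical, not the real schema; only the status values and the tag name come from this thread.

```python
import sqlite3  # stand-in for the real database connection

def reassert_robots_overrides(conn):
    """Find URLs tagged override-robots-txt that have reverted to
    disallowed-by-robots-txt (e.g. after a robots.txt recheck) and set
    them back to ok.

    Hypothetical schema: a urls table with url/status columns and a
    tags table linking tag names to URLs.
    """
    cur = conn.cursor()
    cur.execute(
        """
        UPDATE urls
           SET status = 'ok'
         WHERE status = 'disallowed-by-robots-txt'
           AND url IN (SELECT url FROM tags WHERE tag = 'override-robots-txt')
        """
    )
    conn.commit()
    return cur.rowcount  # number of URLs whose override was re-applied
```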

JimKillock commented 5 years ago

Thanks @dantheta - I just gave that a go. However, the page still won't test when I do a 'force check'.

dantheta commented 5 years ago

Oh carp, it needs to be overridden on every single page URL in the system ... Thinking cap on!

dantheta commented 5 years ago

web.archive.org is now allowed through the robots checker. I'll need to find and update the URLs that have already been added. This is a bit slow, but it will run overnight. In the meantime, new URLs should be able to go through.

dantheta commented 5 years ago

They're updated.

JimKillock commented 5 years ago

I've tried the force-check URL and requested a check via the front end, but it is still claiming robots.txt is preventing us from making checks.

dantheta commented 5 years ago

Which URL did you request? Only web.archive.org has been overridden.

JimKillock commented 5 years ago

OK, I reset the status (again), force-checked, and it has refreshed. However, there is a very old request for an unblock; I am not sure if this is now being sent as a result of the check or if it is several years old. I think the latter, as I don't see auto-replies from supplier emails to a blocked.org.uk email.

Assuming that is the case, we should probably allow a new request to be made (I think that issue has been raised elsewhere by @alexhaydock).

alexhaydock commented 5 years ago

@JimKillock You are probably thinking of the issue I raised in #186

Old requests should probably be set to time out and then be available to be deleted or re-sent by admins.

The old unblock request was probably this one from November 2017: https://www.blocked.org.uk/control/ispreports/EE/http://archive.org

JimKillock commented 5 years ago

Yes, that's right. In the meantime, I'd like to resend an unblock request via the system if we can.

edjw commented 5 years ago

Is there anything left here?

dantheta commented 5 years ago

Is there anything left here?

Nope, I don't think so.