```
>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser("https://mastodon.scot/robots.txt")
>>> rp.read()
>>> print(rp.can_fetch("FediFetcher", "https://mastodon.scot/api/v1/accounts/lookup?acct=ionafyfe"))
False
```
Yeah, I don't know why that is coming back false.
Yeah, this is really weird.
OK, the reason for this is the User-Agent that's being used for fetching the file:
```
curl -i https://mastodon.scot/robots.txt -H 'User-Agent: Python-urllib/3.11'

HTTP/2 403
[...]
alt-svc: h3=":443"; ma=86400

error code: 101
```
When setting the User-Agent to `Python-urllib/3.11` we get back a `403`. And a `403` response to a robots.txt request is usually seen as equivalent to

```
User-agent: *
Disallow: /
```
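
For what it's worth, that matches what CPython's `robotparser` does internally: a 401 or 403 while fetching robots.txt sets the parser's `disallow_all` flag, after which `can_fetch()` returns `False` for everything. A minimal sketch (note `disallow_all` is an internal attribute, not formal API):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser("https://mastodon.scot/robots.txt")
rp.read()  # the 403 is swallowed here and recorded as "disallow everything"
print(rp.disallow_all)                                        # True
print(rp.can_fetch("FediFetcher", "https://mastodon.scot/"))  # False, for any URL
```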
I'll see if I can change the User Agent being used by the parser.
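
One low-effort option, since `RobotFileParser.read()` fetches via `urllib.request.urlopen()`, is to install a global opener that sends a different User-Agent. A sketch of that idea (not necessarily what FediFetcher ended up doing; the `FediFetcher` UA string is just illustrative):

```python
import urllib.request
import urllib.robotparser

# Install a process-wide opener so every urllib.request.urlopen() call,
# including the one inside RobotFileParser.read(), sends this User-Agent.
opener = urllib.request.build_opener()
opener.addheaders = [("User-Agent", "FediFetcher")]
urllib.request.install_opener(opener)

rp = urllib.robotparser.RobotFileParser("https://mastodon.scot/robots.txt")
rp.read()  # now fetched with the FediFetcher User-Agent instead of Python-urllib
print(rp.can_fetch("FediFetcher", "https://mastodon.scot/api/v1/accounts/lookup?acct=ionafyfe"))
```

The downside is that this changes the User-Agent for every urllib request in the process, which is why fetching the file yourself (see the sketch further down) can be cleaner.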
That makes sense, I guess. If you're not defining your User-Agent explicitly, you're more likely to be something a server operator would want to block.
https://stackoverflow.com/a/37335469
This does not bode well.
EDIT2
https://stackoverflow.com/a/37935740
Scratch that, it's a little more complicated than the default approach, but nothing too tricky.
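
For reference, a sketch of that workaround: fetch robots.txt yourself with an explicit User-Agent and feed the lines to `RobotFileParser.parse()` instead of calling `read()`:

```python
import urllib.request
import urllib.robotparser

# Fetch robots.txt manually with our own User-Agent, then hand the
# lines to the parser. This sidesteps the default Python-urllib UA.
req = urllib.request.Request(
    "https://mastodon.scot/robots.txt",
    headers={"User-Agent": "FediFetcher"},
)
rp = urllib.robotparser.RobotFileParser()
with urllib.request.urlopen(req) as response:
    rp.parse(response.read().decode("utf-8").splitlines())

print(rp.can_fetch("FediFetcher", "https://mastodon.scot/api/v1/accounts/lookup?acct=ionafyfe"))
```

One caveat: unlike `read()`, this doesn't translate a 401/403 into "disallow all" automatically, so a real implementation would want to catch `urllib.error.HTTPError` and decide what a denial should mean.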
It's absurd that there is no option to change the UA for this, isn't it?!
Anyway, I also found the second option, and this shouldn't be too tricky to implement.
Closing this in favour of #127, which also deals with a change to the UA.
I'm seeing robots.txt denials in my logs. This is expected, though it saddens me that a `User-agent: *` disallow has been implemented for quite so many services.
But I'm not sure it's 100% correct. I have a few denials that potentially shouldn't be flagged.
But the contents of this robots.txt are:
Which, as you can see, don't specifically block the /api/v1/statuses path for the fedifetcher User-agent. This robots.txt seems to be fairly common as well.
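
If it helps to sanity-check the parser itself, here's a quick test against a file of that shape. The contents below are hypothetical, modelled on the common Mastodon-style robots.txt described above, not the actual file from the logs:

```python
import urllib.robotparser

# Hypothetical robots.txt of the kind described above: a User-agent: *
# section that disallows a few specific paths, but not /api/v1/statuses.
robots_txt = """\
User-agent: *
Disallow: /media_proxy/
Disallow: /interact/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# /api/v1/statuses is not in any Disallow rule, so this should be allowed.
print(rp.can_fetch("fedifetcher", "https://example.social/api/v1/statuses/1"))  # True
```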