nanos / FediFetcher

FediFetcher is a tool for Mastodon that automatically fetches missing replies and posts from other fediverse instances, and adds them to your own Mastodon instance.
https://blog.thms.uk/fedifetcher?utm_source=github
MIT License

Robots.txt parsing might not be 100% correct #126

Closed cooperaj closed 1 week ago

cooperaj commented 1 week ago

I'm seeing robots.txt denials in my logs. That's expected, though it saddens me that a User-agent: * block has been implemented by quite so many services.

But I'm not sure it's 100% correct. I have a few denials that potentially shouldn't be flagged.

Error getting context for toot https://social.vivaldi.net/@cmicmuir/112674655146822031. 
Exception: Querying https://social.vivaldi.net/api/v1/statuses/112674655146822031/context prohibited by robots.txt

But the contents of this robots.txt are:

# See http://www.robotstxt.org/robotstxt.html for documentation on how to use the robots.txt file

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /media_proxy/
Disallow: /interact/

Which, as you can see, doesn't specifically block the /api/v1/statuses path for the FediFetcher user agent. This robots.txt seems to be fairly common as well.
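
For what it's worth, feeding those rules straight into Python's urllib.robotparser confirms that they allow this path for any agent other than GPTBot. A quick sanity check:

import urllib.robotparser

# the robots.txt served by social.vivaldi.net, as quoted above
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /media_proxy/
Disallow: /interact/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# FediFetcher doesn't match the GPTBot entry, so the * entry applies,
# and /api/v1/statuses/ is not in its disallow list
print(rp.can_fetch("FediFetcher", "https://social.vivaldi.net/api/v1/statuses/112674655146822031/context"))
# True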

cooperaj commented 1 week ago
>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser("https://mastodon.scot/robots.txt")
>>> rp.read()
>>> print(rp.can_fetch("FediFetcher", "https://mastodon.scot/api/v1/accounts/lookup?acct=ionafyfe"))
False

Yeah, I don't know why that is coming back false.

nanos commented 1 week ago

Yeah, this is really weird.

nanos commented 1 week ago

OK, the reason for this is the User-Agent that's being used to fetch the file:

curl -i  https://mastodon.scot/robots.txt -H 'User-Agent: Python-urllib/3.11'
HTTP/2 403
[...]
alt-svc: h3=":443"; ma=86400

error code: 101

When the User-Agent is set to Python-urllib/3.11, we get back a 403. And a 403 response to a robots.txt request is conventionally treated as equivalent to

User-agent: *
Disallow: /
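
For reference, that behaviour comes straight from the standard library: RobotFileParser.read() turns a 401/403 on the robots.txt request into "disallow everything". Roughly, paraphrased from CPython's Lib/urllib/robotparser.py:

def read(self):
    # fetched with urllib's default Python-urllib/3.x user agent
    try:
        f = urllib.request.urlopen(self.url)
    except urllib.error.HTTPError as err:
        if err.code in (401, 403):
            self.disallow_all = True   # treat everything as blocked
        elif 400 <= err.code < 500:
            self.allow_all = True      # no robots.txt, no restrictions
    else:
        raw = f.read()
        self.parse(raw.decode("utf-8").splitlines())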

I'll see if I can change the User Agent being used by the parser.

cooperaj commented 1 week ago

That makes sense, I guess: if you're not defining your user agent explicitly, you're more likely to be something a server would want to block.

https://stackoverflow.com/a/37335469

This does not bode well.

EDIT2

https://stackoverflow.com/a/37935740

Scratch that: it's a little more complicated than the default approach, but nothing too tricky.
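
Something along these lines should do it (a sketch based on that answer; the user agent string here is just a placeholder): fetch robots.txt ourselves with an explicit User-Agent and hand the body to the parser, rather than letting the parser fetch it as Python-urllib/3.x:

import urllib.error
import urllib.request
import urllib.robotparser

def get_robot_parser(robots_url, user_agent="FediFetcher"):
    rp = urllib.robotparser.RobotFileParser()
    req = urllib.request.Request(robots_url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req) as resp:
            rp.parse(resp.read().decode("utf-8").splitlines())
    except urllib.error.HTTPError as err:
        # mirror RobotFileParser.read(): 401/403 means assume everything
        # is blocked, any other 4xx means assume nothing is
        if err.code in (401, 403):
            rp.disallow_all = True
        elif 400 <= err.code < 500:
            rp.allow_all = True
    return rp

With that, get_robot_parser("https://mastodon.scot/robots.txt").can_fetch("FediFetcher", ...) should reflect the actual rules rather than the 403.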

nanos commented 1 week ago

It's absurd that there is no option to change the UA for this, isn't it?!

Anyway, I also found the second option, and this shouldn't be too tricky to implement.

Closing this in favour of #127, which also deals with a change to the UA.