nanos / FediFetcher

FediFetcher is a tool for Mastodon that automatically fetches missing replies and posts from other fediverse instances, and adds them to your own Mastodon instance.
https://blog.thms.uk/fedifetcher?utm_source=github
MIT License

FediFetcher doesn't respect robots.txt #84

Closed 33b5e5 closed 1 week ago

33b5e5 commented 9 months ago

I have this in robots.txt:

User-agent: *
Disallow: /

FediFetcher ignores this and makes endless requests to my instance.

nanos commented 9 months ago

Thanks for this @33b5e5 .

I must admit I'm not 100% convinced that FediFetcher should be following the robots.txt, as it's not a crawler or indexer. I will give this some thought, and also ask the community what they think.

nanos commented 9 months ago

For reference: I've created a poll for this here:

https://mstdn.thms.uk/@michael/111203851367937130

I want to make it clear that in true British fashion this is a non-binding poll, so I may not actually follow along with the majority here (partially because properly implementing robots.txt is complicated and time consuming), but I want to strongly take the community's view into consideration.

virtulis commented 9 months ago

Not an argument for or against, but I just want to point out that you probably don't expect actual AP instances, where your instance's posts are referenced, to abstain from fetching them due to robots.txt.

Of course, FediFetcher is not an instance. And I think it also goes and rechecks for new replies etc. from time to time, and can probably be configured to do that too aggressively.

Perhaps some kind of mutually agreed rate limit (in requests per hour, not minute) would be acceptable for everyone, but that places some of the burden on instance admins to declare it.

This is complicated.

nanos commented 9 months ago

Thanks for your comment @virtulis

Definitely agree that it's complicated. To be fair, FediFetcher does obey rate limits. And rate limits can be configured on the server side to be different per user agent. However, each user of FediFetcher runs their own copy of FediFetcher, so there is little point in implementing something specific on FediFetcher's end, as any such rate limit wouldn't be shared across FediFetcher 'instances'.

So imo if a server wants to have stricter rate limits on FediFetcher than on other traffic, then that's totally understandable, and we'll honour that. But it would be on the server admin to configure their server (as in nginx, Cloudflare, whatever - not their instance software) in that way.
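
For illustration, obeying a server-side rate limit on the client side mostly means backing off when the server answers HTTP 429. A generic sketch only - this isn't FediFetcher's actual code, and the user agent string is made up:

```python
import time

import requests

def get_with_backoff(url, user_agent="FediFetcher (example)", max_retries=5):
    """GET a URL, sleeping whenever the server answers 429 Too Many Requests."""
    for _ in range(max_retries):
        response = requests.get(url, headers={"User-Agent": user_agent}, timeout=10)
        if response.status_code != 429:
            return response
        # Retry-After may be missing or an HTTP date; fall back to a fixed delay.
        try:
            delay = int(response.headers.get("Retry-After", "30"))
        except ValueError:
            delay = 30
        time.sleep(delay)
    raise RuntimeError(f"still rate limited after {max_retries} attempts: {url}")
```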

virtulis commented 9 months ago

Will also reference a related poll/question I had a few months prior: https://loud.computer/@virtulis/110673351562258162

The results aren't really in our favor. But using a small instance without these tools in the current state of fedi is a pretty bad experience.

nanos commented 9 months ago

Interesting. Very different question though, and given the 'algorithmic timeline' framing, not surprising at all 😬

virtulis commented 9 months ago

Indeed I framed it in the least favorable way possible.

nikdoof commented 9 months ago

I'd just implement it, to be honest: make it visible in the logs when fetching a post is being excluded by robots.txt, and make sure you're caching the robots.txt for a period of time so you're not hitting that file a lot instead 😄
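
Something like this would cover the check-and-log part - a rough sketch only, using the standard library; the function names and user agent string are illustrative, not FediFetcher's actual code:

```python
import logging
import urllib.parse
import urllib.robotparser

USER_AGENT = "FediFetcher"  # illustrative; the real UA string may differ

def is_allowed(url):
    """Ask the target host's robots.txt whether we may fetch this URL."""
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(urllib.parse.urljoin(url, "/robots.txt"))
    parser.read()  # fetches and parses robots.txt
    return parser.can_fetch(USER_AGENT, url)

def fetch_post(url):
    if not is_allowed(url):
        # Surface the skip in the logs, as suggested above.
        logging.info("Skipping %s: disallowed by robots.txt", url)
        return None
    ...  # carry on with the normal fetch
```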

Admins may want to exclude their instances from 'bot' traffic, and robots.txt is a well supported and common way to do so. Much in the same way as someone may want to block Googlebot, it's the admin's responsibility to understand what the block will do in terms of their data.

Just to clarify, I'm against the idea of using robots.txt on publicly visible AP instances, but giving people options is always good.

33b5e5 commented 8 months ago

Thanks for the responses and for running the poll. "FediFetcher should honour blanket bans in robots.txt" had the most votes, but not by a lot. I understand the gist of the counter arguments. FediFetcher is not exactly a bot. Still, it's not exactly a human actor on the Fediverse either, and I think there should be a way to opt-out.

Over the last 14 days FediFetcher traffic represents around 4% of the hits to my instance. It's not a big deal, but not insignificant either, especially if its use grows.

I don't think respecting a specific call-out in robots.txt while ignoring the default/wildcard makes sense. It goes against the basic logic of robots.txt. The RFC is clear about it: https://datatracker.ietf.org/doc/html/rfc9309#name-the-allow-and-disallow-line

To evaluate if access to a URI is allowed, a crawler MUST match the paths in "allow" and "disallow" rules against the URI.

So I'd maintain the blanket ban should be respected.
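
For what it's worth, Python's standard robotparser follows exactly this logic (the user agent strings here are just examples):

```python
import urllib.robotparser

blanket = urllib.robotparser.RobotFileParser()
blanket.parse(["User-agent: *", "Disallow: /"])
# No FediFetcher-specific group exists, so the wildcard group applies to it too.
print(blanket.can_fetch("FediFetcher", "https://example.com/users/alice"))    # False

carve_out = urllib.robotparser.RobotFileParser()
carve_out.parse([
    "User-agent: *",
    "Disallow: /",
    "",
    "User-agent: FediFetcher",
    "Allow: /",
])
# An explicit group overrides the wildcard, but only for that agent.
print(carve_out.can_fetch("FediFetcher", "https://example.com/users/alice"))  # True
print(carve_out.can_fetch("SomeOtherBot", "https://example.com/users/alice")) # False
```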

I understand there is an argument that FediFetcher isn't a crawler. I disagree. It's automated. It's acting on behalf of individual users, but in bulk it sends a lot of repetitive traffic for data that may or may not be read by a human.

I can handle it in Nginx. I have a growing pile of user agents that don't respect robots.txt which get served a 403 or 418. But I went to the trouble of opening this issue because I like what the project is doing otherwise.

nanos commented 8 months ago

Thanks again for your detailed response!

Overall, 52% were in favour - that cursed 52/48 ratio again 😁

Anyways, I have decided to implement FediFetcher honouring robots.txt. I think I'd still like to have a settings flag so that this can be turned off if a user really wants to, but with it being enabled by default.

I don't think respecting a specific call-out in robots.txt while ignoring the default/wildcard makes sense.

You are probably right. It also would mean a lot more work as I'd need to implement my own robots.txt parsing algorithm, rather than being able to use an off the shelf solution.

The other problem is time though: implementing this is not trivial, and the structure of the script doesn't exactly lend itself to it - even worse if I want to implement caching the robots.txt, which is a must imo. As such, I make no promise of a timeline, but would welcome a PR if anyone was so inclined.
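
In case anyone does pick this up: for the caching side I'd imagine something roughly like the below. This is only a sketch - the TTL and all names are placeholders, not anything that exists in the script today, and since each scheduled run is a fresh process the cache would ideally be persisted to disk between runs rather than held in memory.

```python
import time
import urllib.parse
import urllib.robotparser

ROBOTS_TTL = 12 * 60 * 60  # placeholder: re-fetch robots.txt after 12 hours
_robots_cache = {}         # host -> (fetched_at, parser)

def robots_for(url):
    """Return a cached (or freshly fetched) parsed robots.txt for url's host."""
    host = urllib.parse.urlsplit(url).netloc
    cached = _robots_cache.get(host)
    if cached and time.monotonic() - cached[0] < ROBOTS_TTL:
        return cached[1]
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(f"https://{host}/robots.txt")
    parser.read()
    _robots_cache[host] = (time.monotonic(), parser)
    return parser
```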

likeazir commented 4 months ago

If no one is working on this, I'm probably going to have a go at it later this week.

nanos commented 4 months ago

That would be amazing! I'm not currently working on this, and won't be anytime soon.

nanos commented 1 week ago

This is now implemented, and cannot be overridden by a setting: FediFetcher will respect robots.txt going forward.