‘Rate-Limit’ FediFetcher

nanos commented 1 week ago

This is essentially a follow up to #122

During that discussion the question was raised whether FediFetcher needs to re-fetch context for the same post over and over again.

Since context can change when a new reply is added we can’t simply never re-request context. But we also don’t need to re-check for new replies 100 times per hour when a year old post has been boosted into someone’s timeline.

So, in the ideal world we’ll have some sort of back off mechanism where

1) for posts less than an hour old we allow checking context up to once per minute. 2) posts between 1 and 24 hours old we allow checking for context once ever 10 min 3) for posts older than 24 hours we allow checking context up to once per day.

I have no idea how feasible this is (some architectural decisions taken early on in FediFetcher‘s development do make this challenging) but clearly the current status cannot continue.

I’d appreciate any input both by users and by admins who may be fed up with FediFetcher querying their instance too much.

(Mentioning @colinstu12 and @p37307 as you’ve both shown interest in this discussion previously)

p37307 commented 1 week ago

It is probably best to add rate limits, and to work on a robots.txt courtesy. https://github.com/nanos/FediFetcher/issues/84. There clearly is a need now for this option so FediFetcher can be stopped so instances who run it are not ip blocked.

nanos commented 1 week ago

Agreed. I’ve begun work on this one. #84 is proving to be tricky as parsing robots.txt is very very hard, and I haven’t yet found a good python package that simplifies this. (Nor do I think that fetching robots.txt over and over again would be a great solution, so we’d need to cache that too…)

colinstu12 commented 1 week ago

1. for posts less than an hour old we allow checking context up to once per minute.
2. posts between 1 and 24 hours old we allow checking for context once ever 10 min
3. for posts older than 24 hours we allow checking context up to once per day.

Agreed, this sounds like a great first step. Would there also be a way to make this the default behavior of some sort but then allow additional parameters to adjust the frequency manually for these three as well? And/or adjust the hour ranges of these as well? I'm personally feeling more like, posts <6hrs old - default timeout period (whatever it is currently before any changes, none? otherwise once per min sounds like more than enough...maybe 5min), posts >6-<24hr check context ever hour, posts >24hrs-48hrs, check context every six hours, >48hrs check once per day. However this ends up working, as long as there is configurability so folks can fine-tune it to their personal desires, this will be amazing.

nanos commented 1 week ago

Thanks @colinstu12 I'm not sure on configurability: It would seem like we'd need a lot of configuration options to make this work, which would seem very confusing.

Can you think of a way of implementing configurability without being confusing / overwhelming?

FYI: I've just merged #124 which implements a first stab at this. Would be great if people could try this out and give feedbac

colinstu12 commented 1 week ago

Option	Required?	Notes
`context-timeout-short`	No	Provide to adjust the timeout period of checking context of posts less than an hour old. Defaults to `1` for up to once per minute.
`context-timeout-medium`	No	Provide to adjust the timeout period of checking context of posts between one and 24 hours old. Defaults to `10` for up to once per ten minutes.
`context-timeout-long`	No	Provide to adjust the timeout period of checking context of posts greater than 24 hours old. Defaults to `1440` for up to once per day.

This will at least allow adjustment of the number of times per each of these already committed time ranges. As far as adjusting the time ranges themselves I worry about folks potentially having overlap in times, or gaps in time, causing god-knows-what kind of issues... so maybe it's best to skip. (unless this error checking isn't terrible/difficult to implement, or if there are gaps it can fall back to default behavior, or something).

Actually maybe it could be something like this, and it would eliminate the possibility of gaps or overlap.

Option	Required?	Notes
`timeout-range-short`	No	Provide to adjust the length of time that the `context-timeout-short` parameter checks. Defaults to `60` for checking context of posts less than an hour old.
`timeout-range-medium`	No	Provide to adjust the length of time that the `context-timeout-medium` parameter checks. Note that you are adjusting the end of the time range, the beginning of this time range is automatically adjusted based on `timeout-range-short` or the default if not configured. Defaults to `1440` for checking context of posts less than 24 hour old.

I guess there isn't any need to adjust the "long" one, because it will always be whatever is older / falls out of the range of the "medium" time period.

nanos commented 1 week ago

Thanks @colinstu12

I must say that I'm still not convinced this adds much value, or is easy to understand. Will give it some more thought.

p37307 commented 1 week ago

FYI, I am paul on Mastodon.

Re my high traffic and numbers of users, You probably are going to like what I have to say. You can disregard the stats I posted in that thread about my instance.

After the recent update to FediFetcher, where the server name is added to the User-Agent, the crawls are coming from within. It also appears each time it is run, it is being counted as a unique visitor.

I don't know what is going on with leah at chaos.social, maybe their users have it setup via api dev and the high number of crawls are coming from their users within and not outside. Only time will tell when the server address is added to the fetcher in their log, if users update.

You probably need to change the user agent to include, ran by user from +servername so people don't think it is masking their server's name, esp since it can be run as a github action. Something like, FediFetcher (Ran from a user at +oldfriends.live; https://go.thms.uk/ff)

nanos commented 1 week ago

This is extremely helpful @p37307 ! Thanks for that!

I was actually almost expecting this to be the case, but I'm relieved to have this confirmed. I guess this at least means instance admins can make informed decisions to disallow their own users' use of FediFetcher, if they feel that's needed, or potentially prod those users into collaborating (Running one FediFetcher instance with multiple Api Keys is much more efficient, than every user running their own instance.)

You probably need to change the user agent to include, ran by user from +servername so people don't think it is masking their server's name, esp since it can be run as a github action.

Valid point. I was actually thinking of including the user's name. But then that raises potentially serious privacy concerns. I don't like your suggestion too much though - it's just so verbose. 🤔

p37307 commented 1 week ago

I was actually thinking of including the user's name. But then that raises potentially serious privacy concerns.

...esp since the IP and then the username would be paired. Not an issue if it's only appearing in the logs of the user's server since they already see the logged in users IP but then when you have it writing it as a pair in a access log at other instances... that would give one side, esp if hostile, the connection to the users account and IP, could be bad esp if the Fedifetcher user is running it from their home computer, like on Windows 10 via python. Not optimal.

it's just so verbose.

I can be a dull person. :)

leahoswald commented 1 week ago

Hi leah from chaos.social here. These are the numbers of requests per endpoint for 19 hours of chaos.social using the UA FediFetcher:

    268 /nodeinfo/2.0
    268 /.well-known/nodeinfo
   1844 /api/v1/accounts/<id>/statuses?limit=40
   1845 /api/v1/accounts/lookup?acct=<account name>
 123485 /api/v1/statuses/<id>/context

nanos commented 1 week ago

Thanks @leahoswald . Again, appreciate you taking the time to respond!

nanos commented 1 week ago

So, I thought I'd mention a couple of stats here as well:

I'm currently testing this, and it had the following impact:

The number of requests to the context endpoint has fallen from about 12k per hour to about 120 per hour. That's a drop of 2 orders of magnitude.
Whilst this primarily benefits the remote instances, it also benefits mine: the average run duration has dropped from about 3 minutes to about 15 seconds (and that's despite the robots.txt check as well).

nanos commented 1 week ago

I’m closing this issue now, as I’ve just released v7.1.0 which includes this.

@leahoswald @p37307 id love to have some feedback from you in a few days/weeks time, to see what difference (if any) these changes have made.

nanos / FediFetcher

‘Rate-Limit’ FediFetcher #123