stefansundin / rssbox

:newspaper: I consume the world via RSS feeds, and this is my attempt to keep it that way.
https://github.com/stefansundin/rssbox/discussions/64
GNU Affero General Public License v3.0
762 stars, 73 forks

Twitter abuse #38

Closed stefansundin closed 1 month ago

stefansundin commented 4 years ago

After years of having no real issues, someone has recently started fetching a ton of Twitter feeds, 3x-4x the normal volume. Just have a look here.

Screen Shot 2020-04-12 at 10 53 27 PM

I am only able to do 1500 Twitter API calls per 15 minutes, and I am running out all the time now. So if you're wondering why you are getting "There was a problem talking to Twitter. Please try again in a moment.", this is why.

I will most likely have to start blocking certain Twitter users to get this situation under control. If you refresh a feed too often, it may stop working, so if you want to keep a feed usable, it is in your best interest not to hammer it.

Todaug commented 4 years ago

Could you technically cache the data received from an RSS subscription for a defined period of time? Suppose you cache the data for 1 minute and the subscriber triggers the fetching 5 times in that minute: only the first fetch would need to hit the API, so you could avoid 4 of those 5 API calls.
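The idea in this comment can be sketched as a small in-memory cache with a time-to-live. This is only an illustration and not RSS Box's actual code: the `TtlCache` class, the injectable clock, and the `"some_user"` key are all made up for the example.

```ruby
# Minimal in-memory TTL cache (illustrative only; not RSS Box's actual code).
class TtlCache
  def initialize(ttl_seconds, clock: -> { Time.now.to_f })
    @ttl = ttl_seconds
    @clock = clock   # injectable so the example can simulate time passing
    @store = {}      # key => [value, expires_at]
  end

  # Returns the cached value for key, or runs the block and caches its result.
  def fetch(key)
    now = @clock.call
    value, expires_at = @store[key]
    return value if expires_at && now < expires_at

    value = yield
    @store[key] = [value, now + @ttl]
    value
  end
end

# Usage: five feed refreshes within one minute trigger a single upstream call.
api_calls = 0
fake_time = 0.0
cache = TtlCache.new(60, clock: -> { fake_time })

5.times do
  cache.fetch("some_user") do
    api_calls += 1 # stands in for a real Twitter API request
    "timeline data"
  end
  fake_time += 10 # ten seconds pass between refreshes
end

puts api_calls
```

With a 60 second TTL, only the first refresh reaches the block; the other four are served from memory until the entry expires.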

stefansundin commented 4 years ago

Yes. The easiest fix is to just put the site behind a CDN. That would resolve most problems.

Ever since creating the app (5 years ago), it has been a fun challenge to try to run it on the Heroku free tier, which only offers the smallest dyno they have, with a very limited amount of RAM (and 25 MB of Redis storage). The app is still usable, but at times it has issues due to the number of requests that it now receives. I've also made it super simple to deploy to your own Heroku account, but to my knowledge only a couple of people have done that.

As for a CDN, the problem is that there really aren't any free CDNs out there. And it would be impossible to have the herokuapp.com domain go to a CDN, so the transition would be hard. I don't think it would be that feasible to build my own caching solution.

I don't know exactly what I will end up doing. I don't really want to make it a paid service, or require sign up.

stefansundin commented 3 years ago

So here's the result of the caching work (#43). Ever since I deployed v412 12 days ago, things have been really solid. Instagram is still not working well (#39), but everything else is great. The app is finally stable, which it hasn't been for a very long time lol. And the best thing is that I'm still hosting it on the free Heroku tier (which gives you an idea of how well you can optimize something and still offer it for free).

Here's the graph of outgoing requests to Twitter over the last year:

Screen Shot 2021-01-26 at 18 42 41

And here's the available Twitter ratelimit over the same time period:

Screen Shot 2021-01-26 at 18 39 30

The data is currently cached for 1 hour, so your results may not contain tweets from the last hour. To find out when the tweets were last fetched from Twitter, look at the main feed's <updated>2021-01-27T01:43:04Z</updated> timestamp. This timestamp used to be the timestamp of the last tweet by the user, but now it represents when the data was put in the cache. If you self-host RSS Box, you can change the cache duration on this line:

https://github.com/stefansundin/rssbox/blob/0ddc71357618d18c941bac9f779c6d0aa98c94de/app.rb#L167

The 60*60 means that a successful response from Twitter is cached for one hour. The next argument, 60 seconds, says how long an error should be cached (an exception was thrown, Twitter is down, etc.).
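To make the two durations concrete, here is a rough sketch of a cache helper that takes separate TTLs for successes and errors. The helper name `cached`, its signature, and the in-memory `CACHE` hash are assumptions made for illustration; they are not copied from app.rb, which may be structured differently.

```ruby
# Hypothetical helper; the name and signature are illustrative, not taken from app.rb.
CACHE = {}

# Returns [value, cached_at]. Successful results are kept for success_ttl seconds.
# A raised error is remembered for error_ttl seconds and re-raised on each hit,
# so a failing upstream is not retried on every single request.
def cached(key, success_ttl, error_ttl, now: Time.now.to_i)
  entry = CACHE[key]
  if entry && now < entry[:expires_at]
    raise entry[:error] if entry[:error]
    return [entry[:value], entry[:cached_at]]
  end

  begin
    value = yield
    CACHE[key] = { value: value, cached_at: now, expires_at: now + success_ttl }
    [value, now]
  rescue StandardError => e
    CACHE[key] = { error: e, cached_at: now, expires_at: now + error_ttl }
    raise
  end
end

# Usage: cache successes for one hour and errors for 60 seconds.
data, updated_at = cached("timeline:some_user", 60 * 60, 60) do
  "tweets" # stands in for the real Twitter request
end
```

The second return value plays the role of the feed's `<updated>` timestamp described above: it records when the data entered the cache, not when the user last tweeted.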

At some point I want to implement a smarter system where users that are more likely to tweet a lot are cached for less time, and users that don't tweet that often are cached for longer. But I'd also need to record the request rate over time, to ensure that the service is still stable even if there are many users that tweet a lot. I never want that ratelimit graph to come close to zero again. It may be a while before I implement this, however. I expect usage to pick up now that things work again, and only time will tell exactly how much.
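One possible shape for such a system, purely as a sketch: derive the cache duration from the average gap between a user's recent tweets and clamp it to a fixed range. The function name and the `MIN_TTL` and `MAX_TTL` bounds are invented for this example, and the sketch does not cover the request-rate tracking mentioned above.

```ruby
MIN_TTL = 15 * 60       # assumed lower bound: 15 minutes
MAX_TTL = 6 * 60 * 60   # assumed upper bound: 6 hours

# tweet_times: epoch seconds of a user's recent tweets, in any order.
# Frequent tweeters get a short TTL, infrequent ones a long TTL.
def adaptive_ttl(tweet_times)
  return MAX_TTL if tweet_times.size < 2 # not enough history to estimate

  gaps = tweet_times.sort.each_cons(2).map { |earlier, later| later - earlier }
  average_gap = gaps.sum / gaps.size.to_f
  average_gap.clamp(MIN_TTL, MAX_TTL).round
end

puts adaptive_ttl([0, 3600, 7200]) # tweets an hour apart
```

Clamping keeps a very active account from being refetched constantly, which matters because the shared ratelimit is the real constraint.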

Anyway, great success, I hope you are enjoying it.

P.S. The URL resolution code has been disabled for the moment. I need to rework it.

stefansundin commented 2 years ago

We hit another milestone. The ratelimit was exhausted early this morning for the first time since the caching was added.

Screen Shot 2021-11-17 at 10 35 02

The Heroku server restarted where the yellow line disappears, and no one looked up a user (using the form on the main page) until ~4:45. When the Heroku server restarts, it loses the entire cache, which means every request results in a request to Twitter. As the cache is rebuilt, fewer requests result in requests being sent to Twitter and the ratelimit is not exhausted as quickly.

Long-term trends:

Screen Shot 2021-11-17 at 10 38 11

I probably need to tweak things a little bit more. Maybe intentionally fail a percentage of requests after the server restarts. I have been thinking of an advanced system to store the cache between Heroku dyno restarts, but I am not sure I want to spend the time to write it.
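As a rough sketch of the "intentionally fail a percentage of requests" idea, the rejection rate could start high right after boot and fall to zero as the cache warms up. Nothing here is from RSS Box itself: the constants, the linear decay, and the function names are all assumptions.

```ruby
WARMUP_SECONDS = 15 * 60  # assumed warm-up window after a restart
INITIAL_SHED_RATE = 0.5   # assumed fraction of cache misses rejected right at boot

# Probability of rejecting a cache miss, decaying linearly to zero.
def shed_probability(seconds_since_boot)
  return 0.0 if seconds_since_boot >= WARMUP_SECONDS

  INITIAL_SHED_RATE * (1.0 - seconds_since_boot.to_f / WARMUP_SECONDS)
end

# random is injectable so the decision can be tested deterministically.
def shed?(seconds_since_boot, random: rand)
  random < shed_probability(seconds_since_boot)
end

puts shed_probability(0)
puts shed_probability(WARMUP_SECONDS / 2)
puts shed_probability(WARMUP_SECONDS)
```

A request that is shed would get a temporary error instead of triggering a Twitter call, spreading the cold-cache load over the warm-up window at the cost of some failed refreshes.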