silicium14 / peertube_index

A centralized search engine for PeerTube videos
https://peertube-index.net
MIT License

Rate Limit for Crawler #1

Closed (Nutomic closed 5 years ago)

Nutomic commented 5 years ago

Hi,

I noticed today that peertube.social was getting a lot of requests for URLs like /api/v1/videos/video-id. At times there were around 50 requests per second, which caused a ton of CPU usage. I don't know if this was you, but it definitely looked like a crawl, and apparently your site started on the same day, so it seems likely.

The problem is gone for now, probably because the crawler has finished its backlog. But you should definitely add a rate limit to your crawler if you haven't already. I suggest something like 1 request per second at most.

silicium14 commented 5 years ago

Hello,

You can tell if a request comes from PeerTube Index by checking:

About the specific spike of requests you noticed, I believe it was not from PeerTube Index because the crawler does not visit the /api/v1/videos/video-id URLs to fetch videos. It uses the /api/v1/videos endpoint, requesting all the available pages with a page size of 100:
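For readers who want to picture this kind of paginated crawl, here is a minimal Python sketch. It is not the PeerTube Index implementation. The names `iter_videos` and `fetch_page_http` are hypothetical, and the `start`/`count` query parameters and the `total`/`data` response fields are assumptions about the PeerTube API that should be checked against its documentation.

```python
import json
from typing import Any, Callable, Dict, Iterator
from urllib.parse import urlencode
from urllib.request import urlopen

PAGE_SIZE = 100  # page size mentioned in the comment above


def fetch_page_http(base_url: str, start: int, count: int) -> Dict[str, Any]:
    """Fetch one page of the video list (assumed `start`/`count` parameters)."""
    query = urlencode({"start": start, "count": count})
    with urlopen(f"{base_url}/api/v1/videos?{query}") as response:
        return json.load(response)


def iter_videos(
    fetch_page: Callable[[int, int], Dict[str, Any]],
    page_size: int = PAGE_SIZE,
) -> Iterator[Any]:
    """Yield every video by requesting pages one after another.

    `fetch_page(start, count)` must return a dict with a `total` count and
    a `data` list (assumed response shape).
    """
    start = 0
    while True:
        page = fetch_page(start, page_size)
        videos = page["data"]
        yield from videos
        start += page_size
        # Stop on an empty page or once every advertised video was requested.
        if not videos or start >= page["total"]:
            break
```

To run it against a real instance, `fetch_page` could be `functools.partial(fetch_page_http, "https://instance.example")`, where the host is a placeholder. An instance with 250 videos would then receive three requests to /api/v1/videos and none to the per-video URLs.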

Moreover, PeerTube Index has already been up and crawling for several months, scanning its known PeerTube instances every day.

As for limiting the rate of requests sent to an instance being scanned, I decided that requests going to a specific instance should be made sequentially. Therefore, there is only one request at a time going from the PeerTube Index crawler to a particular instance being scanned. This may well result in more than one request per second, but I believe this is acceptable.
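As an illustration of the "one request at a time per instance" policy, here is a minimal Python sketch. It is not the crawler's actual code. The class name `PerHostSerializer` and the optional `min_interval` parameter are hypothetical; `min_interval` is added only to show how the suggested cap of 1 request per second could be layered on top.

```python
import threading
import time
from collections import defaultdict
from typing import Callable, Dict, TypeVar
from urllib.parse import urlsplit

T = TypeVar("T")


class PerHostSerializer:
    """Allow at most one in-flight request per host (hypothetical helper)."""

    def __init__(self, min_interval: float = 0.0) -> None:
        # Optional pause, in seconds, between the end of one request and the
        # start of the next request to the same host.
        self._min_interval = min_interval
        self._locks: Dict[str, threading.Lock] = defaultdict(threading.Lock)
        self._last_finished: Dict[str, float] = {}
        self._registry_lock = threading.Lock()

    def _lock_for(self, host: str) -> threading.Lock:
        # Guard the registry so two threads never create two locks per host.
        with self._registry_lock:
            return self._locks[host]

    def run(self, url: str, request: Callable[[str], T]) -> T:
        host = urlsplit(url).netloc
        with self._lock_for(host):  # requests to one host run sequentially
            last = self._last_finished.get(host)
            if last is not None:
                wait = self._min_interval - (time.monotonic() - last)
                if wait > 0:
                    time.sleep(wait)
            try:
                return request(url)
            finally:
                self._last_finished[host] = time.monotonic()
```

With `min_interval=0.0` this matches the behaviour described above: the request rate for an instance depends only on how fast that instance answers. Setting `min_interval=1.0` would keep it to roughly one request per second, while different instances can still be scanned in parallel.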

Nutomic commented 5 years ago

Okay, then sorry to bother you, and thanks for the information :)