Rate Limit for Crawler

Nutomic commented 5 years ago

Hi,

I noticed today that I was getting a lot of requests on peertube.social for URLs like /api/v1/videos/video-id. At times I was getting around 50 requests per second, which caused a ton of CPU usage. I don't know if this was you, but it definitely looked like a crawl, and apparently your site started on the same day, so it seems likely.

The problem is gone for now, probably because the crawler has finished its backlog. But you should definitely add a rate limit to your crawler if you haven't already. I suggest something like 1 request per second at most.

silicium14 commented 5 years ago

Hello,

You can tell if a request comes from PeerTube Index by checking:

the user agent of the HTTP request: I use PeertubeIndex (this is not really reliable, because anyone could be using this user agent)
the source IP address of the request: it has to be the IP returned by a domain name lookup of peertube-index.net.
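
For instance admins who want to automate the two checks above, here is a minimal sketch. It is not part of PeerTube Index. The function names are hypothetical, and it assumes the user agent header contains the string PeertubeIndex (the exact header value may differ).

```python
import socket

# Values taken from the comment above.
EXPECTED_USER_AGENT = "PeertubeIndex"
INDEX_HOSTNAME = "peertube-index.net"


def resolve_ips(hostname):
    """Return the set of IP addresses the hostname currently resolves to."""
    infos = socket.getaddrinfo(hostname, None)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return {info[4][0] for info in infos}


def looks_like_peertube_index(user_agent, source_ip, allowed_ips):
    """Require both checks, since the user agent alone can be spoofed.

    Assumption: a substring match on the user agent is good enough.
    """
    return EXPECTED_USER_AGENT in user_agent and source_ip in allowed_ips
```

In practice `allowed_ips` would come from `resolve_ips(INDEX_HOSTNAME)`, refreshed from time to time in case the DNS record changes.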

About the specific spike of requests you noticed, I believe it was not from PeerTube Index because the crawler does not visit the /api/v1/videos/video-id URLs to fetch videos. It uses the /api/v1/videos endpoint, requesting all the available pages with a page size of 100:

GET /api/v1/videos?count=100&start=0
GET /api/v1/videos?count=100&start=100
GET /api/v1/videos?count=100&start=200
...
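
To make the paging pattern concrete, here is a small sketch that generates those request paths. It is only an illustration, not the crawler's real code: `page_paths` is a hypothetical helper, and it assumes the total number of videos on the instance is already known.

```python
from urllib.parse import urlencode


def page_paths(total, count=100):
    """Yield one /api/v1/videos path per page of `count` videos.

    `total` is assumed to be the number of videos on the instance.
    """
    for start in range(0, total, count):
        yield "/api/v1/videos?" + urlencode({"count": count, "start": start})


# Paths needed to cover an instance with 250 videos.
paths_for_250_videos = list(page_paths(250))
```

With 250 videos this gives three paths, with start=0, start=100 and start=200, matching the requests listed above.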

Moreover, PeerTube Index has already been up and crawling for several months now, scanning its known PeerTube instances every day.

As for limiting the rate of requests sent to an instance being scanned, I decided that requests going to a specific instance should be made sequentially. Therefore there is only one request at a time going from the PeerTube Index crawler to a particular instance being scanned. This may well cause more than one request per second, but I believe this is acceptable.
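
A rough sketch of what "sequential" means here, again not the actual crawler code. `crawl_sequentially` and `fetch` are hypothetical names. The `min_interval` parameter is an addition for illustration only: it is left at 0 to match the behaviour described above, and setting it to 1.0 would give the 1 request per second suggested earlier in the thread.

```python
import time


def crawl_sequentially(paths, fetch, min_interval=0.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Fetch each path one after another for a single instance.

    A request only starts once the previous one has returned, so there is
    never more than one request in flight. If min_interval > 0, consecutive
    request starts are also spaced at least that many seconds apart.
    `fetch` is a caller-supplied function (hypothetical) doing the HTTP GET.
    """
    results = []
    last_start = None
    for path in paths:
        if last_start is not None:
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        results.append(fetch(path))
    return results
```

Without `min_interval`, the request rate depends only on how fast the instance answers, which is why it can exceed one request per second.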

Nutomic commented 5 years ago

Okay then, sorry to bother you, and thanks for the information :)

silicium14 / peertube_index

Rate Limit for Crawler #1