kronusaturn / lw2-viewer

An alternative frontend for LessWrong 2.0
https://www.greaterwrong.com/
MIT License

Suddenly increased API query rate from GreaterWrong #21

Closed (jimrandomh closed this issue 5 years ago)

jimrandomh commented 5 years ago

LessWrong is seeing a significant increase in the rate of API requests from GreaterWrong, starting around April 30.

[Screenshot, 2019-05-08: graph of daily logfile lines containing the Drakma user-agent string]

(The unit of this graph is "logfile lines containing the Drakma user-agent string per day", which is ~2x the number of requests.) These have a crawler-like access pattern, seemingly visiting all user pages, including those of uninteresting decade-inactive users, and clicking all the way through to their last comments page. This is unfortunately hitting an O(n^2) issue in LessWrong; satisfying a request for the Nth page of comments by a user currently involves sorting the top (pagesize*N) comments and returning only the last page.
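To make the quadratic cost concrete, here is a rough back-of-envelope sketch; the page size of 20 is an assumption for illustration, not a figure taken from this thread.

```typescript
// Back-of-envelope cost of offset-based pagination under a crawler that
// walks every page of a user's comments. PAGE_SIZE is a hypothetical value.
const PAGE_SIZE = 20;

// Work to serve page N: the server sorts the top PAGE_SIZE * N comments
// and returns only the last page.
const costOfPage = (n: number): number => PAGE_SIZE * n;

// Total work to crawl all pages of a user with `totalComments` comments:
// PAGE_SIZE * (1 + 2 + ... + P), which grows as totalComments^2 / (2 * PAGE_SIZE).
function crawlCost(totalComments: number): number {
  const pages = Math.ceil(totalComments / PAGE_SIZE);
  let total = 0;
  for (let n = 1; n <= pages; n++) total += costOfPage(n);
  return total;
}

console.log(crawlCost(2000)); // 101,000 comment records sorted to serve 2,000 comments
```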

Given the suddenness of the increase, this seems likely to be either a recently introduced bug in GW, a sudden change in GW's caching behavior, or a crawler that has recently started hitting GreaterWrong and is being passed along as-is. It's making up a surprising portion of our total query volume and is causing some performance problems on our end. While we do intend for the LW API to be fast enough to support GreaterWrong and other users, there are some things GW can do to make this easier.

LessWrong's robots.txt has crawl-delay: 5 and GreaterWrong's doesn't; adding one may help. GreaterWrong also currently splits the related GraphQL queries generated by a single pageload into multiple HTTP requests, whereas LessWrong groups them together into a single POST body. Batching them ensures that related queries reach the same server and can benefit from caching of resources that the queries share.
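As an illustration of the batching described above, here is a minimal TypeScript sketch that sends two related queries in one POST body as a JSON array. It assumes the server accepts Apollo-style batched requests (an array of operations per POST); the endpoint URL and the query shapes are placeholders, not queries taken from either codebase.

```typescript
// Assumed endpoint; not confirmed in this thread.
const GRAPHQL_ENDPOINT = "https://www.lesswrong.com/graphql";

async function fetchPostWithComments(postId: string) {
  // Two related queries for one pageload, batched into a single request body.
  const batch = [
    {
      query: "query Post($id: String) { post(id: $id) { title } }",
      variables: { id: postId },
    },
    {
      query: "query Comments($postId: String) { comments(postId: $postId) { body } }",
      variables: { postId },
    },
  ];

  // One POST carries both queries, so they reach the same server and can
  // share whatever per-request caching the backend does.
  const response = await fetch(GRAPHQL_ENDPOINT, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(batch),
  });

  // A batched request returns an array of results in the same order.
  return (await response.json()) as Array<{ data?: unknown; errors?: unknown }>;
}
```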

kronusaturn commented 5 years ago

This appears to be ArchiveTeam's bot, which does not respect robots.txt at all, and also uses an extremely aggressive request rate. I do have Disallow: /*offset= in robots.txt which avoids the O(n^2) problem for normal crawlers. I'll try to enforce this with a hard block instead of just relying on robots.txt.