pjotrp / biogems.info

Tools for keeping track of biogems. Moved to https://git.thebird.nl/free/biogems.info
http://biogems.info/

Connection caching #49

Closed: mamarjan closed this issue 10 years ago

mamarjan commented 11 years ago

From what I can tell, the methods in http.rb open a new connection each time they're used. That means, for example, that each of the 100+ requests to the GitHub API opens a new connection.

Using some kind of cache (probably just a hash of open connections, keyed by scheme (http/https) and host name) should speed up site generation considerably.
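
A minimal sketch of what that cache could look like with Net::HTTP; the module and method names here are illustrative, not the actual http.rb code:

```ruby
require 'net/http'
require 'uri'

# Illustrative sketch: keep one started, keep-alive Net::HTTP connection
# per scheme/host/port and reuse it across requests.
module ConnectionCache
  @connections = {}

  def self.connection_for(uri)
    key = [uri.scheme, uri.host, uri.port]
    @connections[key] ||= begin
      http = Net::HTTP.new(uri.host, uri.port)
      http.use_ssl = (uri.scheme == 'https')
      http.start # opens the TCP (and TLS) connection once
    end
  end

  def self.get(url)
    uri = URI(url)
    connection_for(uri).request(Net::HTTP::Get.new(uri.request_uri))
  end
end
```

With something like this in place, the 100+ GitHub API requests would share a handful of connections instead of opening a new one each time.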

pjotrp commented 11 years ago

Worth trying - especially with authentication included.

mamarjan commented 11 years ago

After a couple of hours of trying, all I got was something like a two-minute saving: 18 minutes instead of 20. I also had to move to Ruby 2.0.0 for that. I believe the real solution would be to refactor the code to fetch data in parallel, using something like this:

https://github.com/typhoeus/typhoeus
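
For what it's worth, a rough sketch of what parallel fetching with Typhoeus could look like; the URLs are just placeholders:

```ruby
require 'typhoeus'

hydra = Typhoeus::Hydra.new(max_concurrency: 20)
responses = {}

# Placeholder URLs; in practice these would be the per-gem API endpoints.
urls = %w[
  https://api.github.com/repos/pjotrp/biogems.info
  https://rubygems.org/api/v1/gems/bio.json
]

urls.each do |url|
  request = Typhoeus::Request.new(url, followlocation: true)
  request.on_complete { |response| responses[url] = response.body }
  hydra.queue(request)
end

hydra.run # executes all queued requests concurrently
```

With 100+ independent API requests, running even 20 at a time should cut the wall-clock wait substantially, as long as the servers aren't throttling per client.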

pjotrp commented 11 years ago

Maybe. It appears to me that the RubyGems and GitHub APIs are just slow, most likely on purpose. If they are throttling, parallel requests won't help. I don't think our code is terribly slow, but only a profiler can show that.

pjotrp commented 11 years ago

I think the way forward is to cache all items. For example, stargazers need only be updated every other day, and issues daily; we don't have to update everything every time. I'll probably implement that next time round.
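
Something along these lines, perhaps; the intervals, file layout, and helper names are made up for illustration:

```ruby
require 'json'

# Illustrative refresh intervals, in seconds.
REFRESH = {
  stargazers: 2 * 24 * 3600, # every other day
  issues:     24 * 3600,     # daily
}

# Return the cached data for kind/name if it is still fresh,
# otherwise run the block (the real HTTP fetch) and cache the result.
def cached_fetch(kind, name, cache_dir = 'cache')
  path = File.join(cache_dir, "#{kind}-#{name}.json")
  if File.exist?(path) && (Time.now - File.mtime(path)) < REFRESH[kind]
    return JSON.parse(File.read(path))
  end
  data = yield
  Dir.mkdir(cache_dir) unless Dir.exist?(cache_dir)
  File.write(path, JSON.generate(data))
  data
end

# Usage (fetch_stargazers is a hypothetical HTTP helper):
# cached_fetch(:stargazers, 'bio-table') { fetch_stargazers('bio-table') }
```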

mamarjan commented 11 years ago

I don't think you need a profiler for that. See this output from the time command while running ./create_website.sh:

```
real    20m5.076s
user    1m45.683s
sys     0m32.478s
```

I guess that means the actual processing takes a bit over two minutes (user + sys is about 2m18s), and the remaining roughly 17m47s of wall-clock time is spent waiting for answers to HTTP GET requests.

pjotrp commented 11 years ago

Both Ruby code and HTTP waiting happen in user. I presume it is HTTP, but there is only one way to be sure :)
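
One quick way to check, short of a full profiler, would be to accumulate the time spent inside the HTTP calls and compare it against total wall-clock time. A sketch, assuming the fetches can be funneled through one wrapper (the hook point into http.rb and the URL are hypothetical):

```ruby
require 'benchmark'
require 'net/http'
require 'uri'

$http_wait = 0.0

# Wrap every HTTP fetch so time spent waiting on the network
# is accumulated separately from everything else.
def timed_get(url)
  body = nil
  $http_wait += Benchmark.realtime { body = Net::HTTP.get(URI(url)) }
  body
end

total = Benchmark.realtime do
  timed_get('https://rubygems.org/api/v1/gems/bio.json') # placeholder URL
end
puts "total: #{total.round(2)}s, of which HTTP wait: #{$http_wait.round(2)}s"
```

If the accumulated HTTP wait accounts for nearly all of the 20 minutes, that settles the question without profiling the Ruby code at all.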