IMO we need to split this by use case. When it runs as a periodic Celery task, the total time doesn't matter much; when you want results on demand, it does. But you rarely need very general data quickly, and single questions are already fast. On the other hand, we could also download the whole repo to some intermediate point, do the calculations there, and send back a ready result.
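For the Celery option, a minimal sketch of what the cyclic task could look like (broker URL, task name, and storage are assumptions, nothing here is existing project code):

```python
from celery import Celery

# Sketch only: broker URL and task naming are assumptions, not project code.
app = Celery("time_tracker", broker="redis://localhost:6379/0")

# Run the refresh every hour via celery beat.
app.conf.beat_schedule = {
    "refresh-time-report-hourly": {
        "task": "time_tracker.tasks.refresh_time_report",
        "schedule": 3600.0,  # seconds
    },
}

@app.task(name="time_tracker.tasks.refresh_time_report")
def refresh_time_report():
    # Hypothetical body: pull commits/issues from GitHub once, compute the
    # report, and store it (DB/Redis) so interactive requests stay fast.
    pass
```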
What about issues and PRs? You can't clone those.
I discovered that we have a ton of duplicate requests.
Enable `github.enable_console_debug_logging()` to see them.
In my case there were 193 requests to only 21 unique URLs.
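For reference, turning that on is just this (minimal sketch; the token and repo accessor calls are placeholders):

```python
import github

# Print every HTTP request PyGithub makes, so duplicates become visible.
github.enable_console_debug_logging()

gh = github.Github("<token>")  # placeholder token
repo = gh.get_repo("rdev-hackaton/GitHubTimeTracker")

# Any accessor call below now logs its request(s) to the console.
commits = list(repo.get_commits())
```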
:clock1: 1h
Actually, that's great info, because duplicate requests are something we can actually fight. If the problem were just GitHub's API itself, we couldn't do anything about it.
Even if the collection took around 5-10 s, it would be a dramatic improvement over the current >50 s.
Here's the full log in case you want to play with the data: gh.txt. Edit: the lines were sorted alphabetically.
https://github.com/rdev-hackaton/GitHubTimeTracker/blob/master/time_tracker/backends/sources/github_source.py Here is why: all the grouped requests are written in for loops. :clock10: 10m
Yeah, it should call `self.get_commits()` instead of `self._repo.get_commits()`. Easy fix.
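A rough sketch of that kind of fix, assuming a memoizing `get_commits()` on the source class (names modeled on github_source.py, but this is not the actual project code):

```python
class GithubSource:
    def __init__(self, repo):
        self._repo = repo
        self._commits = None  # cached commit listing

    def get_commits(self):
        # Hit self._repo.get_commits() only once; later calls reuse the
        # cached list instead of triggering the same API requests again.
        if self._commits is None:
            self._commits = list(self._repo.get_commits())
        return self._commits
```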
For loops aren't the problem. I patched the `github.Repository` class to print attribute names in `__getattribute__`, and there's no duplication.
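Roughly, the patch was something like this (a sketch; the exact code wasn't posted, and the `get_` name filter is an assumption):

```python
from github.Repository import Repository

_original_getattribute = Repository.__getattribute__

def _tracing_getattribute(self, name):
    # Print the public accessors as they are looked up, so repeated
    # get_commits()/get_issues() calls would show up as duplicate lines.
    if name.startswith("get_"):
        print(name)
    return _original_getattribute(self, name)

Repository.__getattribute__ = _tracing_getattribute
```

With a patch like that in place, the run printed: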
```
Repository: rdev-hackaton/GitHubTimeTracker
Loading...
get_issues
get_commits
Time: 1:00:00 Comment: None
Time: 0:10:00 Comment: None
...
Loading took 167.28886127471924
```
PyGithub is the problem :disappointed:
Edit: the issue persists on the current PyGithub master branch.
github3.py looks very promising; I'll try to adapt our existing source to use it.
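For what it's worth, fetching the same data with github3.py would look roughly like this (a sketch based on the github3.py 1.x API; the token is a placeholder):

```python
import github3

gh = github3.login(token="<token>")  # placeholder token
repo = gh.repository("rdev-hackaton", "GitHubTimeTracker")

# Both calls return lazy iterators over the paginated API results.
for commit in repo.commits():
    print(commit.sha)

for issue in repo.issues(state="all"):
    print(issue.number)
```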
We should do something about it.