How could we provide this functionality, considering that some search APIs (such as the Github API) limit the number of projects that can be retrieved?
You can use the API to get a list of all repositories on Github [1] (you will have to assign a value to the `since` parameter in order to paginate through the projects).
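For illustration, a minimal sketch of that pagination loop against the `/repositories` endpoint (the class and method names are hypothetical, not groundhog's actual code):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class RepositoryLister {

    // Fetches one page of the "list all repositories" endpoint.
    // Github returns repositories with id > since, so the caller
    // feeds the last id seen back in to advance through the list.
    static String fetchPage(long since) throws Exception {
        URL url = new URL("https://api.github.com/repositories?since=" + since);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/vnd.github.v3+json");
        StringBuilder body = new StringBuilder();
        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
            body.append(line);
        }
        in.close();
        return body.toString(); // a JSON array of repositories
    }

    public static void main(String[] args) throws Exception {
        long since = 0; // start from the first repository ever created
        System.out.println(fetchPage(since));
        // A real crawler would parse the JSON, record each "id",
        // and call fetchPage again with the highest id seen.
    }
}
```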
I will start implementing this feature using the method above, but we will still have to think about a solution to the following problems:
1 - Github limits the number of requests per hour [2]. So is groundhog going to wait until the Github API is available again (one hour later)? (See the sketch after this list.)
2 - The number of projects on Github is too big, so what are we going to do to deal with the memory limits of the machines that will run groundhog? IMO we should also let the user specify (through the command line) the maximum number of projects they want to download.
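To make problem 1 concrete: Github reports the remaining quota in the `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers, so one possible strategy is to sleep until the reset time. A minimal sketch (the class and method names are hypothetical, and whether groundhog should block like this or fail fast is exactly the open question):

```java
import java.net.HttpURLConnection;

public class RateLimitGuard {

    // Inspects the rate-limit headers Github attaches to every API
    // response. If the quota is exhausted, sleeps until the time in
    // X-RateLimit-Reset (a Unix timestamp in seconds) has passed.
    static void waitIfExhausted(HttpURLConnection conn) throws InterruptedException {
        String remaining = conn.getHeaderField("X-RateLimit-Remaining");
        if (remaining != null && Integer.parseInt(remaining) == 0) {
            long resetEpoch = Long.parseLong(conn.getHeaderField("X-RateLimit-Reset"));
            long millisToWait = resetEpoch * 1000L - System.currentTimeMillis();
            if (millisToWait > 0) {
                System.out.println("Rate limit hit; sleeping for "
                        + millisToWait / 1000 + "s until the quota resets");
                Thread.sleep(millisToWait);
            }
        }
    }
}
```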
Any ideas, @rodrigoalvesvieira, @gustavopinto?
[1] - http://developer.github.com/v3/repos/#list-all-repositories
[2] - http://developer.github.com/v3/#rate-limiting
1 - Well, with 5,000 requests per hour you can list up to 500,000 projects (we can get 100 project URLs per request). So it may not be a problem, since, I think, it would be very difficult to download 500,000 projects in one hour.
2 - I agree. But, similarly, the user can ask for a huge number of projects. So I think that groundhog should periodically verify that the user has enough free disk space (no less than 200 MB, for example).
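A minimal sketch of such a periodic check, assuming the 200 MB threshold mentioned above (the class name and threshold are illustrative only, not anything groundhog actually uses):

```java
import java.io.File;

public class DiskSpaceCheck {

    // The 200 MB floor is just the example threshold from the
    // discussion above.
    static final long MIN_FREE_BYTES = 200L * 1024 * 1024;

    // Returns true while the download directory still has room;
    // a crawler could call this between project downloads and
    // pause or abort when it returns false.
    static boolean hasEnoughSpace(File downloadDir) {
        return downloadDir.getUsableSpace() >= MIN_FREE_BYTES;
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        System.out.println("enough space: " + hasEnoughSpace(dir));
    }
}
```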
@gustavopinto
The problem with the first solution is that Github limits the number of requests to 60 per hour for unauthenticated requests. This means that we will have to, somehow, use authenticated requests, and this means that the user will have to provide some kind of confidential information (from their Github account) to groundhog. Nevertheless, the current architecture of groundhog will first search for projects and then download their content, so your logic of downloading 500,000 projects is not applicable here (unless we change groundhog's architecture).
P.S.: I've already created a method that searches all Github projects iteratively, but I forgot to link the commit to this issue.
Hi @dnr2,
> use authenticated requests and this means that the user will have to provide [..]
The user can pass their Github auth token through the command line. If not, we can throw an exception advising the user that their number of "free requests" is exhausted. A minimal sketch of both follows.
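Here is one way that could look, assuming the token arrives as a plain command-line argument (the class name and argument layout are hypothetical, not groundhog's actual interface):

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class AuthenticatedRequest {

    // Opens a connection to the Github API, attaching the user's
    // OAuth token (which raises the quota to 5,000 requests/hour)
    // when one was supplied on the command line.
    static HttpURLConnection open(String endpoint, String token) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        if (token != null) {
            conn.setRequestProperty("Authorization", "token " + token);
        }
        return conn;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical invocation: java AuthenticatedRequest <oauth-token>
        String token = args.length > 0 ? args[0] : null;
        HttpURLConnection conn = open("https://api.github.com/repositories", token);
        // Github answers a rate-limited request with 403 and a
        // zeroed X-RateLimit-Remaining header.
        if (conn.getResponseCode() == 403
                && "0".equals(conn.getHeaderField("X-RateLimit-Remaining"))) {
            throw new IllegalStateException(
                    "The number of free Github API requests is exhausted; "
                    + "pass an OAuth token on the command line to continue.");
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}
```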
> Nevertheless, the currently architecture of groundhog will first search for projects and then download their content, thus your logic of downloading 500.000 projects is not applicable here.
Ok, that is right. But the number of requests is limited only when you are using the Github API, for instance when we are listing projects. When we are downloading projects, we are not consuming the Github API; in fact, we are just using Github itself.
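For example, cloning over plain git talks to github.com rather than api.github.com, so it never touches the rate-limited endpoints. A hypothetical sketch (not groundhog's actual download code):

```java
import java.io.File;

public class ProjectDownloader {

    // Clones a repository with the git command-line client.
    // This fetches from github.com directly, not api.github.com,
    // so it does not consume any of the hourly API quota.
    static void download(String owner, String name, File destDir) throws Exception {
        String cloneUrl = "https://github.com/" + owner + "/" + name + ".git";
        Process p = new ProcessBuilder("git", "clone", cloneUrl,
                        new File(destDir, name).getPath())
                .inheritIO()
                .start();
        if (p.waitFor() != 0) {
            throw new RuntimeException("git clone failed for " + cloneUrl);
        }
    }

    public static void main(String[] args) throws Exception {
        download("octocat", "Hello-World", new File("downloads"));
    }
}
```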
I've changed the name of the issue just to emphasize that we are, as of now, only worried about Github.
I've finished the implementation of this feature and it's working for Github. The only thing left to do is to get the user's Github OAuth2 token from a command-line parameter, but I will only do this after merging this branch with the master (because it now has a better way of passing parameters: through a file).
Wow! Great work, @dnr2!
Good work. What kind of search can we currently perform?
Currently groundhog only searches for predefined projects, i.e. you have to explicitly choose the projects that will be analysed. Groundhog must implement a generic search that is capable of finding and analyzing all projects within a forge.
With this functionality the user will be able to use the "*" wildcard as a parameter to search all projects.
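A minimal sketch of how that dispatch might look (the names are hypothetical; groundhog's real entry point may differ):

```java
public class SearchDispatcher {

    // When the user passes "*" as the search term, fall through to
    // the list-all-repositories crawl from the earlier sketch
    // instead of a name-based search.
    static void search(String term) {
        if ("*".equals(term)) {
            System.out.println("crawling every repository via /repositories");
            // e.g. loop over RepositoryLister.fetchPage(since)
        } else {
            System.out.println("searching projects matching: " + term);
        }
    }

    public static void main(String[] args) {
        search(args.length > 0 ? args[0] : "*");
    }
}
```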