spgroup / groundhog

A framework for crawling GitHub projects and raw data and extracting metrics from them
http://spgroup.github.io/groundhog
GNU General Public License v2.0

Implement "search all projects" #19

Closed · dnr2 closed 11 years ago

dnr2 commented 11 years ago

Currently, Groundhog only searches for predefined projects, i.e. you have to explicitly choose the projects that will be analysed. Groundhog must implement a generic search capable of finding and analyzing all projects within a forge.

With this functionality the user will be able to use the "*" wildcard as a parameter to search all projects.
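As a rough illustration, dispatching on the wildcard might look like the sketch below (a hypothetical example; the class name and messages are placeholders, not Groundhog's actual CLI code):

```java
/**
 * Hypothetical sketch of the "*" wildcard dispatch described above;
 * names and messages are placeholders, not Groundhog's real API.
 */
public class SearchDispatch {
    public static void main(String[] args) {
        String query = args.length > 0 ? args[0] : "*"; // default: search everything
        if ("*".equals(query)) {
            System.out.println("searching ALL projects in the forge");
        } else {
            System.out.println("searching for the named project: " + query);
        }
    }
}
```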

dnr2 commented 11 years ago

How could we provide this functionality considering that some search APIs (such as the Github API) limit the number of projects that can be retrieved?

dnr2 commented 11 years ago

You can use the API to get a list of all repositories on Github [1] (you will have to assign a value to the `since` parameter in order to navigate through the projects).
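For illustration, a minimal sketch of paging through that endpoint, assuming Java 11's `java.net.http.HttpClient` and following the `since` cursor that the API advertises in its `Link` header (not Groundhog's actual code):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ListAllRepos {
    // The Link header advertises the next cursor, e.g.
    // <https://api.github.com/repositories?since=364>; rel="next"
    private static final Pattern NEXT =
            Pattern.compile("<[^>]*[?&]since=(\\d+)[^>]*>;\\s*rel=\"next\"");

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        long since = 0;                        // cursor: id of the last repository seen
        for (int page = 0; page < 3; page++) { // only a few pages, as a demo
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://api.github.com/repositories?since=" + since))
                    .header("Accept", "application/vnd.github.v3+json")
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("fetched a page of repositories after id " + since);

            String link = response.headers().firstValue("link").orElse("");
            Matcher m = NEXT.matcher(link);
            if (!m.find()) break;              // no rel="next": nothing left to page
            since = Long.parseLong(m.group(1));
        }
    }
}
```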

I will start implementing this feature using the method above, but we will still have to think about a solution to the following problems:

1 - Github limits the number of requests per hour [2]. So is Groundhog going to wait until the Github API is available again (one hour later)?

2 - The number of projects on Github is very large. So what are we going to do to deal with the memory limits of the machines that will run Groundhog? IMO we should also let the user specify (through the command line) the maximum number of projects they want to download.

any ideas? @rodrigoalvesvieira , @gustavopinto ?

[1] - http://developer.github.com/v3/repos/#list-all-repositories

[2] - http://developer.github.com/v3/#rate-limiting

gustavopinto commented 11 years ago

1 - Well, with 5,000 requests per hour you can download up to 500,000 projects (we can get 100 project URLs per request). So it may not be a problem, since, I think, it would be very difficult to download 500,000 projects in one hour.

2 - I agree. But, similarly, the user can request a huge number of projects. So I think that Groundhog should periodically verify that the user has enough disk space (not less than 200MB, for example).
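A minimal sketch of such a periodic check, using `File.getUsableSpace()`; the 200MB floor comes from the comment above, while the class and method names are illustrative:

```java
import java.io.File;

/**
 * Sketch of the free-space check suggested above; the 200MB floor
 * matches the comment, the names are placeholders.
 */
public class DiskSpaceGuard {
    private static final long MIN_FREE_BYTES = 200L * 1024 * 1024; // ~200MB

    /** True if the download directory still has room to keep going. */
    static boolean hasEnoughSpace(File downloadDir) {
        return downloadDir.getUsableSpace() >= MIN_FREE_BYTES;
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        if (!hasEnoughSpace(dir)) {
            System.err.println("Less than 200MB free; stop downloading projects.");
        }
    }
}
```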

dnr2 commented 11 years ago

@gustavopinto

The problem with the first solution is that Github limits the number of requests to 60 per hour for unauthenticated requests. This means that we will have to, somehow, use authenticated requests and this means that the user will have to provide some kind of confidential information (from their Github account) to Groundhog. Also, the current architecture of Groundhog first searches for projects and then downloads their content, so your logic of downloading 500,000 projects is not applicable here (unless we change Groundhog's architecture).

P.S.: I've already created a method that searches all Github projects iteratively, but I forgot to link the commit to this issue.

gustavopinto commented 11 years ago

Hi @dnr2,

> use authenticated requests and this means that the user will have to provide [..]

The user can pass their Github auth token through the command line. If not, we can throw an exception advising the user that the number of "free requests" has been exhausted.
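A minimal sketch of that flow, assuming the token arrives as a plain command-line argument (the argument handling and class name are placeholders): it sends the token in the `Authorization` header when present, and throws once GitHub reports the quota as exhausted.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AuthenticatedSearch {
    public static void main(String[] args) throws Exception {
        String token = args.length > 0 ? args[0] : null; // token from the command line
        HttpRequest.Builder builder = HttpRequest.newBuilder()
                .uri(URI.create("https://api.github.com/repositories"));
        if (token != null) {
            builder.header("Authorization", "token " + token); // authenticated quota
        }
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(builder.build(), HttpResponse.BodyHandlers.ofString());

        String remaining = response.headers()
                .firstValue("X-RateLimit-Remaining").orElse("");
        if (response.statusCode() == 403 && "0".equals(remaining)) {
            throw new IllegalStateException(
                    "Github's free request quota is exhausted; pass an auth token.");
        }
    }
}
```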

> Also, the current architecture of Groundhog first searches for projects and then downloads their content, so your logic of downloading 500,000 projects is not applicable here.

Ok, that is right. But the number of requests is limited only when you are using the Github API, for instance when we are listing projects. When we are downloading projects, we are not consuming the Github API; in fact, we are just using Github itself.

fernandocastor commented 11 years ago

I've changed the name of the issue just to emphasize that we are, as of now, only worried about Github.

dnr2 commented 11 years ago

I've finished the implementation of this feature and it's working for Github. The only thing left to do is to get the user's Github OAuth2 token from a command-line parameter, but I will only do this after merging this branch into master (because it now has a better way of passing parameters: through a file).
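For illustration only, a sketch of reading such parameters from a file, assuming a Java `.properties` format and made-up key names (the actual Groundhog file format may differ):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class ParamsFile {
    public static void main(String[] args) throws IOException {
        Properties params = new Properties();
        // File name and key names below are assumptions, not Groundhog's format.
        try (FileInputStream in = new FileInputStream("groundhog.properties")) {
            params.load(in);
        }
        String token = params.getProperty("github.oauth.token");          // may be null
        int max = Integer.parseInt(params.getProperty("max.projects", "100"));
        System.out.println("token set: " + (token != null) + ", max projects: " + max);
    }
}
```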

rodrigoalvesvieira commented 11 years ago

Wow! Great work, @dnr2!

fernandocastor commented 11 years ago

Good work. What kind of search can we currently perform?