klangner / github-analysis

Analyze GitHub activity.
www.matrobot.com
MIT License

How to find 100 largest repos for a past date? #2

Open aronlindberg opened 11 years ago

aronlindberg commented 11 years ago

I found out how to get the 100 largest repos by forks here:

http://stackoverflow.com/questions/13745285/how-to-find-the-100-largest-github-repositories-for-a-past-date

However, that solution does not work on 2011 data. Is there a way this can be extracted?

klangner commented 11 years ago

Getting this kind of data shouldn't be a problem.

I can count all ForkEvents for each repository in 2011 data and then sort repositories by number of events.

I could also publish a Top 100 ranking on the matrobot.com site just for fun, so this kind of data is useful to me too.

Just to be sure: I'll count only ForkEvents from 2011 (or do you want 2012 too?). I could still miss a highly forked repository if it was forked only in 2010 (or any time up to 11 Feb 2011). Will that work for you?
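
The counting step described above could be sketched roughly as follows. This is only an illustration, not the project's actual code: it assumes the archive data is available as one JSON object per line, and that ForkEvent records carry the same `type` and `repo.name` fields as the 2011 sample record quoted later in this thread. The function name `top_forked` and the sample lines are made up.

```python
import json
from collections import Counter


def top_forked(lines, n=100):
    """Count ForkEvents per repository and return the n most forked.

    Assumes each non-empty line is one JSON event with a "type" field
    and a "repo" object containing "name" (hypothetical input layout).
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("type") == "ForkEvent":
            counts[event["repo"]["name"]] += 1
    # most_common sorts by count, highest first
    return counts.most_common(n)


# Made-up sample events, only to show the call
sample_lines = [
    '{"type": "ForkEvent", "repo": {"name": "a/x"}}',
    '{"type": "PushEvent", "repo": {"name": "c/z"}}',
    '{"type": "ForkEvent", "repo": {"name": "b/y"}}',
    '{"type": "ForkEvent", "repo": {"name": "a/x"}}',
]
print(top_forked(sample_lines, n=2))
```

As noted in the comment, this only ranks repositories by forks observed inside the archive window, not by their total fork count.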

aronlindberg commented 11 years ago

For me the idea is to 1) identify the 100 largest repos by forks on March 1st 2011, 2) follow how they change on various metrics after that. I think in the githubarchive data you have a count of absolute number of forks, so you could just grab the top 100 for 2012-03-01.

klangner commented 11 years ago

Hmm, that would be a problem (I'll have to check this one). If you need the top 100 as of March 2011, there is no earlier data I can count events from.

I could count ForkEvents from March 1st to 31st 2011 and select repositories based on that, but the data has no information about past forks.

I haven't tried the GitHub API yet, but it doesn't look like it is possible there.

aronlindberg commented 11 years ago

At least for 2012 there are "repository_forks" available as data. Is this part of the JSON not available for 2011? Just counting forks from March won't give us a true picture of the "largest repos on GitHub".

klangner commented 11 years ago

Fortunately no. This was added in the middle of 2012.

Here are two example records from the 2011 data:

{ "repo": { "id": 991048, "url": "https://api.github.dev/repos/cbeer/blacklight_user_generated_content", "name": "cbeer/blacklight_user_generated_content" }, "type": "PushEvent", "public": true, "created_at": "2011-03-01T00:00:00Z", "payload": { "shas": [ [ "b6b9e0729643f8debf9dbd9ec3dfcab1a6b8ffcb", "chris@cbeer.info", "document filter for comments index", "Chris Beer" ], [ "8ca5cf297e3b5bbdd19654ca289ede782a82ac1f", "chris@cbeer.info", "fix up comments @document instance variable", "Chris Beer" ] ], "repo": "cbeer/blacklight_user_generated_content", "actor": "cbeer", "ref": "refs/heads/master", "size": 2, "head": "8ca5cf297e3b5bbdd19654ca289ede782a82ac1f", "actor_gravatar": "604d4106c02e6e5525a7768c2f398baa", "push_id": 25733409 }, "actor": { "gravatar_id": "604d4106c02e6e5525a7768c2f398baa", "id": 111218, "url": "https://api.github.dev/users/cbeer", "avatar_url": "https://secure.gravatar.com/avatar/604d4106c02e6e5525a7768c2f398baa?d=http://github.dev%2Fimages%2Fgravatars%2Fgravatar-user-420.png", "login": "cbeer" }, "id": "1158443437" }

{ "repo": { "id": 1276447, "url": "https://api.github.dev/repos/shairontoledo/json-framework", "name": "shairontoledo/json-framework" }, "type": "WatchEvent", "public": true, "created_at": "2011-03-01T00:00:02Z", "payload": { "repo": "shairontoledo/json-framework", "actor": "vianaweb", "actor_gravatar": "fdfc78b2d3f3e22bf5f810abf9a15987", "action": "started" }, "actor": { "gravatar_id": "fdfc78b2d3f3e22bf5f810abf9a15987", "id": 15512, "url": "https://api.github.dev/users/vianaweb", "avatar_url": "https://secure.gravatar.com/avatar/fdfc78b2d3f3e22bf5f810abf9a15987?d=http://github.dev%2Fimages%2Fgravatars%2Fgravatar-user-420.png", "login": "vianaweb" }, "id": "1158443446" }

klangner commented 11 years ago

I mean unfortunately no. (typo :-) )

aronlindberg commented 11 years ago

OK. Here is a different way it might be done: At the end of the day what we are interested in are the projects with the largest number of people involved. How about we simply select the projects with the largest number of PushEvents in March 2011? That might be a starting point.

klangner commented 11 years ago

It could work. Maybe even select all events? We can also check on data from May 2012 whether the number of forks (which should be available for that month) correlates with the number of push events. If it does, we can assume the approach will also work for 2011.

I'll also check tomorrow whether I can get this data from the GitHub API.
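
A minimal sketch of the correlation check proposed above, assuming per-repository fork counts and push-event counts for the same month have already been extracted into two dictionaries. The numbers are invented for illustration, and plain Pearson correlation is used here only as one possible measure; the thread does not say which statistic was intended.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-repository counts for one month (made-up values)
forks = {"a/x": 120, "b/y": 40, "c/z": 5}
pushes = {"a/x": 300, "b/y": 90, "c/z": 20}

# Compare only repositories present in both tables
shared = sorted(set(forks) & set(pushes))
r = pearson([forks[k] for k in shared], [pushes[k] for k in shared])
print(f"correlation over {len(shared)} repos: {r:.3f}")
```

Because event counts tend to be heavily skewed, a rank-based measure may be a better fit than Pearson; that choice is left open here.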

aronlindberg commented 11 years ago

I checked the number of PushEvents, and it is problematic: you get things like Twitter loggers that create a PushEvent for every tweet. That is a huge number of PushEvents, but not very interesting.

aronlindberg commented 11 years ago

Here is an alternative data source: http://code.google.com/p/flossmole/downloads/list?can=2&q=github&colspec=Filename+Summary+Uploaded+ReleaseDate+Size+DownloadCount

I think it has counts of forks and watchers, so this could be a start for us.

klangner commented 11 years ago

Yes, there are counts for forks, but the data consists of database snapshots from June 2011 and September 2010, so getting data for March won't be easy. It would require taking the fork counts for June and then subtracting the forks made since March.
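
The back-subtraction described above might look like the sketch below. It assumes two inputs that are not defined anywhere in this thread: a mapping of repository name to fork count taken from the June 2011 snapshot, and a mapping of repository name to the number of ForkEvents observed between March and June. The function name and all values are hypothetical.

```python
def estimate_earlier_forks(snapshot_forks, forks_since):
    """Estimate fork counts at an earlier date.

    snapshot_forks: repo -> total forks at the snapshot date (e.g. June 2011)
    forks_since:    repo -> ForkEvents counted between the earlier date and
                    the snapshot date
    The estimate is clamped at zero in case the two sources disagree.
    """
    return {
        repo: max(total - forks_since.get(repo, 0), 0)
        for repo, total in snapshot_forks.items()
    }


# Made-up numbers for illustration only
june_forks = {"a/x": 500, "b/y": 80, "c/z": 10}
march_to_june_events = {"a/x": 120, "c/z": 15}

march_estimate = estimate_earlier_forks(june_forks, march_to_june_events)
top = sorted(march_estimate.items(), key=lambda item: item[1], reverse=True)
print(top)
```

The estimate is only as good as the match between the snapshot and the event stream, for example it ignores forks that were deleted in between.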

klangner commented 11 years ago

Regarding tweet pushes: it is possible to filter out outliers, let's say projects with more than 5K push events per month.
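
As a rough sketch of that filter, assuming monthly push-event counts per repository are already in a dictionary (the repository names and counts are made up, and the 5,000 threshold is just the figure suggested above):

```python
def drop_push_outliers(push_counts, max_pushes=5000):
    """Keep only repositories with at most max_pushes push events per month."""
    return {
        repo: count
        for repo, count in push_counts.items()
        if count <= max_pushes
    }


# Hypothetical monthly PushEvent counts
monthly_pushes = {"bot/tweet-log": 43000, "a/x": 310, "b/y": 5000}
print(drop_push_outliers(monthly_pushes))
```

A fixed cutoff is simple but arbitrary; it would also drop any genuinely busy project above the threshold.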

aronlindberg commented 11 years ago

I think it might be best to start with the June 2011 data. We would lose 3 months' worth of data (March-June), but on the other hand we would get an unbiased estimate of "the largest repositories".

klangner commented 11 years ago

Give me a few days to learn the GitHub API. If I don't find anything interesting there, then we can start from the data you found, OK?

aronlindberg commented 11 years ago

Cool, take as much time as you need! =) I won't actually need any of this data for at least 6 weeks.