Open aronlindberg opened 11 years ago
Getting this kind of data shouldn't be a problem.
I can count all ForkEvents for each repository in the 2011 data and then sort the repositories by number of events.
I think I can also put a Top 100 ranking on the matrobot.com site just for fun, so I can use this kind of data too.
Just to be sure: I'll count only ForkEvents from 2011 (or do you want 2012 too?). I could still miss a highly forked repository if it was forked only in 2010 (or up to 11 Feb 2011). Will that work for you?
Regards, Krzysztof
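The counting described in this reply (ForkEvents per repository, sorted by count) could be sketched roughly as below. This is only an illustration, assuming the archive is read as one JSON event per line and that each event carries "type" and a "repo" object with a "name" field, as in the 2011 sample records posted in this thread. The function name `top_forked` and the sample data are made up. As noted in the reply, it only sees forks that happened inside the archived period.

```python
import json
from collections import Counter


def top_forked(lines, n=100):
    """Return the n repositories with the most ForkEvents in `lines`.

    `lines` is any iterable of JSON strings, one event per string.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("type") != "ForkEvent":
            continue
        # Assumption: the repository name lives under event["repo"]["name"].
        name = event.get("repo", {}).get("name")
        if name:
            counts[name] += 1
    return counts.most_common(n)


# Made-up events, for illustration only.
sample = [
    '{"type": "ForkEvent", "repo": {"name": "a/x"}}',
    '{"type": "ForkEvent", "repo": {"name": "b/y"}}',
    '{"type": "PushEvent", "repo": {"name": "a/x"}}',
    '{"type": "ForkEvent", "repo": {"name": "a/x"}}',
]
print(top_forked(sample, n=2))  # [('a/x', 2), ('b/y', 1)]
```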
For me the idea is to 1) identify the 100 largest repos by forks on March 1st 2011, 2) follow how they change on various metrics after that. I think in the githubarchive data you have a count of absolute number of forks, so you could just grab the top 100 for 2012-03-01.
Hmm. That would be a problem (I'll have to check this one). If you need the top 100 as of March 2011, there is no earlier data I can count events from.
I could count ForkEvents from March 1st to 31st 2011 and select repositories based on that, but the data has no information about past forks.
I haven't tried the GitHub API yet, but it doesn't look like this is possible.
Regards, Krzysztof
At least for 2012, "repository_forks" is available in the data. Is this part of the JSON not available for 2011? Just counting forks from March won't give us a true picture of the "largest repos on GitHub".
Fortunately no. This was added in the middle of 2012.
Here are two example records from the 2011 data:
{ "repo": { "id": 991048, "url": "https://api.github.dev/repos/cbeer/blacklight_user_generated_content", "name": "cbeer/blacklight_user_generated_content" }, "type": "PushEvent", "public": true, "created_at": "2011-03-01T00:00:00Z", "payload": { "shas": [ [ "b6b9e0729643f8debf9dbd9ec3dfcab1a6b8ffcb", "chris@cbeer.info", "document filter for comments index", "Chris Beer" ], [ "8ca5cf297e3b5bbdd19654ca289ede782a82ac1f", "chris@cbeer.info", "fix up comments @document instance variable", "Chris Beer" ] ], "repo": "cbeer/blacklight_user_generated_content", "actor": "cbeer", "ref": "refs/heads/master", "size": 2, "head": "8ca5cf297e3b5bbdd19654ca289ede782a82ac1f", "actor_gravatar": "604d4106c02e6e5525a7768c2f398baa", "push_id": 25733409 }, "actor": { "gravatar_id": "604d4106c02e6e5525a7768c2f398baa", "id": 111218, "url": "https://api.github.dev/users/cbeer", "avatar_url": "https://secure.gravatar.com/avatar/604d4106c02e6e5525a7768c2f398baa?d=http://github.dev%2Fimages%2Fgravatars%2Fgravatar-user-420.png", "login": "cbeer" }, "id": "1158443437" }
{ "repo": { "id": 1276447, "url": "https://api.github.dev/repos/shairontoledo/json-framework", "name": "shairontoledo/json-framework" }, "type": "WatchEvent", "public": true, "created_at": "2011-03-01T00:00:02Z", "payload": { "repo": "shairontoledo/json-framework", "actor": "vianaweb", "actor_gravatar": "fdfc78b2d3f3e22bf5f810abf9a15987", "action": "started" }, "actor": { "gravatar_id": "fdfc78b2d3f3e22bf5f810abf9a15987", "id": 15512, "url": "https://api.github.dev/users/vianaweb", "avatar_url": "https://secure.gravatar.com/avatar/fdfc78b2d3f3e22bf5f810abf9a15987?d=http://github.dev%2Fimages%2Fgravatars%2Fgravatar-user-420.png", "login": "vianaweb" }, "id": "1158443446" }
Regards, Krzysztof
I mean unfortunately no. (typo :-) )
Regards, Krzysztof
OK. Here is a different way it might be done: At the end of the day what we are interested in are the projects with the largest number of people involved. How about we simply select the projects with the largest number of PushEvents in March 2011? That might be a starting point.
It could work. Maybe we could even count all events? We can also check on 2012 data (May) whether there is a correlation between the number of forks (which should be available for that month) and push events. If there is, we can assume it holds for 2011 as well.
I'll also check tomorrow whether I can get this data from the GitHub API.
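The proposed sanity check (do fork counts and push-event counts move together in a month where both are available?) might look something like the sketch below. It assumes the two per-repository counts have already been extracted into plain dictionaries. The helper names `pearson` and `forks_vs_pushes` and all numbers are hypothetical. Plain Pearson correlation is used here only as one possible choice; with heavily skewed counts a rank-based measure may be more appropriate.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def forks_vs_pushes(forks, pushes):
    """Correlate fork and push counts over repositories present in both dicts."""
    repos = sorted(set(forks) & set(pushes))
    return pearson([forks[r] for r in repos], [pushes[r] for r in repos])


# Made-up counts, for illustration only.
forks = {"a/x": 10, "b/y": 20, "c/z": 30}
pushes = {"a/x": 100, "b/y": 200, "c/z": 300, "d/w": 5}
r = forks_vs_pushes(forks, pushes)
print(round(r, 3))
```

A value near 1 on the 2012 data would support using push events as a stand-in for forks in 2011; a weak value would suggest the proxy is not reliable.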
I checked the number of PushEvents, and it is problematic, since you get things like twitter-loggers that create a PushEvent for every tweet: not very interesting, even though it's a huge number of PushEvents.
Here is an alternative data source: http://code.google.com/p/flossmole/downloads/list?can=2&q=github&colspec=Filename+Summary+Uploaded+ReleaseDate+Size+DownloadCount
I think it has counts of forks and watchers, so this could be a start for us.
Yes, there are counts for forks, but the data consists of database snapshots from Jun 2011 and Sep 2010. So getting data for March won't be easy: it would require taking the fork counts for June and then subtracting back to March.
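As a rough illustration of "subtracting back": given fork totals from the June snapshot and the number of ForkEvents counted per repository between March and June, an estimate for March is the difference. The function name and the numbers are hypothetical, and clamping at zero is an assumption added here to guard against mismatches between the two sources.

```python
def forks_at_earlier_date(snapshot_forks, fork_events_since):
    """Estimate fork counts at an earlier date.

    snapshot_forks: repo -> total forks at the later snapshot.
    fork_events_since: repo -> ForkEvents between the earlier date and the snapshot.
    """
    estimate = {}
    for repo, total in snapshot_forks.items():
        earlier = total - fork_events_since.get(repo, 0)
        # Assumption: never report a negative count if the sources disagree.
        estimate[repo] = max(earlier, 0)
    return estimate


# Made-up numbers, for illustration only.
june = {"a/x": 500, "b/y": 120, "c/z": 3}
march_to_june = {"a/x": 80, "c/z": 7}
print(forks_at_earlier_date(june, march_to_june))
```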
Regarding the tweet pushes: it is possible to filter out outliers, let's say projects with more than 5K push events per month.
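A minimal sketch of that outlier filter, assuming monthly push counts per repository are already available as a dictionary. The 5000 threshold is the figure suggested above; the function name and sample data are made up.

```python
def drop_push_outliers(push_counts, max_pushes=5000):
    """Keep only repositories with at most `max_pushes` push events in the month."""
    return {repo: n for repo, n in push_counts.items() if n <= max_pushes}


# Made-up monthly counts, for illustration only.
march_pushes = {"bot/twitter-log": 41000, "a/x": 950, "b/y": 5000, "c/z": 12}
kept = drop_push_outliers(march_pushes)

# Rank what remains by push count, highest first.
ranking = sorted(kept.items(), key=lambda item: item[1], reverse=True)
print(ranking)
```

A fixed cutoff is crude: it removes automated loggers but could also drop a genuinely busy project, so the threshold would need checking against the actual distribution.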
I think it might be best to start with the June 2011 data. We would lose 3 months worth of data (March-June), but on the other hand we would get an unbiased estimate of "the largest repositories".
Give me a few days to learn the GitHub API. If I don't find anything interesting there, we can start from the data you found, OK?
Cool, take as much time as you need! =) I won't actually need any of this data for at least 6 weeks.
I found out how to get the 100 largest repos by forks here:
http://stackoverflow.com/questions/13745285/how-to-find-the-100-largest-github-repositories-for-a-past-date
However, that solution does not work on 2011 data. Is there a way this can be extracted?