Closed blegat closed 7 years ago
That's true, maybe a ranking table for the top 30 languages.
I am personally interested in Julia. It was 43rd in GitHut 1.0, so it might be in the top 30 now. What is the reason to restrict it to the top 30 and not, e.g., the top 50?
Top 50 would also be okay. The reason I suggested 30 is that it's less to scroll : ) and maybe we should be a little selective; for example, I would say that TeX and CSS are not programming languages.
If you look at TIOBE, RedMonk and PYPL, they include only programming languages in their rankings.
Yes, it definitely makes sense to exclude HTML, TeX, CSS, reStructuredText, Markdown, ...
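A minimal sketch of such an exclusion filter, in plain JavaScript. The exclusion list only contains the names mentioned above, and the `{ name, count }` row shape and the function name are assumptions for illustration, not GitHut's actual data format.

```javascript
// Hypothetical exclusion list: only the names mentioned in this thread.
const NON_PROGRAMMING = new Set(['HTML', 'TeX', 'CSS', 'reStructuredText', 'Markdown']);

// Keep only the rows whose language name is not on the exclusion list.
function onlyProgrammingLanguages(rows) {
  return rows.filter(row => !NON_PROGRAMMING.has(row.name));
}

// Made-up sample rows, purely for illustration.
const sample = [
  { name: 'Python', count: 120 },
  { name: 'CSS', count: 300 },
  { name: 'Julia', count: 15 },
];

console.log(onlyProgrammingLanguages(sample).map(row => row.name));
```

Keeping the list in one `Set` makes it easy to review and extend when someone disagrees about what counts as a programming language.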
How did you determine this list: return _.includes(['C++', 'C', 'Objective-C', 'Ruby', 'Java', 'JavaScript', 'Go', 'PHP', 'Python', 'Shell' ], name)?
I chose them manually :innocent: Yes, it would be better to determine the top 10 languages automatically (based on the total number of pull requests).
Okay, I fixed that; the top languages are now generated from the sum of total pull requests: https://github.com/madnight/githut/commit/8cf20a067cb2b4a864d3b1b0dcffbf0639835ea4 By the way, my manual guess wasn't so bad: the automatic list only replaced Objective-C with C# :grimacing:
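For readers who want the idea without opening the commit, here is a rough sketch of deriving the top languages from pull request totals. It is not the code from the linked commit; the row shape (`name`, `year`, `count`), the function names and the numbers are all made up for illustration.

```javascript
// Sum the pull request counts per language and return the n largest names.
function topLanguages(rows, n = 10) {
  const totals = new Map();
  for (const { name, count } of rows) {
    totals.set(name, (totals.get(name) || 0) + count);
  }
  return [...totals.entries()]
    .sort((a, b) => b[1] - a[1]) // descending by total
    .slice(0, n)
    .map(([name]) => name);
}

// Made-up sample data: one row per language and year.
const rows = [
  { name: 'JavaScript', year: 2015, count: 500 },
  { name: 'JavaScript', year: 2016, count: 700 },
  { name: 'C#', year: 2015, count: 90 },
  { name: 'C#', year: 2016, count: 110 },
  { name: 'Objective-C', year: 2015, count: 100 },
  { name: 'Objective-C', year: 2016, count: 80 },
];

const top = topLanguages(rows, 2);

// Drop-in replacement for the hard-coded list check.
const isTopLanguage = name => top.includes(name);

console.log(top, isTopLanguage('Objective-C'));
```

The predicate keeps the same shape as the original hard-coded check, so the graph code that calls it would not need to change.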
That's great. What do we do for the next 40? A graph with the top 50 minus the top 10?
I think putting 40 languages into one graph would be very messy, and adding 40 separate graphs would be overkill. So I would prefer a TIOBE-like table for the top 50, and I'm sure Julia will show up : )
So as you can see there is now a ranking table on the page:
First, do you agree with the filtered languages, and are there any languages left that should be filtered?
Second, there is a problem now: the graph data and the table are not in sync, so they show different results; for example, look at bash. This is because the graph is generated from the GitHub Archive pull request data, while the ranking table is based on a Google BigQuery December snapshot of the actual languages per repository.
Yes, the table is great :) I am looking at Google BigQuery right now, and the preview of the "languages" table shows that, for each repo, there is an entry for each language with the number of bytes. Currently, even if only one byte of the repo is written in some language, this counts as one repo. That may explain the high score of Shell :-P
Hmm, yes, that is a good point. Any ideas how to make this statistically reasonable? Maybe we should also take the byte count into account when calculating the ranking table?
Yes, we might, but then the more verbose languages might get an advantage :-P We could also show both columns (number of repos and number of bytes).
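To make the two-column idea concrete, here is a small sketch that computes both numbers per language. The record layout (`repo_name` plus a `language` array of `{ name, bytes }`) is an assumption modelled on the table preview described above, and the sample repos are invented.

```javascript
// For each language, count the repos that contain it and sum its bytes.
function languageStats(repos) {
  const stats = new Map();
  for (const repo of repos) {
    for (const { name, bytes } of repo.language) {
      const entry = stats.get(name) || { name, repos: 0, bytes: 0 };
      entry.repos += 1;     // a single byte is enough to count the repo
      entry.bytes += bytes;
      stats.set(name, entry);
    }
  }
  // Sort by repo count, descending; bytes are reported alongside.
  return [...stats.values()].sort((a, b) => b.repos - a.repos);
}

// Invented sample: Shell appears in both repos, but with very few bytes.
const repos = [
  { repo_name: 'a/one', language: [{ name: 'Julia', bytes: 50000 }, { name: 'Shell', bytes: 1 }] },
  { repo_name: 'b/two', language: [{ name: 'Python', bytes: 80000 }, { name: 'Shell', bytes: 120 }] },
];

console.table(languageStats(repos));
```

In this sample, Shell ranks first by repo count while contributing almost no bytes, which is the effect discussed above; showing both columns lets readers judge for themselves.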
How did you obtain the data for the graph?
Get language top list from Github: https://github.com/madnight/githut/blob/master/data/README.md
I meant, how did you obtain the GitHub Archive pull request data you use for the top 10 graph?
Oh, I made that very complicated. I did it with a shell script, https://github.com/madnight/githut/blob/master/data/provision.sh, plus some manual processing on the shell with jq (https://stedolan.github.io/jq/). It took me more than a day to download all the GitHub Archive data and to process it into one small static JSON file that I can use (don't try this at home). I did it this way because I had never heard of Google BigQuery before; now I would try to write a BigQuery SQL statement for that. Basically, what you need to do is collect all the "payload.pull_request.base.repo.language" values from GitHub Archive tables like https://bigquery.cloud.google.com/table/githubarchive:year.2015?tab=preview
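As a rough illustration of the collection step (not the actual provision.sh logic), the sketch below counts "payload.pull_request.base.repo.language" values in newline-delimited JSON events. It assumes each line is one event object with a `type` field, as in GitHub Archive dumps; the function name and the sample events are invented.

```javascript
// Count pull request events per base-repo language in newline-delimited JSON.
function countPullRequestLanguages(jsonLines) {
  const counts = {};
  for (const line of jsonLines.split('\n')) {
    if (!line.trim()) continue;                       // skip blank lines
    const event = JSON.parse(line);
    if (event.type !== 'PullRequestEvent') continue;  // ignore other event types
    const language = event.payload?.pull_request?.base?.repo?.language;
    if (!language) continue;                          // language can be null or missing
    counts[language] = (counts[language] || 0) + 1;
  }
  return counts;
}

// Invented sample events, trimmed to the fields this sketch reads.
const events = [
  { type: 'PullRequestEvent', payload: { pull_request: { base: { repo: { language: 'Julia' } } } } },
  { type: 'PushEvent', payload: {} },
  { type: 'PullRequestEvent', payload: { pull_request: { base: { repo: { language: null } } } } },
  { type: 'PullRequestEvent', payload: { pull_request: { base: { repo: { language: 'Julia' } } } } },
  { type: 'PullRequestEvent', payload: { pull_request: { base: { repo: { language: 'Go' } } } } },
].map(e => JSON.stringify(e)).join('\n');

console.log(countPullRequestLanguages(events));
```

For the full archive, a streaming reader or a BigQuery query would be more practical than loading everything into one string; this only shows which field is counted.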
By comparing the languages in GitHut 1.0 and those in GitHut 2.0, here are the languages that were added in 2.0 in the top 50:
Okay, I think the only way to get rid of the differences between the graph and the ranking is to generate them from the same dataset. The ranking is therefore now based on pull requests too, with up-to-date data from 2016/11. I think the number of pull requests is also a better indicator of popularity than file count.
It is great that you revived the GitHut project! Would it be possible to show more languages than the top 10 on the GitHut website? It is good to have one clean graph with only the top 10, but it would be nice to also have the other ones below.