madnight / githut

Github Language Statistics
https://madnight.github.io/githut
GNU Affero General Public License v3.0
961 stars, 125 forks

More languages #1

Closed blegat closed 7 years ago

blegat commented 7 years ago

It is great that you revived the GitHut project! Would it be possible to show more languages than the top 10 on the GitHut website? It is good to have one clean graph with only the top 10, but it would be nice to also have the other languages below it.

madnight commented 7 years ago

That's true. Maybe a ranking table for the top 30 languages?

blegat commented 7 years ago

I am personally interested in Julia. It was 43rd in GitHut 1.0, so it might be in the top 30 now. What is the reason to restrict it to the top 30 and not, e.g., the top 50?

madnight commented 7 years ago

Top 50 would also be okay. The reason I thought of 30 is that it's less to scroll : ) and maybe we should be a little bit selective. For example, I would say that TeX and CSS are not programming languages.

madnight commented 7 years ago

If you look at TIOBE, RedMonk and PYPL, they include only programming languages in their rankings.

blegat commented 7 years ago

Yes, it definitely makes sense to exclude HTML, TeX, CSS, reStructuredText, Markdown, ...

blegat commented 7 years ago

How did you determine this list: return _.includes(['C++', 'C', 'Objective-C', 'Ruby', 'Java', 'JavaScript', 'Go', 'PHP', 'Python', 'Shell'], name)?

madnight commented 7 years ago

I chose them manually :innocent: Yes, it would be better to determine the top 10 languages automatically (based on the total number of pull requests).

madnight commented 7 years ago

Okay, I fixed that. The top languages are now generated from the total sum of pull requests: https://github.com/madnight/githut/commit/8cf20a067cb2b4a864d3b1b0dcffbf0639835ea4 By the way, my manual guess wasn't so bad; the automatic list only replaced Objective-C with C# :grimacing:
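A minimal sketch of the automatic selection described above. This is not the code from the linked commit; it only illustrates summing pull requests per language and keeping the N largest totals. The record fields (`name`, `period`, `count`), the function name `topLanguages`, and all numbers are made up for illustration.

```javascript
// Hypothetical input: one record per language and time period (numbers are invented).
const pullRequests = [
  { name: 'JavaScript', period: '2016/Q1', count: 500 },
  { name: 'JavaScript', period: '2016/Q2', count: 450 },
  { name: 'Python', period: '2016/Q1', count: 300 },
  { name: 'Python', period: '2016/Q2', count: 350 },
  { name: 'C#', period: '2016/Q1', count: 100 },
  { name: 'C#', period: '2016/Q2', count: 120 },
  { name: 'Objective-C', period: '2016/Q1', count: 90 },
  { name: 'Objective-C', period: '2016/Q2', count: 80 },
];

// Sum the counts per language, then return the names of the n largest totals.
function topLanguages(records, n) {
  const totals = new Map();
  for (const { name, count } of records) {
    totals.set(name, (totals.get(name) || 0) + count);
  }
  return [...totals.entries()]
    .sort((a, b) => b[1] - a[1]) // largest total first
    .slice(0, n)
    .map(([name]) => name);
}

const top = topLanguages(pullRequests, 3);

// The hard-coded list check quoted earlier in the thread could then use the computed list.
const isTopLanguage = (name) => top.includes(name);
```

With this shape, changing the cutoff from the top 10 to the top 50 only means changing `n`.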

blegat commented 7 years ago

That's great. What do we do for the next 40? A graph of the top 50 without the top 10?

madnight commented 7 years ago

I think putting 40 languages into one graph would result in a very messy one, and adding 40 separate graphs would be overkill. So I would prefer a TIOBE-like table for the top 50, and I'm sure Julia will show up : )

madnight commented 7 years ago

So as you can see there is now a ranking table on the page:

First, do you agree with the filtered languages, and is there any language left that should be filtered?

Second, there is a problem now: the graph data and the table are not in sync, so they show different results; for example, look at bash. This is because the graph is generated from the GitHub Archive pull request data, while the ranking table is based on a Google BigQuery December snapshot of the actual languages per repository.

blegat commented 7 years ago

Yes, the table is great :) I am looking at Google BigQuery right now, and the preview of the "languages" table shows that for each repo there is an entry for each language with the number of bytes. Currently, even if only one byte of the repo is written in some language, this counts as one repo. That may explain the high score of Shell :-P

madnight commented 7 years ago

Hmm, yes, that is a good point. Any ideas on how to make this statistically reasonable? Maybe we should also take the byte count into account when calculating the ranking table?

blegat commented 7 years ago

Yes, we might, but then the more verbose languages might get an advantage :-P We could also show both columns (number of repos and number of bytes).
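The two-column idea can be sketched as follows, assuming each row lists one repository with its languages and byte counts, as described for the "languages" table preview. The exact field names (`repo`, `languages`, `name`, `bytes`), the function name `languageStats`, and the sample numbers are assumptions for illustration, not the real table schema.

```javascript
// Hypothetical rows: one per repository, each with its languages and byte counts.
const repos = [
  { repo: 'a/one', languages: [{ name: 'Python', bytes: 12000 }, { name: 'Shell', bytes: 40 }] },
  { repo: 'b/two', languages: [{ name: 'Java', bytes: 90000 }, { name: 'Shell', bytes: 15 }] },
  { repo: 'c/three', languages: [{ name: 'Shell', bytes: 3000 }] },
];

// For every language, count the repos it appears in and sum its bytes.
function languageStats(rows) {
  const stats = new Map();
  for (const { languages } of rows) {
    for (const { name, bytes } of languages) {
      const entry = stats.get(name) || { name, repos: 0, bytes: 0 };
      entry.repos += 1; // a single byte is enough to count the repo
      entry.bytes += bytes;
      stats.set(name, entry);
    }
  }
  // Sorted by repo count here; sorting by bytes gives the other ranking.
  return [...stats.values()].sort((a, b) => b.repos - a.repos);
}

const stats = languageStats(repos);
```

In this made-up sample, Shell leads by repo count while contributing far fewer bytes than Java, which is the effect discussed above.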

blegat commented 7 years ago

How did you obtain the data for the graph?

madnight commented 7 years ago

Get language top list from Github: https://github.com/madnight/githut/blob/master/data/README.md

blegat commented 7 years ago

I meant, how did you obtain the GitHub Archive pull request data you use for the top 10 graph?

madnight commented 7 years ago

Oh, I made that very complicated. I did this with a shell script, https://github.com/madnight/githut/blob/master/data/provision.sh, and some manual processing on the shell with jq (https://stedolan.github.io/jq/). It took me more than a day to download all the GitHub Archive data and process it into one small static JSON file that I can use (don't try this at home). I did this because I had never heard of Google BigQuery before; now I would try to write a BigQuery SQL statement for that. Basically, what you need to do is collect all the "payload.pull_request.base.repo.language" values from GitHub Archive tables like https://bigquery.cloud.google.com/table/githubarchive:year.2015?tab=preview
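As a rough illustration of the collection step, the sketch below counts the "payload.pull_request.base.repo.language" values in a few event records. It is not the provision.sh script or a BigQuery statement. It assumes the events arrive as newline-delimited JSON with a `type` field set to `PullRequestEvent` for pull requests; the sample lines and the function name `countPullRequestLanguages` are made up.

```javascript
// Made-up sample events, one JSON object per line.
const sample = [
  '{"type":"PullRequestEvent","payload":{"pull_request":{"base":{"repo":{"language":"Julia"}}}}}',
  '{"type":"PushEvent","payload":{}}',
  '{"type":"PullRequestEvent","payload":{"pull_request":{"base":{"repo":{"language":"Python"}}}}}',
  '{"type":"PullRequestEvent","payload":{"pull_request":{"base":{"repo":{"language":null}}}}}',
  '{"type":"PullRequestEvent","payload":{"pull_request":{"base":{"repo":{"language":"Julia"}}}}}',
].join('\n');

// Count pull request events per base repository language.
function countPullRequestLanguages(ndjson) {
  const counts = {};
  for (const line of ndjson.split('\n')) {
    if (line.trim() === '') continue; // skip blank lines
    const event = JSON.parse(line);
    if (event.type !== 'PullRequestEvent') continue; // assumed event type name
    const language = event.payload?.pull_request?.base?.repo?.language;
    if (!language) continue; // repos without a detected language are skipped
    counts[language] = (counts[language] || 0) + 1;
  }
  return counts;
}

const counts = countPullRequestLanguages(sample);
```

For a full archive the same counting would have to run over many files, which is why a single aggregated query is the more practical route mentioned above.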

blegat commented 7 years ago

By comparing the languages in GitHut 1.0 and those in GitHut 2.0, here are the languages that were added to the top 50 in 2.0:

madnight commented 7 years ago

Okay, I think the only way to get rid of the differences between the graph and the ranking is to generate them from the same dataset. Therefore the ranking is now based on pull requests too, with up-to-date data from 2016/11. I think the number of pull requests is also a better indicator of popularity than file count.
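A hedged sketch of what a pull-request-based ranking table for one month might look like, including the filtering of non-programming languages discussed earlier in the thread. The counts, the function name `rankingTable`, and the `share` column are assumptions for illustration; the exclusion list only contains the examples named in the thread.

```javascript
// Made-up pull request counts for a single month.
const monthlyCounts = { JavaScript: 400, Python: 300, CSS: 200, Julia: 100, TeX: 50 };

// Entries treated as non-programming languages, following the examples in the thread.
const EXCLUDED = new Set(['HTML', 'TeX', 'CSS', 'reStructuredText', 'Markdown']);

// Build ranked rows with each language's share of the remaining pull requests.
function rankingTable(counts, limit) {
  const rows = Object.entries(counts).filter(([name]) => !EXCLUDED.has(name));
  const total = rows.reduce((sum, [, count]) => sum + count, 0);
  return rows
    .sort((a, b) => b[1] - a[1]) // most pull requests first
    .slice(0, limit)
    .map(([name, count], index) => ({
      rank: index + 1,
      name,
      count,
      share: total === 0 ? 0 : count / total,
    }));
}

const table = rankingTable(monthlyCounts, 50);
```

Because the graph and the table would both read from the same per-language counts, they cannot disagree about a language's position.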