madnight / githut

Github Language Statistics
https://madnight.github.io/githut
GNU Affero General Public License v3.0

Counting extensions and file sizes of project folders on GitHub to recommend changes #48

Closed by RichardKCollins 3 years ago

RichardKCollins commented 3 years ago

Over the past several years I have been studying groups on GitHub, as part of the Internet Foundation's studies of global communities on the Internet over the past 23 years.

Much of the human cost of learning these various projects, for any group size, comes from dealing with the many different formats. A mature project like LLVM-Project (see attached) has what I see as the normal human capabilities curve: the file sizes are log-normal with a peak around 1000 characters, followed by a long tail of exceptions for file types that are seldom used but often critical.

I see you are using a database that I have not seen or tried yet. I have been downloading source code folders to my computer (I am also reading, parsing and analyzing the contents of the files). I would like to put these kinds of tools and analyses where others can try them. I mostly use JavaScript (with a localhost for access to hardware and file system services).

Would it be too much to get a directory of all files in all projects on GitHub so I can analyze the maturity and character of them all? I can tell a lot about the learning curve and costs involved for a project just by looking at the source code folder and repositories.
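For readers who want to try this kind of folder profile on a local checkout, here is a minimal sketch in Python. It is not the tool described in this post: the function name `profile_tree`, the choice to skip `.git` directories, and the default of ten bins per power of ten (a guess based on the "Log10ths" attachment name) are all assumptions. It also measures file sizes in bytes rather than characters.

```python
import math
import os
from collections import Counter


def profile_tree(root, bins_per_decade=10):
    """Count file extensions and bin file sizes on a log10 scale.

    Hypothetical helper: returns (ext_counts, size_bins), where size_bins
    maps the lower edge of each log10(size in bytes) bin to a file count.
    """
    ext_counts = Counter()
    size_bins = Counter()
    for dirpath, dirnames, filenames in os.walk(root):
        # Assumption: version-control metadata is not part of the source tree.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable or vanished file
            # Files without an extension (e.g. "Makefile") get their own bucket.
            ext = os.path.splitext(name)[1].lower() or "(none)"
            ext_counts[ext] += 1
            if size > 0:  # log10 is undefined for empty files
                bin_index = math.floor(math.log10(size) * bins_per_decade)
                size_bins[bin_index / bins_per_decade] += 1
    return ext_counts, size_bins
```

Calling `profile_tree("llvm-project")` on a local clone would return two counters: `ext_counts.most_common()` lists the extensions from most to least frequent, which exposes the long tail, and sorting `size_bins` by key gives the size histogram to compare against a log-normal shape.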

I want to recommend different practices for GitHub and these kinds of projects on the Internet. The current (global) practices are wasting too much human time and delaying the response to things like "covid", "global climate change", "deforestation", "online education" and others. I have investigated about 20,000 global communities to see why they stall or die, or simply take years to do something that can be done in a few days with the proper tools.

I talked about some of the related issues in a video I made yesterday and mentioned where this kind of analysis might fit into the larger picture.

New Video: Energy Office of Science, PNNL Article, Climate Model, Sharing https://theinternetfoundation.net/?p=1347

Richard K Collins, Director, The Internet Foundation

Counts of Extensions and Log FileSizes for LLVM-Project
Counts Plots of Sourcecode folder extensions and slzes Exts Log10ths.xlsx

madnight commented 3 years ago

Hi @RichardKCollins,

This is unfortunately completely out of scope for this project, which focuses solely on programming language popularity among GitHub users.

I would like to refer you to https://www.gharchive.org and the public GitHub dataset on Google BigQuery for your analysis.
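As a rough illustration of what working with GH Archive data can look like, the sketch below tallies event types and repository names from one hourly dump. GH Archive publishes each hour as a gzip-compressed file with one JSON event per line, at URLs of the form `https://data.gharchive.org/2015-01-01-15.json.gz`. The helper names `count_events` and `archive_url` are hypothetical, and the code assumes the `type` and `repo.name` fields found in events from 2015 onward. Note that these files record activity (pushes, issues, forks), not the list of files in each repository.

```python
import gzip
import io
import json
from collections import Counter


def archive_url(date, hour):
    """Build the URL of one hourly dump, e.g. archive_url("2015-01-01", 15)."""
    return f"https://data.gharchive.org/{date}-{hour}.json.gz"


def count_events(gz_bytes):
    """Tally event types and repositories in one gzip-compressed hourly dump.

    Hypothetical helper: takes the raw bytes of a .json.gz file and
    returns (type_counts, repo_counts).
    """
    type_counts = Counter()
    repo_counts = Counter()
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if not line:
                continue
            event = json.loads(line)
            type_counts[event.get("type", "(unknown)")] += 1
            # Assumption: the repository is stored as {"repo": {"name": ...}}.
            repo = (event.get("repo") or {}).get("name")
            if repo:
                repo_counts[repo] += 1
    return type_counts, repo_counts
```

To run it on real data, one could fetch the bytes with `urllib.request.urlopen(archive_url("2015-01-01", 15)).read()` and pass them to `count_events`. Keeping the download separate from the parsing means the parser can be checked on a small local file first.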

RichardKCollins commented 3 years ago

Fabian, thanks for the blazing fast reply!! I knew there was probably a database somewhere, but did not know the magic words to search for. You saved me lots of time. I have been parsing many of the most used languages on the Internet because large communities get stuck when they work on the same global problems but use different computer languages. I am trying to estimate the total cost for critical projects like "covid". Have you tried to profile the experience level of users for different languages? Or the total time each one spends using the different languages? I can look for these things. I am just curious if you have related interests.

Richard