madnight / githut

Github Language Statistics
https://madnight.github.io/githut
GNU Affero General Public License v3.0

query.js is biased towards more verbose languages (winner-takes-all) #113

Closed: lemmy closed this issue 7 months ago

lemmy commented 8 months ago

The sub-query selects each GitHub repository's name and identifies the primary programming language used in that repo based on the highest byte count in the repository.

https://github.com/madnight/githut/blob/140f0177224d44d2b81ad541b67fffdea8597260/scripts/query.js#L65-L67

The current query takes a "winner-takes-all" approach, which disadvantages less verbose programming languages by counting only the language with the largest byte count in each repository. A less biased approach would be to list all programming languages present in each repository, rather than just the predominant one. The languages could still be weighted by their respective byte counts so that small snippets are not overemphasized.
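To make the difference concrete, here is a minimal sketch of the two attribution schemes. It is not the code in query.js; the repository names, byte counts, and function names are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Hypothetical sample data: repository -> {language: bytes of code}.
REPOS = {
    "repo-a": {"Java": 9000, "Haskell": 1000},
    "repo-b": {"Java": 6000, "Haskell": 4000},
    "repo-c": {"Haskell": 3000},
}


def winner_takes_all(repos):
    """Credit each repository entirely to its largest language by bytes."""
    counts = defaultdict(float)
    for langs in repos.values():
        if not langs:
            continue  # skip repositories without detected languages
        primary = max(langs, key=langs.get)
        counts[primary] += 1
    return dict(counts)


def byte_weighted(repos):
    """Split each repository's single vote across its languages by byte share."""
    counts = defaultdict(float)
    for langs in repos.values():
        total = sum(langs.values())
        if total == 0:
            continue  # avoid dividing by zero for empty repositories
        for lang, size in langs.items():
            counts[lang] += size / total
    return dict(counts)


wta = winner_takes_all(REPOS)
weighted = byte_weighted(REPOS)
print(wta)
print(weighted)
```

With this sample, winner-takes-all credits Java with two repositories and Haskell with one, while the byte-weighted variant gives both languages roughly 1.5 each, because Haskell's minority shares in repo-a and repo-b are no longer discarded.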

madnight commented 8 months ago

Hi Lemmy, I think you've raised a valid point. Before I implemented the method for querying, I sampled many repositories and looked at how they were structured. You are right; some repositories include JavaScript libraries like jQuery and others next to their code, and some even commit the whole node_modules folder with millions of lines of dependencies. While your approach might be fairer in some cases, I think it doesn't work well in others. If someone decides to commit all their JS dependencies into the repo, the language report will show something like 99.8% JavaScript, 0.1% HTML, 0.1% Python (e.g., if it's a Python project that has HTML templates with some JS in the frontend). Your approach would correctly attribute the 0.1% Python to the language chart ranking, but it would still be massively incorrect in terms of percentage. It's sometimes even worse than that, since node_modules can contain not only .js files but also TypeScript, shell scripts, CoffeeScript, and other languages. All of these would then still rank higher than the small main Python Flask script with 20 lines of Python code that you intended to count correctly.
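A rough illustration of that skew, using made-up byte counts for a small Flask project that commits its node_modules folder (the numbers and the helper name are assumptions, not measurements):

```python
def language_shares(byte_counts):
    """Return each language's percentage of the repository's total bytes."""
    total = sum(byte_counts.values())
    if total == 0:
        return {}
    return {lang: 100 * size / total for lang, size in byte_counts.items()}


# Hypothetical repository: a short Python script plus vendored JS dependencies.
repo = {
    "JavaScript": 990_000,  # mostly node_modules
    "TypeScript": 6_000,    # also from node_modules
    "Shell": 2_000,         # also from node_modules
    "HTML": 1_200,          # templates
    "Python": 800,          # the actual application code
}

shares = language_shares(repo)
ranking = sorted(shares, key=shares.get, reverse=True)

for lang in ranking:
    print(f"{lang}: {shares[lang]:.2f}%")
```

In this example the project's real language, Python, ends up last, behind TypeScript and Shell code that the author never wrote.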

All in all, from my long-term usage of GitHub and sampling, I would estimate that <1% of repositories commit their dependencies (e.g., node_modules in the case of JS). And most projects that do so are either JS projects or TypeScript projects. Hence, I think the error is overall small enough.

A good solution would be to filter out repositories that have folders with names like node_modules or vendor, and also repositories that contain files such as jquery-3.7.1.min.js. This would really eliminate this tracking error. But I don't think this kind of filtering is possible with the GitHub BigQuery Dataset, as this type of information is not available in the dataset.
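If file paths were available for each repository, such a filter could look roughly like the sketch below. The directory names, the filename pattern, and the function name are illustrative assumptions; nothing here relies on fields of the BigQuery dataset.

```python
import re
from pathlib import PurePosixPath

# Assumed directory names that usually hold committed dependencies.
VENDOR_DIRS = {"node_modules", "vendor"}

# Assumed pattern for a bundled jQuery release, e.g. jquery-3.7.1.min.js.
BUNDLED_FILE = re.compile(r"^jquery-\d+(\.\d+)*(\.min)?\.js$")


def has_vendored_code(paths):
    """Return True if any path suggests that dependencies were committed."""
    for raw in paths:
        path = PurePosixPath(raw)
        # Look only at directory components, not at the file name itself.
        if VENDOR_DIRS.intersection(path.parts[:-1]):
            return True
        if BUNDLED_FILE.match(path.name):
            return True
    return False


clean = ["app.py", "templates/index.html", "static/site.js"]
vendored = ["app.py", "node_modules/left-pad/index.js"]
bundled = ["app.py", "static/jquery-3.7.1.min.js"]

print(has_vendored_code(clean), has_vendored_code(vendored), has_vendored_code(bundled))
```

Repositories for which the check returns True would be excluded before languages are counted. A real filter would likely need a longer list of directory and file patterns.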