Open junluo-aspecta opened 8 months ago

The table `bigquery-public-data.github_repos.languages` was last updated in 2022 and contains just over 3 million rows in total. How do you obtain the latest data and ensure that the statistics reflect the entire GitHub ecosystem?
Hi @junluo-aspecta,
Yes, the `bigquery-public-data.github_repos.languages` table includes "only" 3 million repositories, but that sample size is statistically safe. For instance, election pollsters usually survey just a few thousand people and extrapolate to the whole population, e.g., the roughly 330 million people in the US. With a few thousand respondents the margin of error is a few percent; with 3 million it is extremely small.
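As a back-of-the-envelope check, assuming simple random sampling of a proportion (the usual poll model, not a claim about how the BigQuery sample was drawn):

```python
from math import sqrt

def margin_of_error(n: float, p: float = 0.5, z: float = 1.96) -> float:
    """95% margin of error for an estimated proportion p with sample size n."""
    return z * sqrt(p * (1 - p) / n)

# A poll-sized sample vs. the BigQuery language table.
print(f"n = 1,000:     +/- {margin_of_error(1_000):.2%}")      # ~ +/- 3.10%
print(f"n = 3,000,000: +/- {margin_of_error(3_000_000):.3%}")  # ~ +/- 0.057%
```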
Regarding the table last being updated in 2022 (precisely Nov 27, 2022): that is a real issue I was not aware of yet. It means I'm currently counting the events correctly, but the language table does not include repositories created after Nov 27, 2022. This skews the statistics more and more over time, because all events are matched against a sample of 3M repos created before that date. Hence, I have to find a new way to obtain a sufficiently large, up-to-date sample of repository language metadata. Thanks for discovering and reporting this.
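For reference, the staleness is easy to verify from the table metadata; a minimal sketch using the `google-cloud-bigquery` Python client (assumes Google Cloud credentials and a default project are configured):

```python
from google.cloud import bigquery

client = bigquery.Client()  # needs GCP credentials configured
table = client.get_table("bigquery-public-data.github_repos.languages")

# Table.modified is the table's last-modified timestamp.
print(f"last modified: {table.modified}")   # reported as Nov 27, 2022 above
print(f"rows:          {table.num_rows}")   # just over 3 million
```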
Hi @junluo-aspecta,
Okay, I did some research, thought about it for a while, and came up with a new idea. We can extract language information directly from the GH Archive events, because it is stored in the PullRequest events. That gives a large sample size (millions of repositories), and the data is up to date, since we can read the language from the PullRequest events of the current quarter. The drawback is that this approach ignores any repository that has not seen a single pull request over the last quarter (not even from a bot such as Dependabot). I think it is a fair trade-off for now, until we maybe come up with a better idea; a rough sketch of the extraction follows below.
https://github.com/madnight/githut/commit/f8adb52ef932b16f92e3cabb8a042d876124d1a1
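For illustration, a minimal sketch of that kind of extraction (not the actual githut pipeline in the commit above; it assumes the repo language is available at `payload.pull_request.base.repo.language` in GH Archive PullRequestEvent records, and uses a single example hourly dump, where a real run would iterate over a whole quarter):

```python
import gzip
import json
from collections import Counter
from urllib.request import urlopen

# One hourly GH Archive dump, as an example.
URL = "https://data.gharchive.org/2024-01-01-15.json.gz"

languages = Counter()
seen_repos = set()

with urlopen(URL) as resp, gzip.open(resp, mode="rt", encoding="utf-8") as lines:
    for line in lines:
        event = json.loads(line)
        if event.get("type") != "PullRequestEvent":
            continue
        repo = event["payload"]["pull_request"]["base"]["repo"]
        lang = repo.get("language")
        # Count each repository at most once per run.
        if lang and repo["id"] not in seen_repos:
            seen_repos.add(repo["id"])
            languages[lang] += 1

print(languages.most_common(10))
```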
I think it's fair to only count repos that have seen some activity in the given timeframe. But not counting pushes/issues/stars for repos that didn't get a pull request seems like a significant bias. Many small projects, and even some bigger ones, don't use pull requests.
Apparently Languish switched to using GitHub's GraphQL API to get the repo language; would that suit githut's needs?
I agree with you regarding the bias.
Regarding the GitHub GraphQL API, it has pretty tight rate limits. These limits are designed for normal users, not for those who want to fetch information from millions of repositories. I'm not sure how Languish handles it, but they might be using a much smaller sample size. However, reducing the sample size can create statistical challenges, especially for the lower-ranked languages.
Languish fetches 500 repos per GraphQL query; that's 6,000 queries for 3 million repos. You could spread those over 24h and/or use a lot of caching (repo language changes should be rare).
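For what it's worth, that kind of batching works by aliasing many `repository` fields in a single query; a rough sketch (the field names follow GitHub's GraphQL schema, but the batch handling, token placeholder, and lack of retries here are simplifications, not how Languish actually does it):

```python
import requests

GITHUB_TOKEN = "ghp_..."  # hypothetical token placeholder

def batch_primary_languages(repos):
    """Fetch primaryLanguage for many (owner, name) pairs in one GraphQL query."""
    fields = []
    for i, (owner, name) in enumerate(repos):
        fields.append(
            f'r{i}: repository(owner: "{owner}", name: "{name}") '
            "{ nameWithOwner primaryLanguage { name } }"
        )
    query = "query { " + " ".join(fields) + " }"

    resp = requests.post(
        "https://api.github.com/graphql",
        json={"query": query},
        headers={"Authorization": f"bearer {GITHUB_TOKEN}"},
    )
    resp.raise_for_status()

    result = {}
    for node in resp.json()["data"].values():
        if node:  # missing/private repos come back as null
            lang = node["primaryLanguage"]
            result[node["nameWithOwner"]] = lang["name"] if lang else None
    return result

# e.g. batch_primary_languages([("madnight", "githut"), ("torvalds", "linux")])
```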
I couldn't figure out what Languish does on that front; maybe @tjpalmer can enlighten us.
I run once per quarter, and the GraphQL churns for a long time. Can take hours. But they haven't blocked me yet. And yeah, I do cache results offline. And requests frequently error, so I run repeatedly. And some things never seem to show up. But I still feel like I see recent data better anyway. I may have been caching for too long now, though, since some repos may have changed their primary language since I started caching. Anyway, I just hobble along as well as I can. Maybe we should ask GitHub for better access sometime. Seriously, if they still dumped repos to BigQuery regularly, that would be awesome.
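Something like the following on-disk cache plus retry loop captures that workflow (a hypothetical sketch, not the actual Languish code):

```python
import shelve
import time

def cached_with_retries(key, fetch, cache_path="lang_cache.db", attempts=5):
    """Return a cached value for `key`, or call `fetch()` with retries on a miss."""
    with shelve.open(cache_path) as cache:
        if key in cache:
            return cache[key]
        for attempt in range(attempts):
            try:
                value = fetch()
            except Exception:
                time.sleep(2 ** attempt)  # simple exponential backoff
            else:
                cache[key] = value
                return value
    return None  # gave up; "some things never seem to show up"
```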
Oh. I also only look at repos that have at least 10 events in the quarter or some such. My memory is that it substantially reduces the number of repos I query on. Still lots and lots, though.
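That kind of pre-filter is cheap to apply while scanning the archive; a small sketch (assuming per-event repo names have already been extracted as in the snippet further up, and with the 10-event threshold as an illustrative value):

```python
from collections import Counter

MIN_EVENTS = 10  # the threshold mentioned above; the exact value is a tuning knob

def repos_worth_querying(event_repo_names):
    """Keep only repos with at least MIN_EVENTS events in the quarter."""
    counts = Counter(event_repo_names)
    return {repo for repo, n in counts.items() if n >= MIN_EVENTS}
```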