rust-lang / rust-repos

Dataset of Rust source code repositories
MIT License

Tracking issue for rework of rust-repos scraper #121

Open NULLx76 opened 2 days ago

NULLx76 commented 2 days ago

Last year, I used this repository as part of my research analysing the release practices of all Java repositories on GitHub. While doing so, I discovered that this repository had a few issues, partly because it simply hasn't been updated in a while. I hope it is not too presumptuous of me to suggest a rework, but I think it would be a nice thing to do, and I am willing to take it on myself.

This is a tracking issue, documenting all the things I've found (and still remember). When I encounter/remember more, I'll add them to this issue.

The final scraper I implemented for Java can be found here, specifically in src/scraper. I'd mostly want to port that code to rust-repos, as I've verified that it works and it should be mostly applicable.

A natural issue I ran into when scraping millions of repositories is that it can take weeks to scrape all of GitHub while respecting the rate limits (even with some tricks). There are different solutions to this, but first it would be good to find out how much of an issue this is for Rust, as there are far fewer Rust repositories than Java ones. This is also related to #65: in its current state it may simply not be feasible to do that, but I can look into whether it is.
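
To give a feel for the rate-limit concern above, here is a rough back-of-the-envelope sketch. It is not taken from either scraper. The function name `estimated_hours`, the repository count, the one-request-per-repository figure, and the budget of 5,000 requests per hour are all illustrative assumptions, so real numbers may differ a lot.

```rust
/// Estimate how many hours a scrape takes under a fixed hourly request budget.
/// Hypothetical helper for illustration; all inputs are assumptions,
/// not measurements from rust-repos.
fn estimated_hours(repos: u64, requests_per_repo: u64, requests_per_hour: u64) -> f64 {
    // Total number of API requests needed for the whole scrape.
    let total_requests = repos * requests_per_repo;
    // Hours needed if every hour's budget is used in full.
    total_requests as f64 / requests_per_hour as f64
}

fn main() {
    // Assumed inputs: 2 million repositories, 1 request each,
    // and an assumed budget of 5,000 requests per hour.
    let hours = estimated_hours(2_000_000, 1, 5_000);
    println!("{:.0} hours (~{:.1} days)", hours, hours / 24.0);
}
```

With these made-up inputs the estimate is 400 hours, so roughly two and a half weeks of continuous scraping. Plugging in a measured count of Rust repositories would show how much of a problem this actually is here.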

NULLx76 commented 2 days ago

@rustbot claim

rustbot commented 2 days ago

Error: This repository is not enabled to use triagebot. Add a triagebot.toml in the root of the default branch to enable it.

Please file an issue on GitHub at triagebot if there's a problem with this bot, or reach out on #t-infra on Zulip.