github-vet / bots

Bots for running analysis on GitHub's public Go repositories and crowdsourcing their classification.
MIT License

Duplicates can be written to the visited repositories file. #92

Open kalexmills opened 3 years ago

kalexmills commented 3 years ago

Welp...

/ # wc -l data/visited_repos.csv
321683 data/visited_repos.csv
/ # wc -l repos.csv
223664 repos.csv

That's a thing.

kalexmills commented 3 years ago

Ok, maybe it's not so bad.

/ # cat data/visited_repos.csv | sort | uniq | wc -l
171887
/ # wc -l repos.csv
223664 repos.csv

It seems that VetBot is just writing duplicates into visited_repos.csv: the file has 321683 lines but only 171887 unique ones. The file is read into a set, so the duplicates collapse on load.

So maybe it's actually visiting repositories twice. It still seems there is an issue, though.