src-d / borges

borges collects and stores Git repositories.
https://docs.sourced.tech/borges/
GNU General Public License v3.0

Producer: on adding 200 repos, only 178 are in DB #78

Closed bzz closed 7 years ago

bzz commented 7 years ago

top200repos.txt is a 200-line file: cat top200repos.txt | sort -u | uniq -d -c prints nothing, which suggests there are no duplicates, and wc -l reports 200.

bzz commented 7 years ago

Same happens with 2000 repos from text file:

testing=# select count(*) from repositories;
 count
-------
  1814

ajnavarro commented 7 years ago

@bzz this is not a Borges error: we have duplicated repos in the python and java lists, like this one: github.com/mihaic/graphalytics.git

ajnavarro commented 7 years ago

@bzz You can also check this in the top200repos.txt file itself: sort top200repos.txt | uniq --count.

I will close the issue, feel free to reopen if I'm wrong.

bzz commented 7 years ago

Thanks for catching this! I believe the confusion came from the sort -u above, which already filters out all dupes.

sort top200repos.txt | uniq | wc -l
178 ✅
sort top200repos.txt | uniq | wc -l
1814
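
For anyone hitting the same thing, here is a minimal sketch of why the first check came back empty. The file contents and the github.com/example/... names are made up for illustration; only sort, uniq, wc, tr and mktemp are used. Since sort -u removes duplicates before uniq -d ever sees them, that pipeline can never report anything, while a plain sort keeps the repeated lines adjacent so uniq -d can print them.

```shell
#!/bin/sh
# Hypothetical sample list: 4 lines, one repo repeated (names are invented).
list=$(mktemp)
printf '%s\n' \
  'github.com/example/a.git' \
  'github.com/example/b.git' \
  'github.com/example/a.git' \
  'github.com/example/c.git' > "$list"

# sort -u already drops duplicates, so uniq -d has nothing left to report.
masked=$(sort -u "$list" | uniq -d)

# Plain sort keeps duplicates adjacent, so uniq -d prints each repeated line once.
dupes=$(sort "$list" | uniq -d)

# Number of distinct entries; tr strips the padding some wc versions add.
distinct=$(sort -u "$list" | wc -l | tr -d ' ')

echo "masked check: '$masked'"
echo "duplicates:   $dupes"
echo "distinct:     $distinct"
rm -f "$list"
```

Running sort -u file | wc -l on a list before feeding it to the producer gives the number of distinct entries to compare against the row count in the repositories table.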