nelsonic / github-scraper

🕷 🕸 crawl GitHub web pages for insights we can't GET from the API ... 💡
425 stars 96 forks

Scraper does not account for thousands in Star/Watcher/Fork count #112

Closed: gianlucascoccia closed this issue 7 months ago

gianlucascoccia commented 4 years ago

I have found that, for popular repositories with thousands of stars and watchers, the scraper does not return the correct output, probably because it disregards the 'k' appended to the number.

Take, for instance, the Atom repository:

[Screenshot: Schermata 2020-06-25 alle 08 39 26]

This is the output of my simple script, which just prints the data received from the scraper:

[Screenshot: Schermata 2020-06-25 alle 08 41 24]

nelsonic commented 4 years ago

@gianlucascoccia thanks for opening this issue to inform us that the k values are no longer working. As you can see from the screenshot you have kindly shared, the commits value is NaN too ... 😕

GitHub have very recently updated their UI and changed a bunch of classes, so our scraper/parser is no longer getting the correct data. See #113.

We have a RegEx that parses abbreviated values such as 52.3k into 52300: https://github.com/nelsonic/github-scraper/blob/47d0a460db49b5ea3067ce4eb1d4e6bc27b7f505/lib/utils.js#L2-L16

and it has tests: https://github.com/nelsonic/github-scraper/blob/47d0a460db49b5ea3067ce4eb1d4e6bc27b7f505/test/utils.test.js#L7-L14
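
For anyone reading along, here is a rough sketch of the kind of conversion described above. It is not the code in `lib/utils.js`; the function name `parseCount` and the exact pattern are assumptions made for illustration, and it only handles the `k` suffix mentioned in this issue.

```javascript
// Hypothetical helper for illustration only; NOT the implementation in lib/utils.js.
// Converts abbreviated counters such as "52.3k", "1,234" or "987" to integers.
function parseCount(text) {
  // Normalise: drop surrounding whitespace and thousands separators.
  const cleaned = String(text).trim().replace(/,/g, '');
  // Digits, optional decimal part, optional "k" suffix (case-insensitive).
  const match = /^(\d+(?:\.\d+)?)(k?)$/i.exec(cleaned);
  if (!match) {
    // Unrecognised input, e.g. an empty string when a CSS selector no longer matches.
    return NaN;
  }
  const multiplier = match[2] === '' ? 1 : 1000;
  // Math.round guards against floating-point noise in the multiplication.
  return Math.round(parseFloat(match[1]) * multiplier);
}

console.log(parseCount('52.3k'));
console.log(parseCount('1,234'));
console.log(parseCount(''));
```

With this sketch, `'52.3k'` becomes `52300` and `'1,234'` becomes `1234`, while an empty string yields `NaN`, which is the sort of value you would expect to see when the selectors stop matching after a UI change.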

But as I say, GitHub have changed their UI/classes so they have "broken" our scraper. 🤦 If you want to help fix this, the classes that need updating are in the repo file:

https://github.com/nelsonic/github-scraper/blob/47d0a460db49b5ea3067ce4eb1d4e6bc27b7f505/lib/repo.js#L26-L35

A pull request is very much welcome. Thanks. ☀️

nelsonic commented 7 months ago

Fixed. See: https://github.com/nelsonic/github-scraper/actions/runs/7549448498/job/20553449066#step:5:655