I started to fix the ISBN scraping and then ended up making a bunch of changes!
## ISBN
I've added what I think is a more reliable way of getting ISBN and ISBN13 to the `get_books.py` script.
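As a rough illustration of the kind of approach that tends to be more reliable: rather than scraping the rendered book-details box, you can pull the ISBN fields out of the JSON metadata embedded in the page source. This is a minimal sketch, not the actual code in `get_books.py`; the function name and regex patterns are my own illustration.

```python
import re

def extract_isbns(page_html: str):
    """Pull ISBN and ISBN13 out of raw Goodreads page HTML.

    Looks for the isbn/isbn13 fields in the page's embedded JSON
    metadata, which tends to be more stable than scraping the
    rendered book-details box.
    """
    isbn = None
    isbn13 = None
    # ISBN-10: nine digits followed by a digit or X
    m = re.search(r'"isbn"\s*:\s*"(\d{9}[\dXx])"', page_html)
    if m:
        isbn = m.group(1)
    # ISBN-13: thirteen digits
    m = re.search(r'"isbn13"\s*:\s*"(\d{13})"', page_html)
    if m:
        isbn13 = m.group(1)
    return isbn, isbn13
```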
## CSV Output
I added an option to both `get_books.py` and `get_reviews.py` for writing the aggregated data out as CSV.
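The CSV path can be handled entirely with the standard library. This is a hedged sketch of the idea (the function name and field set are illustrative, not the repo's actual code):

```python
import csv

def write_books_csv(books, path):
    """Write a list of per-book dicts to a single aggregated CSV.

    `books` is a list of dicts with a consistent set of keys
    (e.g. title, author, isbn, isbn13).
    """
    if not books:
        return
    fieldnames = list(books[0].keys())
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(books)
```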
## Built-in Web Driver Handling
I added two Python libraries that will automatically install the Selenium Chrome and Firefox web drivers (if they're not already installed) and add them to the user's PATH. I hope this makes the scraper more accessible for people!
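The check these auto-installer libraries perform first can be sketched with just the standard library (this is my own illustration of the concept, not the libraries' actual code):

```python
import shutil

def driver_available(browser: str) -> bool:
    """Return True if the Selenium driver for `browser` is on PATH.

    Auto-installer libraries typically start with a check like this:
    look for the driver binary on PATH, and only download it if
    it's missing.
    """
    driver_names = {"chrome": "chromedriver", "firefox": "geckodriver"}
    name = driver_names.get(browser.lower())
    return name is not None and shutil.which(name) is not None
```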
## Sort Order
I changed the sort-order options from numbers (`1`, `2`, `3`) to words (`default`, `newest`, `oldest`).
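With `argparse`, word-valued options are easy to validate at the command line. A minimal sketch, assuming an `argparse`-style CLI (the flag name is illustrative):

```python
import argparse

def parse_args(argv):
    """Parse a word-valued --sort_order flag instead of 1/2/3."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--sort_order",
        choices=["default", "newest", "oldest"],
        default="default",
        help="review sort order: default, newest, or oldest",
    )
    return parser.parse_args(argv)
```

`argparse` rejects anything outside `choices` with a clear error message, which is friendlier than silently accepting a bad number.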
## Exception Handling
One of the big things I noticed while testing was that certain Goodreads books would trap our scraper in a loop: it would hit a pop-up on, say, page 3 of the reviews, restart the scraping for that book, and then hit the same pop-up on page 3 again, over and over. I changed the code to just skip that page of reviews and move on.
This means the script sometimes returns fewer than 300 reviews for some books, but to my mind that's better than completely derailing the script. The Goodreads website is so finicky; I don't think this script can ever be completely perfect.
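The skip-and-move-on behavior boils down to bounding the retries per page instead of restarting the whole book. A sketch of that logic (the function names and retry limit are illustrative, not the repo's actual code):

```python
MAX_ATTEMPTS_PER_PAGE = 3  # assumed limit, for illustration

def scrape_reviews(fetch_page, num_pages):
    """Collect reviews page by page, skipping pages that keep failing.

    `fetch_page(page)` returns a list of reviews or raises on a
    pop-up / load failure. Instead of restarting the whole book when
    one page fails repeatedly, we give up on just that page.
    """
    reviews = []
    for page in range(1, num_pages + 1):
        for _ in range(MAX_ATTEMPTS_PER_PAGE):
            try:
                reviews.extend(fetch_page(page))
                break
            except Exception:
                continue  # retry this page
        # after MAX_ATTEMPTS_PER_PAGE failures, move on to next page
    return reviews
```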
I did notice, however, that the scraper seems to work much better with Firefox, which I note in the README.
## Tutorial
I added a little Jupyter notebook tutorial that demonstrates how to use the scripts! Any feedback on it is welcome.