maria-antoniak / goodreads-scraper

A Python scraper for Goodreads books and reviews.
GNU General Public License v3.0

fix some bugs, update documentation, add condensed data #6

Closed melaniewalsh closed 4 years ago

melaniewalsh commented 4 years ago

Hey! I was trying to use get_books.py this morning and ran into a couple of issues, which led me down a rabbit hole of updates and suggestions for the Goodreads Scraper.

Book ID

So I was actually trying to get some data about our classics books from the Goodreads API. The API generally requires either a book's ID or ISBN. This made me realize that the Goodreads API asks for the book's ID as simply a number, e.g., 1934 and not 1934.Little_Women. So I added some code to output both "book_id_title" (1934.Little_Women) and "book_id" (1934) in the JSON. There might be a better way of defining that information and explaining it all in the documentation.
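For reference, here is a minimal sketch of splitting the two forms of the ID. It assumes the slug always starts with the numeric ID, and the function name `split_book_id` is hypothetical, not something from the scraper itself.

```python
import re


def split_book_id(book_id_title: str) -> dict:
    """Return both forms of the ID from a slug such as '1934.Little_Women'."""
    # Assumption: the numeric Goodreads ID is the leading run of digits.
    match = re.match(r"\d+", book_id_title)
    if match is None:
        raise ValueError(f"no leading numeric ID in {book_id_title!r}")
    return {"book_id_title": book_id_title, "book_id": match.group()}


print(split_book_id("1934.Little_Women"))
# {'book_id_title': '1934.Little_Women', 'book_id': '1934'}
```

Keeping both keys means the output still works for building Goodreads URLs, while "book_id" can be passed straight to the API.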

ISBN

I think the web interface may have changed the way it presents ISBN info? The scraper was coming up empty for some ISBNs and raising an error, so I changed the code to report "isbn not found" instead. But we might want to look into what's going on with the ISBNs in the HTML.
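The fallback idea looks roughly like the sketch below. This is not the scraper's actual parsing code: it runs a regular expression over raw HTML, and both the pattern and the function name `get_isbn` are assumptions for illustration, since the real page structure is exactly what's in question.

```python
import re

# Assumption: the ISBN follows the label "ISBN" within a short stretch of
# non-digit markup, and is either 13 digits or 9 digits plus a digit or "X".
ISBN_PATTERN = re.compile(r"ISBN\D{0,40}?(\d{13}|\d{9}[\dX])")


def get_isbn(html: str) -> str:
    """Return the first ISBN-like value, or a placeholder instead of raising."""
    match = ISBN_PATTERN.search(html)
    return match.group(1) if match else "isbn not found"
```

Returning a placeholder string keeps a long scraping run from dying on one book, at the cost of having to filter those placeholder values out later.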

Condensed Data

I added some functions (originally written by you) to output aggregated files for the book metadata and reviews. I feel like that will probably be useful for people.
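As a rough illustration of what the aggregation does, here is a sketch that merges a directory of per-book JSON files into a single JSON list. The function name `condense_json_files` and the one-file-per-book layout are assumptions here, not necessarily how the functions in this PR are written.

```python
import json
from pathlib import Path


def condense_json_files(input_dir: str, output_file: str) -> int:
    """Merge every .json file in input_dir into one JSON list; return the count."""
    output_path = Path(output_file)
    records = []
    for path in sorted(Path(input_dir).glob("*.json")):
        # Skip an aggregate file left over from a previous run.
        if path.resolve() == output_path.resolve():
            continue
        with path.open(encoding="utf-8") as f:
            records.append(json.load(f))
    with output_path.open("w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)
    return len(records)
```

Something like `condense_json_files("output_dir", "output_dir/all_books.json")` would then give one file to load instead of one per book.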

lxml

When I tried to run get_books.py, it said I needed to install the lxml parser, so I added that in the docs, and I also added a requirements.txt file.

WebDrivers

Since I was already knee-deep in the code, I started testing out get_reviews.py again. I realized that the WebDriver stuff is pretty confusing, especially for people who might not have as much computational experience. So I tried to explain it in more detail in the docs. I also made it so you have to define the file path to the WebDriver binary when you run the Python script. I don't know if that's the best way of doing things, but I feel like it might be more accessible for people (rather than explaining how to add the WebDriver to one's PATH).
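The general shape of taking the driver path as an argument is sketched below. The flag name `--web_driver_path` and the helper `check_driver` are hypothetical, so check the updated docs for the actual argument name.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse command-line arguments, including the path to the WebDriver binary."""
    parser = argparse.ArgumentParser(description="Scrape Goodreads reviews.")
    # Hypothetical flag name, used here for illustration only.
    parser.add_argument(
        "--web_driver_path",
        required=True,
        type=Path,
        help="full path to the WebDriver binary, e.g. chromedriver or geckodriver",
    )
    return parser.parse_args(argv)


def check_driver(path: Path) -> Path:
    """Fail early with a readable message if the binary is not where the user said."""
    if not path.is_file():
        raise FileNotFoundError(f"No WebDriver binary found at {path}")
    return path
```

Checking the path up front gives a clearer error than whatever the browser automation library raises later when it can't start the driver.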

get_reviews.py bugs

There were a couple of other issues with the get_reviews.py script. The sort order wasn't being defined in a couple of places, but I think I fixed that. The Goodreads sign-in pop-up was also breaking the script again, but I adjusted the exception handling, and I think it's working now.

CSV Output?

If we have time, it might be cool to include a CSV file output option. I started working on it, but I don't have time to finish it right now.
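One possible shape for this, as a sketch only: it assumes the records are flat dictionaries like the aggregated book metadata, and the function name `write_csv` is hypothetical. It is not the unfinished code mentioned above.

```python
import csv


def write_csv(records: list, output_file: str) -> None:
    """Write a list of flat dicts to CSV, using the union of their keys as columns."""
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    with open(output_file, "w", newline="", encoding="utf-8") as f:
        # restval fills in an empty cell when a record lacks a column.
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(records)
```

Nested fields (lists of genres, shelves, and so on) would need flattening first, which is probably the fiddly part.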