Hey! I was trying to use get_books.py this morning and ran into a couple of issues, which led me down a rabbit hole of updates and suggestions for the Goodreads Scraper.
Book ID
So I was actually trying to get some data about our classics books from the Goodreads API, which generally requires either a book's ID or its ISBN. That made me realize the API expects the book's ID as just a number (e.g., 1934), not the ID-plus-title form (1934.Little_Women). So I added some code to output both "book_id_title" (1934.Little_Women) and "book_id" (1934) in the JSON. There might be a better way of naming those fields and explaining them in the documentation.
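For reference, here's a minimal sketch of the split described above. It assumes the ID-plus-title string always starts with the numeric ID; the helper name `split_book_id` is hypothetical and isn't necessarily how the code in this PR does it.

```python
import re


def split_book_id(book_id_title):
    """Return the leading numeric ID from a string like '1934.Little_Women'."""
    match = re.match(r"\d+", book_id_title)
    if match is None:
        raise ValueError(f"no numeric ID at the start of {book_id_title!r}")
    return match.group()


# Both forms end up side by side in the JSON record.
record = {
    "book_id_title": "1934.Little_Women",
    "book_id": split_book_id("1934.Little_Women"),
}
```

Matching only the leading digits means the same helper works no matter which separator follows the number.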
ISBN
I think the web interface may have changed the way it presents ISBN info? The scraper was coming up empty for some books' ISBNs, which caused an error, so I changed the code to report "isbn not found" instead. But we might want to look into what's going on with the ISBNs in the HTML.
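The fallback works roughly like the sketch below. It is illustrative only: the real scraper parses the book page's HTML, whereas this hypothetical `get_isbn` helper just searches a plain-text snippet for something shaped like a 10- or 13-character ISBN, and the sample value is made up.

```python
import re

# Assumption: an ISBN appears as 13 digits, or as 9 digits plus a digit or "X".
ISBN_PATTERN = re.compile(r"\b(?:\d{13}|\d{9}[\dX])\b")


def get_isbn(text):
    """Return the first ISBN-shaped string in text, or a placeholder if none."""
    match = ISBN_PATTERN.search(text or "")
    if match is None:
        # Report the gap instead of raising, so one book can't stop the run.
        return "isbn not found"
    return match.group()
```

Returning a placeholder keeps the "isbn" field present in every record, which makes the missing ones easy to find later.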
Condensed Data
I added some functions (originally written by you) to output aggregated files for the book metadata and reviews. I feel like that will probably be useful for people.
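In case it helps to picture the aggregation, here's a rough sketch. It assumes each book is saved as its own JSON file in one folder; the function name `condense_json_files` and the file layout are hypothetical, not necessarily what the functions in this PR use.

```python
import json
from pathlib import Path


def condense_json_files(input_dir, output_path):
    """Combine every *.json file in input_dir into one JSON list at output_path."""
    output_path = Path(output_path)
    records = []
    for path in sorted(Path(input_dir).glob("*.json")):
        if path.resolve() == output_path.resolve():
            continue  # skip the aggregate itself if it lives in the same folder
        with path.open(encoding="utf-8") as f:
            records.append(json.load(f))
    with output_path.open("w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)
    return len(records)
```

The return value is the number of files that were combined, which is handy as a quick sanity check after a scrape.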
lxml
When I tried to run get_books.py, it said I needed to install the lxml parser, so I added that in the docs, and I also added a requirements.txt file.
WebDrivers
Since I was already knee-deep in the code, I started testing out get_reviews.py again. I realized that the WebDriver stuff is pretty confusing, especially for people who might not have as much computational experience. So I tried to explain it in more detail in the docs. I also made it so you have to define the file path to the WebDriver binary when you run the Python script. I don't know if that's the best way of doing things, but I feel like it might be more accessible for people (rather than explaining how to add the WebDriver to one's PATH).
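Here's a minimal sketch of the idea, using argparse. The flag name `--web_driver_path` is an assumption for illustration and may not match the option in the script, and the step that hands the path to Selenium is left out.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the command line and check that the WebDriver binary exists."""
    parser = argparse.ArgumentParser(description="Scrape Goodreads reviews.")
    parser.add_argument(
        "--web_driver_path",  # hypothetical flag name
        required=True,
        type=Path,
        help="path to the WebDriver binary you downloaded",
    )
    args = parser.parse_args(argv)
    if not args.web_driver_path.is_file():
        # Fail early with a readable message rather than a Selenium traceback.
        parser.error(f"no WebDriver binary found at {args.web_driver_path}")
    return args
```

With something like this, a run would look like `python get_reviews.py --web_driver_path /path/to/driver`, and a wrong path gets a plain error message right away.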
get_reviews.py bugs
There were a couple of other issues with the get_reviews.py script. The sort order wasn't being defined in a couple of places, but I think I fixed that. And the Goodreads sign-in pop-up was breaking the script again, but I adjusted the exception handling a bit, and I think it's working now.
CSV Output?
If we have time, it might be cool to include a CSV file output option. I started working on it, but I don't have time to finish it right now.
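As a starting point, the CSV option could look something like this sketch. It assumes each book is a flat dictionary; the function name `write_books_csv` is hypothetical, and this is not the unfinished code mentioned above.

```python
import csv


def write_books_csv(books, output_path):
    """Write a list of flat book dictionaries to a CSV file."""
    # Collect column names in first-seen order so no field gets dropped.
    fieldnames = []
    for book in books:
        for key in book:
            if key not in fieldnames:
                fieldnames.append(key)
    with open(output_path, "w", newline="", encoding="utf-8") as f:
        # restval fills in fields that a particular book is missing.
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(books)
```

Nested fields (lists of genres, for example) would need to be flattened or joined into a single string before they fit in one CSV cell.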