Collect Amazon product IDs and save in *.json

leoncvlt / blinkist-scraper

📚 Python tool to download book summaries and audio from Blinkist.com, and generate some pretty output

190 stars 36 forks

Collect Amazon product IDs and save in *.json #48

Open johndoe-dev00 opened 3 years ago

johndoe-dev00 commented 3 years ago

Use case: Scrape the Amazon product ID of each book, so that the Amazon product pages can later be scraped for review information.

Functionality: I scraped the Amazon product IDs ("ASIN") from the "Buy" button on the https://www.blinkist.com/en/books/... pages. The ASIN is required to generate the Amazon product links (https://www.amazon.com/dp/<ASIN>). If available, the product IDs are stored in the *.json files. This feature needs to be enabled through the command-line switch --get-amazon-url.

I also added a "category_id" field to the *.json files, which represents the index of the scraped category.
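To illustrate the idea, here is a minimal sketch of how an ASIN might be pulled out of a "Buy" link and turned into a product URL. It is not the code from this pull request. It assumes the link carries the ASIN in a `/dp/<ASIN>` or `/gp/product/<ASIN>` path segment, and the helper names and the `amazon_id` / `amazon_url` keys are hypothetical; only `category_id` and the `https://www.amazon.com/dp/<ASIN>` pattern come from the description above.

```python
import json
import re

# Assumption: the "Buy" link contains the ASIN in a /dp/<ASIN> or
# /gp/product/<ASIN> path segment. ASINs are 10 alphanumeric characters.
ASIN_PATTERN = re.compile(r"/(?:dp|gp/product)/([A-Z0-9]{10})(?:[/?#]|$)")


def extract_asin(buy_url):
    """Return the ASIN found in buy_url, or None if there is none."""
    match = ASIN_PATTERN.search(buy_url)
    return match.group(1) if match else None


def amazon_product_url(asin):
    """Build the canonical product link for an ASIN."""
    return f"https://www.amazon.com/dp/{asin}"


def add_amazon_fields(book, buy_url, category_id):
    """Add the (hypothetical) Amazon keys and category_id to a book dict."""
    asin = extract_asin(buy_url) if buy_url else None
    if asin:
        # Only stored when an ASIN is actually available.
        book["amazon_id"] = asin
        book["amazon_url"] = amazon_product_url(asin)
    book["category_id"] = category_id
    return book


# Placeholder data, not a real book or ASIN.
book = add_amazon_fields(
    {"title": "Example Book"},
    "https://www.amazon.com/dp/B000000000?tag=example",
    category_id=3,
)
print(json.dumps(book, indent=2))
```

Books without a recognisable "Buy" link simply get no Amazon keys, which matches the "if available" behaviour described above.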

leoncvlt commented 3 years ago

This looks fine, but is there any reason to have this other than "it's a cool feature"? I'm a huge advocate of the "do one thing and do it well" philosophy, and I see this tool mainly as a scraper of the Blinkist material. What's the advantage in lengthening the scrape by visiting one extra page for each book just to get the Amazon ASIN? Even if put behind an optional flag, it adds more logic and arguments to keep track of in the scrape_book_data method and others.

johndoe-dev00 commented 3 years ago

Well, I guess my use case is a bit different from the usual offline reading. I find the Blinkist smartphone app pretty crappy when it comes to deciding which book to listen to next, so I want to listen to the books with the most Amazon reviews. With the Amazon IDs I gathered, I was able to scrape Amazon and generate a list of all Blinkist books, ordered by number of Amazon reviews:

https://htmlpreview.github.io/?https://github.com/johndoe-dev00/blinkist-books-sorted-by-amazon-reviews/blob/main/!index.html
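The ordering step behind such a list can be sketched roughly as follows. This is only an illustration with made-up placeholder data; the `review_count` key and the function name are hypothetical and not taken from the linked repository.

```python
def sort_by_review_count(books):
    """Return books ordered by review count, most reviewed first.

    Books whose review count is missing or None are treated as 0.
    """
    return sorted(books, key=lambda b: b.get("review_count") or 0, reverse=True)


# Placeholder entries, not real scrape results.
books = [
    {"title": "Book A", "amazon_id": "B000000001", "review_count": 120},
    {"title": "Book B", "amazon_id": "B000000002", "review_count": None},
    {"title": "Book C", "amazon_id": "B000000003", "review_count": 4500},
]

ranked = sort_by_review_count(books)
for rank, entry in enumerate(ranked, start=1):
    print(rank, entry["title"], entry["review_count"] or 0)
```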

Other people might have similar use cases that involve a book's Amazon ID, which is why I thought I would create a pull request. If you think this is out of scope for the intended use case, feel free to reject the pull request.