leoncvlt / blinkist-scraper

📚 Python tool to download book summaries and audio from Blinkist.com, and generate some pretty output

Missing "uncategorized" category from scraper, thus not retrieving all books #29

Closed eherrerosj closed 3 years ago

eherrerosj commented 3 years ago

When navigating each category section, a list of books appears, which is used to gather all the book URLs and download them afterwards. You would expect to get every book on Blinkist by going through the "Show all books" section of every existing category, right? Well... that's not the case.

It turns out not all books have a category, which causes this (awesome) library to miss around 15% of all books (422 missing out of 2800). This problem is the cause of, for example, this issue.

Manual solution (shortcut to get all books)

I managed to craft a list of all books currently on Blinkist. I did so through their own search engine: they use Algolia, which in my experience can be queried with the client's x-algolia-application-id and x-algolia-api-key headers by calling the /indexes/books-production/query endpoint. I then did a bit of clean-up on the JSON response and constructed each book URL. Here's the list, which you can use directly with the --books argument.
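
For reference, here is a rough Python sketch of that query flow (not part of the scraper): the application ID, API key, and the "slug" field name are placeholders/assumptions that would need to be confirmed against the site's network traffic; only the books-production index and the two headers come from the description above.

# Rough sketch of querying Blinkist's Algolia index (assumptions marked below).
import requests

APP_ID = "APPLICATION_ID_HERE"   # placeholder for x-algolia-application-id
API_KEY = "SEARCH_API_KEY_HERE"  # placeholder for x-algolia-api-key

# Standard Algolia REST host pattern; the index name comes from the endpoint above.
url = f"https://{APP_ID.lower()}-dsn.algolia.net/1/indexes/books-production/query"
headers = {
    "X-Algolia-Application-Id": APP_ID,
    "X-Algolia-API-Key": API_KEY,
}

# An empty query with a large page size returns everything; page until nbPages is exhausted.
slugs, page, nb_pages = [], 0, 1
while page < nb_pages:
    response = requests.post(
        url,
        headers=headers,
        json={"params": f"query=&hitsPerPage=1000&page={page}"},
    )
    data = response.json()
    nb_pages = data["nbPages"]
    # "slug" is a guess at the record field holding the book identifier.
    slugs += [hit["slug"] for hit in data["hits"] if "slug" in hit]
    page += 1

# Build the book URLs, one per line, ready to feed to the --books argument.
print("\n".join(f"https://www.blinkist.com/en/books/{slug}" for slug in slugs))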

leoncvlt commented 3 years ago

Good catch! I'll have a look to see if the uncategorized books can be found anywhere on the website at all. Cheers!

leoncvlt commented 3 years ago

Looks like another way to get the full book list would be to scrape the sitemap at https://www.blinkist.com/en/sitemap. One way to integrate this into the current workflow (which sorts books by category) would be to prepare a list of all the books (the language can be filtered by checking for the final -en or -de token), remove books from that list as they are scraped from the category pages, and put everything that's left at the end of the process into an "uncategorized" folder. A rough sketch of that flow is below.
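
A rough sketch of that flow (not the actual implementation; the CSS selector and the trailing -en suffix check are assumptions):

# Sketch of the sitemap approach; the selector and the "-en" suffix check are assumptions.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.blinkist.com/en/sitemap").text
soup = BeautifulSoup(html, "html.parser")

# Collect every book link from the sitemap, keeping English titles only.
all_books = {
    a["href"] for a in soup.select("section.sitemap__section--books a[href]")
    if a["href"].endswith("-en")
}

# ...scrape every category page as usual, recording each book URL seen...
categorized_books = set()  # filled in by the per-category scraping pass

# Whatever was never seen on a category page goes into the "uncategorized" folder.
uncategorized_books = all_books - categorized_books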

rocketinventor commented 3 years ago

Cool! Is there any way in Algolia to filter for results that don't have a category label (instead of doing it manually)?

@leoncvlt I once had an idea to do something a little similar to what you just described with the category lists: scrape all of the category lists once and save/cache the result somewhere, so that books downloaded individually (i.e. by URL) could be automatically sorted into the correct folder (instead of "Unsorted") by looking them up in one of the cached lists to see which category (or categories) they belong to.

Implementing a category list download/look-up like that would be a good pre-step to implementing this feature, I think.

Adding a feature/script to re-do the metadata (and move files/folders) for the "Unsorted" books that are already downloaded would also be a good move, so that people's libraries could be better organized. (The category label is written in the book's .json file based on how it was scraped, and the folder the book is saved in is based on that metadata.)
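
Something like this hypothetical look-up could cover both ideas; the cache file name and structure below are made up for illustration.

# Hypothetical category look-up: cache the category -> book URLs mapping once,
# then resolve a single book's category from it. File name and format are invented.
import json
from pathlib import Path

CACHE = Path("category_cache.json")  # e.g. {"Psychology": ["https://.../some-book-en", ...]}

def load_category_map():
    with CACHE.open() as f:
        return json.load(f)

def categories_for(book_url, category_map):
    """Return every category whose cached book list contains this URL."""
    return [cat for cat, urls in category_map.items() if book_url in urls]

# Example: decide where an individually downloaded (or "Unsorted") book should live.
category_map = load_category_map()
cats = categories_for("https://www.blinkist.com/en/books/some-book-en", category_map)
print(cats[0] if cats else "Uncategorized")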

Unfortunately, I don't have time to implement either of these features, so I can't be counted on to implement anything for now.

leoncvlt commented 3 years ago

The annoying bit is that I can't seem to find a reference to the category anywhere in the book metadata - it seems to be a one-way relationship (you can get all the books in a category from the category page, but you can't get the category a book belongs to from the book page). Happy to be proved wrong...

fabriciopirini commented 3 years ago

Hey guys, first time here, and I'm really impressed with the quality of the tool you've put together! Congrats =)

I would like to help implement the feature described above, so I ran a query in the browser to see how reliable the sitemap is for getting the books. I could retrieve 2868 English books from there. There's no metadata either, but it could be useful for some kind of "download all books" feature (which is what I was going for anyway haha).

To reproduce, go to the sitemap and run the following in the DevTools console:

elements = $$('section.sitemap__section.sitemap__section--books a[href$=en]')
for (var element of elements) { 
  console.log(element['href'])
}

I'm aware of the performance issues of checking for the -en suffix in a CSS selector, but we could use this to validate the results, or just take the output file I exported from the DevTools: https://gist.github.com/fabriciopirini/70dd768459dd70554136358f12a18b87

Let me know how I can help from here.

leoncvlt commented 3 years ago

I wrote an implementation of this in the scrape-all branch:

It should work on paper; however, I cancelled my Blinkist premium subscription, so I'm not able to test it. Would any of you be available to give it a shot and report back? 😄

fabriciopirini commented 3 years ago

I have been running your implementation from the scrape-all branch and it has been working flawlessly so far. It's getting the right categories, and I will update when it reaches the uncategorized books.

EDIT: It just finished after 2 days haha. It downloaded all the books, and I believe in the right categories. A couple of times the process got stuck while concatenating the audio files, but this is probably connected to #10.

Here's the debug file: debug.log

The resume support helped in those cases, though, after killing and restarting the process.

leoncvlt commented 3 years ago

Brilliant - just merged into master. Thanks all!