leoncvlt / blinkist-scraper

📚 Python tool to download book summaries and audio from Blinkist.com, and generate some pretty output

Fix book creation #21

Closed Arminius4 closed 4 years ago

Arminius4 commented 4 years ago

Fix book creation

I tested this with several books in both languages, with and without quotes.

The issue

  1. When blinkistscraper scrapes books, the audio files are stored correctly. However, the text version of a blink is missing content when the blink contains an interposed quote, such as in this book.

    Instead of the chapter's content, only the quote is added to the book!

  2. I also removed the necessity to provide email and password with the --no-scrape option because that constitutes a useless effort and does not make sense at all.

Notes and details

The blink is scraped correctly (i.e. all content is retrieved) into the json, but it is turned into html/epub/pdf incorrectly. In the json, three fields are of interest: text, content and supplement. The content field, which is used to create the books, is actually redundant because all of its information is contained in the other fields. In general, the text is stored twice in the json, but I don't address that in this PR. If there is a quote, content holds the quote; otherwise it contains the chapter text also found in the text field. This is why the chapter text is sometimes lost. In this PR, the quotes are displayed after the chapter, as in the blink.

While fixing the bug, I noticed that the epub does not contain the "About the author" section, whereas the html does. It might be worth considering using the html as the basis for creating both the epub and the pdf.

Note about the --no-scrape argument improvement: as far as I could tell, argparse has no built-in support for conditionally required arguments, which is why I chose the simple solution of not accepting the email and password arguments with this option. This means that blinkistscraper cannot be run if they are provided together with --no-scrape. Considering the alternative of always having to provide email and password even when they are not needed, I decided to include the change in this PR. Anyone using the option will have read the readme or help text to know that it exists, and will be surprised to be forced to provide arguments that are not needed. In fact, on my first try I didn't provide email and password because I didn't expect them to be necessary. It's of course your choice whether to follow this consideration.
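A possible alternative, sketched under assumptions rather than taken from the PR: argparse accepts `nargs="?"` on positional arguments, so email and password could be made optional and then checked manually after parsing. The argument names follow the thread; the `parse_args` wrapper and the error message are hypothetical. Unlike the change described above, this sketch still tolerates credentials when --no-scrape is given.

```python
import argparse


def parse_args(argv=None):
    """Hypothetical parser: credentials are only required when scraping."""
    parser = argparse.ArgumentParser(prog="blinkistscraper")
    # nargs="?" makes a positional argument optional; it is None when omitted.
    parser.add_argument("email", nargs="?", default=None, help="Blinkist account email")
    parser.add_argument("password", nargs="?", default=None, help="Blinkist account password")
    parser.add_argument(
        "--no-scrape",
        action="store_true",
        help="only process already downloaded json dumps",
    )
    args = parser.parse_args(argv)

    # argparse cannot express "required unless --no-scrape", so check by hand.
    if not args.no_scrape and (args.email is None or args.password is None):
        parser.error("email and password are required unless --no-scrape is given")
    return args
```

One caveat with this approach: optional positionals are best placed together at the start or end of the command line, since mixing them between flags can confuse argparse.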

Arminius4 commented 4 years ago

Before you merge, please also have a look at this branch!

There, I fixed PDF creation, because wkhtmltopdf did not produce acceptable pdf files on my Ubuntu 20.04 system. I have not checked the results of wkhtmltopdf on other systems. The issues were:

  1. Generation took very long
  2. The resulting files were very large (approx. 25 MB!)
  3. Text could not be selected
  4. The pdf was nearly unreadable due to a dark background colour and a very similar text colour.

The new way of creating PDFs in that branch, which could be merged into this one, has several advantages:

  1. Generation is much faster
  2. The resulting files are small (approx. 150 kB)
  3. Text can be selected
  4. The pdf has an index of contents
  5. No OS call or external tool is required

You get all of this for one requirement (weasyprint), which pip can install automatically. That is more user friendly than having to install a different tool manually.
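A minimal sketch of what HTML-to-PDF conversion with weasyprint could look like, assuming the book is already available as chapter HTML. The helper names `build_book_html` and `write_pdf` are hypothetical and not taken from the branch; only `weasyprint.HTML(string=...).write_pdf(...)` is the library's own call. WeasyPrint can derive PDF bookmarks from heading elements, which is presumably where the index of contents comes from.

```python
from html import escape


def build_book_html(title, chapters):
    """Assemble one HTML document from (heading, body_html) pairs.

    Headings are escaped; chapter bodies are assumed to be HTML already.
    """
    parts = [f"<h1>{escape(title)}</h1>"]
    for heading, body_html in chapters:
        parts.append(f"<h2>{escape(heading)}</h2>")
        parts.append(body_html)
    return "<html><body>\n" + "\n".join(parts) + "\n</body></html>"


def write_pdf(html, output_path):
    """Render the HTML string to a PDF file without calling an external tool."""
    # Imported lazily so the HTML helper above works without weasyprint installed.
    from weasyprint import HTML

    HTML(string=html).write_pdf(output_path)
```

Usage would be along the lines of `write_pdf(build_book_html("Some title", chapters), "book.pdf")`.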

leoncvlt commented 4 years ago

Hey, thanks for this! Do you have an example of a book with the supplement / text structure for the chapters? I tested your branch on this book, but I seem to be getting some errors because the .json dump still only has the content field in the chapters. So I'm thinking it's something that Blinkist might have introduced at some point while keeping backwards compatibility.

Regarding weasyprint - sounds good on paper! But it seems like on Windows the usage is not as straightforward as it might be on Linux. Testing it gives me an error about the Cairo library not being found - reading the installation instructions at https://weasyprint.readthedocs.io/en/stable/install.html it seems like it requires the user to install the GTK+ libraries manually.

Arminius4 commented 4 years ago

Hey Leonardo, take this book as an example. Blinkist recently introduced these interposed quotes in many of the newer books which replace the chapter content in the book creation. In the linked example, there are 3 chapters missing in the html/epub!

As I was only checking newer books, I didn't notice that there are different json formats for older books. This means that we need to introduce a version check to ensure compatibility.

Regarding the PDF generation: this is a valid objection; I didn't expect weasyprint to be so much more difficult to install on Windows. On my Linux machine, I just executed pip3 install weasyprint, and weasyprint and its dependencies were installed quickly and automatically, working like a charm without the need to install other libraries manually. The wkhtmltopdf pdf output on my machine was absolutely unacceptable, so the pdf generation feature could not be used out of the box (you will understand when you see it).

leoncvlt commented 4 years ago

Ahh, I figured it out - the text and supplement fields are present in the book data only for the free book of the day. All the other books are missing them (for obvious reasons, since the api endpoint that serves the book data is public). The script actually adds the content field itself by scraping the web reader for the book, since I couldn't figure out a way to get a complete .json for books locked behind a premium account.

Arminius4 commented 4 years ago

OK. So the bug is only there for the blink of the day, and all others just lack the quotes? Since it is not yet possible to get the full json for the premium books, would the quick solution then simply be to continue using the content field as before, except when text/supplement fields exist that can be used instead?
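The fallback proposed here could be sketched as follows. This is only an illustration: the field names text, supplement and content come from the thread, while the function name `chapter_to_html` and the assumption that each field already holds an HTML string (with supplement carrying the quote) are hypothetical.

```python
def chapter_to_html(chapter):
    """Pick the HTML for one chapter dict, tolerating both json formats.

    Assumption: "text" and "supplement" exist only for the free daily book,
    while "content" is what the scraper adds for every other book.
    """
    text = chapter.get("text")
    supplement = chapter.get("supplement")

    if text:
        # Newer format: full chapter text first, then the quote, as in the blink.
        parts = [text]
        if supplement:
            parts.append(supplement)
        return "\n".join(parts)

    # Older format: only the scraped content field is available.
    return chapter.get("content", "")
```

With this ordering, a chapter whose content field holds only the quote no longer loses its text, and books without the newer fields are rendered as before.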

rocketinventor commented 4 years ago

@Arminius4 Nice work. WeasyPrint seems to be a good solution, except for the Windows/OSX installation caveat... Some thoughts:

leoncvlt commented 4 years ago

Figured this out - the "supplements" aka notes are displayed in a different html element than the one containing the chapter text - that's why they were not being picked up when scraping the reader page. I'll add the logic to process those supplements as well when present, and merge in the rest of the improvements 👍

Regarding wkhtmltopdf, I sent Arminius an example .pdf exported on my machine and it does look as good as the one he got using weasyprint - so that's weird, it might be that the default wkhtmltopdf on apt-get is a different version or that it behaves differently on unix?

leoncvlt commented 4 years ago

The plot thickens - looks like the supplement field is only present on the free books of the day in the web reader as well - I checked the same book I was looking at yesterday, and there are no supplement sections any more when reading the book directly on the Blinkist website. Regardless, I implemented better support for the free daily book and supplements fields in 120d91a035101c7f905dea7d51c9dba6cace9c6e