Closed Arminius4 closed 4 years ago
Before you merge, please also have a look at this branch!
There, I fixed PDF creation, because wkhtmltopdf did not produce acceptable pdf files on my Ubuntu 20.04 system. I have not checked the results of wkhtmltopdf on other systems. The issues were:
The new way of creating PDFs in this branch that could be merged into this one has several advantages:
You get all of this for one requirement (weasyprint) that can be installed automatically by pip, which is more user friendly than having to manually install a different tool.
Hey, thanks for this! Do you have an example for a book with the `supplement` / `text` structure for the chapters? I tested your branch on this book but I seem to be getting some errors, as the .json dump still only has the `content` field in the chapters - so I'm thinking it's something that Blinkist might have introduced at some point, while keeping backwards compatibility.
Regarding weasyprint - sounds good on paper! But it seems like on Windows the usage is not as straightforward as it might be on Linux. Testing it gives me an error about the Cairo library not being found - reading the installation instructions at https://weasyprint.readthedocs.io/en/stable/install.html it seems like it requires the user to install the GTK+ libraries manually.
Hey Leonardo, take this book as an example. Blinkist recently introduced these interposed quotes in many of the newer books which replace the chapter content in the book creation. In the linked example, there are 3 chapters missing in the html/epub!
As I was only checking newer books, I didn't notice that older books use a different json format. This means that we need to introduce a version check to ensure compatibility.
Regarding the PDF generation: this is a valid objection; I didn't expect weasyprint to be so much more difficult to install on Windows. On my Linux machine, I just executed `pip3 install weasyprint` and weasyprint, including some dependencies, was installed quickly and automatically, working from scratch like a charm without the need to manually install other libraries.
The wkhtmltopdf pdf output on my machine was absolutely unacceptable, so the pdf generation feature could not be used out of the box (you will understand when you see it).
Ahh, I figured it out - the `text` and `supplement` fields are present in the book data only for the free book of the day. All the other books are missing them (for obvious reasons, since the api endpoint that serves the book data is public). The script actually adds the `content` field by scraping the web reader for the book, since I couldn't figure out a way to get a complete .json for books locked behind a premium account.
OK. So the bug is only there for the blink of the day and all others just lack the quotes?
Would the quick solution then simply be to continue using the `content` field as before, except when `text` / `supplement` fields exist and can be used, since it is not yet possible to get the full json for the premium books?
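A minimal sketch of that fallback, assuming each chapter in the json dump is a plain dict and that `text` / `supplement` are simply absent on older or premium books. The function name `chapter_body` is hypothetical and not part of blinkistscraper:

```python
def chapter_body(chapter):
    """Return the body for one chapter dict from the json dump.

    Assumption: newer "book of the day" chapters carry `text` (the chapter
    itself) and optionally `supplement` (the interposed quote), while all
    other chapters only carry the scraped `content` field.
    """
    text = chapter.get("text")
    supplement = chapter.get("supplement")

    if text:
        # Newer format: chapter text first, quote after it.
        parts = [text]
        if supplement:
            parts.append(supplement)
        return "\n".join(parts)

    # Older / premium format: fall back to the scraped content field.
    return chapter.get("content", "")
```

With this shape the older books keep working unchanged, and a chapter whose `content` only holds the quote no longer loses its text.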
@Arminius4 Nice work. WeasyPrint seems to be a good solution, except for the Windows/OSX installation caveat... Some thoughts:
You noted that your output PDFs don't look so great - is the problem with WKHTML? Or with the command line parameters not being correct (for your installation)? Some quick research reveals that `WKHTML` has a C library available. Perhaps this could be used instead? Or maybe it makes sense to use `WKHTML` on Windows/OSX, and `WeasyPrint` on Linux (with an option to pick the renderer, if desired)?
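A rough sketch of what such a renderer switch might look like. The function names `pick_renderer` and `render_pdf` and the `override` parameter are hypothetical; the only external calls assumed are WeasyPrint's `HTML(filename=...).write_pdf(...)` and the plain `wkhtmltopdf <input> <output>` command line, with no extra flags:

```python
import subprocess
import sys

RENDERERS = ("weasyprint", "wkhtmltopdf")


def pick_renderer(platform=None, override=None):
    """Choose a PDF renderer name; an explicit override always wins."""
    if override is not None:
        if override not in RENDERERS:
            raise ValueError(f"unknown renderer: {override!r}")
        return override
    platform = sys.platform if platform is None else platform
    # Assumption from the thread: WeasyPrint installs cleanly on Linux,
    # while Windows/OSX need extra libraries, so default to wkhtmltopdf there.
    return "weasyprint" if platform.startswith("linux") else "wkhtmltopdf"


def render_pdf(html_path, pdf_path, renderer):
    """Render html_path to pdf_path with the chosen renderer."""
    if renderer == "weasyprint":
        # Imported lazily so wkhtmltopdf users don't need WeasyPrint installed.
        from weasyprint import HTML
        HTML(filename=html_path).write_pdf(pdf_path)
    else:
        subprocess.run(["wkhtmltopdf", html_path, pdf_path], check=True)
```

The override could be wired to a command line option, so users who get bad output from one renderer can try the other without editing code.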
I like the change to not require credentials each time - I have been using a similar patch, personally. In the future, I would like to see this package switch away from the `username password` requirement and maybe to a config file, or something similar.
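For illustration only, one way a config file could look, using the standard library's `configparser`. The `[blinkist]` section name, the `email` / `password` keys and the `load_credentials` helper are all assumptions, not anything the package currently supports:

```python
import configparser


def load_credentials(path):
    """Read (email, password) from an INI file with a [blinkist] section."""
    # interpolation=None keeps '%' characters in passwords from being
    # treated as configparser interpolation syntax.
    parser = configparser.ConfigParser(interpolation=None)
    if not parser.read(path):
        raise FileNotFoundError(path)
    section = parser["blinkist"]
    return section["email"], section["password"]
```

The matching file would contain a `[blinkist]` header followed by `email = ...` and `password = ...` lines.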
Is it possible that you could copy the JSON that you are getting?
Figured this out - the "supplements" aka notes are displayed in a different html element than the one containing the chapter text - that's why they were not being picked up when scraping the reader page. I'll add the logic to process those supplements as well when present, and merge in the rest of the improvements.
Regarding wkhtmltopdf, I sent Arminius an example .pdf exported on my machine and it does look as good as the one he got using weasyprint - so that's weird, it might be that the default wkhtmltopdf on apt-get is a different version or that it behaves differently on unix?
The plot thickens - looks like the `supplement` field is only present on the free books of the day in the web reader as well - I checked the same book I was looking at yesterday, and there are no supplement sections any more when reading the book directly on the Blinkist website. Regardless, I implemented better support for the free daily book and supplements fields in 120d91a035101c7f905dea7d51c9dba6cace9c6e
Fix book creation
I tested this with several books in both languages, with and without quotes.
The issue
When blinkistscraper scrapes books, the audio files are correctly stored. However, the textual forms of the blinkists are missing content when the blinkist contains an interposed quote, such as in this book.
Instead of the chapter's content, only the quote is added to the book!
I also removed the necessity to provide email and password with the `--no-scrape` option, because requiring them there is a useless effort and does not make sense at all.
Notes and details
The blink is correctly scraped (i.e. all content is retrieved) into the json, however it is incorrectly turned into html/epub/pdf. In the json, three fields are of interest: `text`, `content` and `supplement`. The `content` field, which is used to create the books, is actually redundant, because all information is contained in the other fields; in general, the text is stored twice in the json, however I don't address that in this PR. In case there is a quote, `content` holds the quote, otherwise it contains the chapter text also found in the `text` field. This is why the chapter text is sometimes lost. As in the blink, the quotes are displayed after the chapter in this PR. While fixing the bug, I noticed that the epub does not contain the "About the author" section, whereas the html does. It might be worth considering using the html as the basis for the creation of both the epub and the pdf.
Note about the `--no-scrape` argument improvement: It turned out that argparse does not provide the functionality to have conditional arguments or optional positional arguments, which is why I chose a simple solution of not accepting the email and password arguments. This entails that blinkistscraper cannot be run if these are provided together with the `--no-scrape` option. Considering the alternative of always having to provide email and password even when it does not make sense, I decided to include the change in this PR. Anyone using the option will have read the readme or help text in order to know that it exists, and will be surprised to be forced to provide arguments which are not needed. In fact, at first try I didn't provide email and password because I didn't expect it to be necessary. It's of course your choice whether to follow this consideration.
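As a possible alternative, argparse does accept `nargs="?"` on positional arguments, which makes them optional; the "required unless `--no-scrape`" rule can then be checked after parsing. The sketch below assumes the script takes exactly the two positionals `email` and `password` plus the `--no-scrape` flag. The real parser has more options, and the `parse_args` wrapper is hypothetical:

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="blinkistscraper")
    # nargs="?" makes each positional optional; it is None when omitted.
    parser.add_argument("email", nargs="?", default=None)
    parser.add_argument("password", nargs="?", default=None)
    parser.add_argument("--no-scrape", action="store_true")
    return parser


def parse_args(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Conditional requirement, enforced by hand after parsing.
    if not args.no_scrape and (args.email is None or args.password is None):
        parser.error("email and password are required unless --no-scrape is given")
    return args
```

With this layout, `--no-scrape` works both with and without credentials, while a normal run without them still exits with a usage error.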