leoncvlt / blinkist-scraper

📚 Python tool to download book summaries and audio from Blinkist.com, and generate some pretty output

General & needed improvements #15

Closed - rocketinventor closed this 4 years ago

rocketinventor commented 4 years ago

Hi, I really appreciate the utility of this program. While I am still new to this project, that has not been an obstacle to making changes. After investing some time into it, I have made some improvements; they address bugs and annoyances in the current state, and should also make the program better for everybody.

Included in this pull request:

See the commit messages for more details - I added a lot of notes explaining everything.

Footnote:

I no longer have a premium subscription, so I was not able to test everything out completely. However, everything should work, and it was tested as working once the following lines are removed (in main.py):

Tested with main.py --book https://www.blinkist.com/en/books/10-days-to-faster-reading-en --keep-noncat --concat-audio --headless, etc., and all its variations.

leoncvlt commented 4 years ago

Wow, thanks for this! I ran it with a premium account on top of a global scrape I did a few months ago, and it seems to work fine - it added around 60 new books, although the number of processed books the script reported at the end was zero 🤔 will have to look into that.

Did you notice any performance improvements when using chromedriver with uBlock versus without? At the end of the day we're just visiting blinkist.com, and looking at the uBlock log for it, the only "questionable" things were a few scripts from Google Analytics and New Relic (a web performance monitoring service, from what I could gather). Not that I mind having it; I'm only weighing whether the overhead of initialising it is worth it.

I'm going to merge this into my dev branch, since I wanted to work on some improvements this weekend - mainly tidying up the application structure, since it's starting to get a bit crowded (and, as you rightly noted, main should have a place of its own), and setting up an automated install of the appropriate chromedriver distribution to address #14 - the only one bundled in is for Chrome 81, and the one on my machine has already updated to 83, so it might cause issues for some users.
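For reference, one common way to automate that install is the third-party webdriver-manager package - this is a sketch of one possible approach (Selenium 4 style service setup assumed), not necessarily what the repo will end up using:

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# webdriver-manager downloads (and caches) a chromedriver build matching
# the locally installed Chrome, instead of bundling a fixed binary.
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)
```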

Thanks again!

leoncvlt commented 4 years ago

One question - in the "download major enhancement" commit you say:

I noticed that scraping was taking almost as much time for books that were downloaded, as for books that were not.

The culprit for this, I found, was that the scraper was unnecessarily re-scraping books that were finished downloading already.

While the scraper would skip the actual download of whatever files were already present, it didn't make much of a difference - waiting for all of those pages to load again takes up a huge amount of time.

Two changes were made in this commit: skipping already downloaded books (and concat operations), and the ability to keep the individual audio files (as well as the concatenated one).

The ability to keep the individual files is indeed an addition, but the concat_audio_exists check was already happening in scraper.scrape_book_audio, right at the beginning of the function: it checks if the concatenated audio file exists, and if it does, it returns False - way before chromedriver gets to navigate to the book's page.
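A minimal sketch of that early return, for reference - the path construction and the file extension below are assumptions, not the repo's exact code:

```python
import os

def scrape_book_audio(driver, book_json, output_dir):
    # Early exit: if the concatenated audio already exists, bail out
    # before chromedriver ever navigates to the book's page.
    concat_path = os.path.join(output_dir, book_json["slug"] + ".m4a")
    if os.path.isfile(concat_path):
        return False
    # ...otherwise navigate to the book page and download chapter audio...
```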

I did a test on a small batch of already downloaded books, and I noticed no difference in the speed of getting to the first non-downloaded book between the original implementation and yours, where the concat_audio_exists check is done in main.py. Was that not the case for you?

rocketinventor commented 4 years ago

Greetings @leoncvlt! Thank you for the merge!

You are correct - with regard to skipping books that already have the concatenated audio present, there is no difference. I made the feature for dealing with books that have the chapter audio downloaded but not concatenated; afterward, I also added a check for the concatenated audio, not realizing that there was already a check for that.
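A minimal sketch of the main.py-side logic described here - the names, directory layout, and extension are illustrative, not the actual code:

```python
import glob
import os

def book_needs_scraping(book_dir, concat_name="concat.m4a"):
    # Already fully processed: the concatenated file is on disk.
    if os.path.isfile(os.path.join(book_dir, concat_name)):
        return False
    # Chapter files are present but not yet concatenated: no need to
    # re-scrape; the caller only needs to run the concat step.
    if glob.glob(os.path.join(book_dir, "*.m4a")):
        return False
    # Nothing downloaded yet: this book still needs scraping.
    return True
```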

As far as the content blocking goes, I know that there are Google Tag Manager, Facebook, New Relic, Cookiebot, and other scripts present on the site. All of these take up a lot of time across multiple page loads, although I did not benchmark it. uBlock definitely has some overhead, even if it is minimal. Also, it doesn't seem to load (for me) in headless mode (or book mode, as noted in the code).

The ideal way to deal with content blocking would be to do it through selenium/chromedriver directly, but I haven't yet found a way to do that. The next best option would be a custom Chrome extension that blocks specific, hand-picked resources. This should not be too difficult to do, and could even be its own self-contained project (i.e. with a command-line interface and a toolchain for generating the final extension). Aside from that, we could create a custom Chrome profile on the first run and then reuse that profile on subsequent runs (a well-documented feature).

Because networking was more of an issue for me than script overhead, I set up uBlock with whichever lists seemed relevant. It should be simple for anyone to pick the relevant rules out by hand and then re-do the uBlock export with only those - fewer blocking rules should be (a little bit) faster.
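For what it's worth, one way to block resources directly through chromedriver (no extension needed, and it works in headless mode too) is the Chrome DevTools Protocol's Network.setBlockedURLs command, which Selenium's Chrome driver exposes via execute_cdp_cmd. A sketch - the blocked URL patterns here are just illustrative:

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

# Block hand-picked third-party scripts via the DevTools Protocol,
# before navigating anywhere. Network.enable is required first.
driver.execute_cdp_cmd("Network.enable", {})
driver.execute_cdp_cmd("Network.setBlockedURLs", {"urls": [
    "*google-analytics.com*",
    "*googletagmanager.com*",
    "*newrelic.com*",
    "*connect.facebook.net*",
]})
driver.get("https://www.blinkist.com")
```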

BTW, the note about making main into its own function was a personal note that I accidentally did not strip out... However, I am now glad that you saw it and took it positively. I do think that having it as its own function, with everything else using it, would be the way to go; it would make many things easier.