leoncvlt / blinkist-scraper

📚 Python tool to download book summaries and audio from Blinkist.com, and generate some pretty output

Improved captcha handling #43

johndoe-dev00 closed 3 years ago

johndoe-dev00 commented 3 years ago

I had some trouble with the login process and the captcha.

rocketinventor commented 3 years ago

@johndoe-dev00 In your testing (after the changes you made), did you find that the captcha still showed up? If so, did the page actually go away after you solved the captcha?

Also, why did you set the maximum time to solve the captcha to one minute? Is there a specific reason it cannot be longer?

johndoe-dev00 commented 3 years ago

@rocketinventor My changes do not prevent the captcha from showing up. At the beginning of my testing, the captcha showed up frequently. After a while it became less frequent, and currently it does not show up at all, even after deleting the cookie file. Maybe Cloudflare whitelisted my IP or something.

When the captcha does show up, you need to solve it manually. After solving it, you are redirected back to Blinkist, and the scraper continues its work once the Blinkist logo is detected. As posted by albert in #42, the captcha fails to load correctly if uBlock is enabled, and you cannot proceed. Hence the new command-line switch '--no-ublock'.

Why a 60-second wait time? 60 seconds should be plenty to solve the captcha. In case someone is not watching the CLI output, I don't want them to wait 10 minutes before timing out.
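For readers following along, the "wait for the logo, give up after 60 seconds" behaviour described above can be sketched as a small polling helper. This is only an illustration, not the code from this pull request: `wait_until`, its parameters, and the idea of passing in a logo check as a callable are assumptions. With Selenium, `WebDriverWait` would be a natural alternative.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool],
               timeout: float = 60.0,
               poll_interval: float = 0.5) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds pass.

    Hypothetical helper: `condition` stands in for a check such as
    "is the Blinkist logo present on the current page?".
    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


# Example call (logo_is_visible and driver are assumed names):
#   solved = wait_until(lambda: logo_is_visible(driver), timeout=60)
#   if not solved:
#       print("Captcha was not solved within 60 seconds, giving up.")
```

A short default timeout keeps an unattended run from hanging for long, which is the trade-off argued for above. Making it a parameter leaves room for a longer limit if that turns out to be needed.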

rocketinventor commented 3 years ago

If the only reason that uBlock needs to be disabled is to solve the captcha, then you can easily add it to the whitelist (the captcha was being intentionally blocked before):

At the bottom of the bin/ublock/ublock-settings.txt file, there should be a block of text such as: www.blinkist.com hcaptcha.com * block

Change it to look like this: www.blinkist.com hcaptcha.com * allow
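If you would rather script the change than edit the file by hand, a minimal sketch could look like the following. The function name `allow_hcaptcha` is made up for illustration, and it assumes the rule sits on its own line exactly as quoted above.

```python
def allow_hcaptcha(settings_text: str) -> str:
    """Return the settings text with the hcaptcha rule switched to 'allow'.

    Assumes the rule appears on its own line exactly as quoted in this
    thread; all other lines are passed through unchanged.
    """
    blocked = "www.blinkist.com hcaptcha.com * block"
    allowed = "www.blinkist.com hcaptcha.com * allow"
    lines = []
    for line in settings_text.splitlines():
        if line.strip() == blocked:
            # Keep any leading indentation, swap only the rule itself.
            line = line.replace(blocked, allowed)
        lines.append(line)
    trailing_newline = "\n" if settings_text.endswith("\n") else ""
    return "\n".join(lines) + trailing_newline


# Example call against the file mentioned above:
#   from pathlib import Path
#   path = Path("bin/ublock/ublock-settings.txt")
#   path.write_text(allow_hcaptcha(path.read_text()))
```

Running it twice is harmless, since a file that already contains the allow rule is returned unchanged.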

johndoe-dev00 commented 3 years ago

@rocketinventor I changed ublock-settings.txt to allow hcaptcha.com. It seems to work quite well. I still kept the CLI switch '--no-ublock' in place, as I find it quite useful for troubleshooting.
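For illustration, a boolean switch like '--no-ublock' is typically declared with `argparse` along these lines. This is a sketch, not the scraper's actual argument parser: `build_parser`, the description, and the help text are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a --no-ublock flag (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Hypothetical subset of the scraper's options."
    )
    parser.add_argument(
        "--no-ublock",
        action="store_true",  # False unless the flag is given
        help="start the browser without the uBlock extension",
    )
    return parser


# argparse exposes the flag as `args.no_ublock`:
#   args = build_parser().parse_args(["--no-ublock"])
#   if not args.no_ublock:
#       ...load the uBlock extension...
```

Since the flag defaults to off, uBlock stays enabled for normal runs and is only skipped when troubleshooting.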

FYI: Switching between 'from seleniumwire import webdriver' and 'from selenium import webdriver' (with the latter, book audio scraping does not work) seems to trigger the captchas. Convenient for testing :)