leoncvlt / blinkist-scraper

📚 Python tool to download book summaries and audio from Blinkist.com, and generate some pretty output

Stuck in Cloudflare hCaptcha loop. #31

Closed GermanEngineering closed 3 years ago

GermanEngineering commented 3 years ago

Hello and first of all thank you very much for your work!

It looks like this is exactly the code I was looking for, but unfortunately I can't get it running: I get stuck in an endless Cloudflare hCaptcha loop on https://www.blinkist.com/en/nc/login the first time I try to execute it. The "One more step - Please complete the security check to access - I am human" page appears before I can enter the login information, and no matter how often I solve it, I always end up at the next captcha (I tried at least 9 times in a row).

My system:

I've already tried:

Unfortunately I don't have any other ideas at the moment and feel pretty lost/stupid. Did you encounter this problem before and have an idea how to solve it? Or are there some logfiles or something I can collect that might help in this case?

Thank you very much in advance! Peter

bckncook commented 3 years ago

Same issue here. Looking forward to solution. Thank you!!!

GermanEngineering commented 3 years ago

Hello again,

I tested two more things:

  1. Tried to use cookies from chrome

    • logged in to blinkist in chrome
    • added the chrome_options.add_argument("user-data-dir=C:\Users\Win10x64\AppData\Local\Google\Chrome\User Data\") argument to chromedriver so that it uses the settings from chrome
    • executed get_login_cookies() to get cookies.pkl
    • started initial code with login cookies
    • gui mode is running into Captcha loop again
    • headless mode is running into timeout
    • [1608123763.485][INFO]: Waiting for pending navigations...
      [1608123763.486][INFO]: Done waiting for pending navigations. Status: ok
      [1608123763.493][INFO]: Waiting for pending navigations...
      [1608123763.494][INFO]: Done waiting for pending navigations. Status: ok
      [1608123763.494][INFO]: [6319a21f140a99f67240dc6507ddab98] RESPONSE FindElement ERROR no such element: Unable to locate element: {"method":"class name","selector":"main-banner-headline-v2"}
        (Session info: headless chrome=87.0.4280.88)
      [1608123764.001][INFO]: [6319a21f140a99f67240dc6507ddab98] COMMAND FindElement { "sessionId": "6319a21f140a99f67240dc6507ddab98", "using": "class name", "value": "main-banner-headline-v2" }
  2. Tried selenium with Firefox

    • with driver = selenium.webdriver.Firefox()
    • --> also running into the same Captcha loop

Unfortunately nothing was successful, but maybe it helps to narrow down the root cause of the problem. Thank you very much, again! Peter

leoncvlt commented 3 years ago

It seems like Blinkist / Cloudflare moved from Google's captchas (which worked fine) to hCaptcha, which causes this issue. GermanEngineering's tests suggest the problem is more that Cloudflare detects the Chromedriver, since it persists even with legit cookies. Will need to look into it - any help welcome!

GermanEngineering commented 3 years ago

I found a solution that at least allows me to log in and download the text. It doesn't seem to work in headless mode, though. And with the --audio option I'm running into a json.decoder.JSONDecodeError exception. I don't think this is related to the change I made, but on the other hand I don't know if/how it was working before.

I tried to do a pull request, but I'm not really familiar with the GitHub process, so please excuse me if this is not the correct way to propose a change. In the end it was just adding: chrome_options.add_argument("--disable-blink-features=AutomationControlled") to the Chrome options in the scraper.py
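For readers wondering where that line goes, here is a minimal sketch. The helper names `apply_stealth_flag` and `open_login_page` are hypothetical and not part of scraper.py; the helper only assumes an options object with an `add_argument` method, which Selenium's `ChromeOptions` provides.

```python
# The flag quoted in this comment.
AUTOMATION_FLAG = "--disable-blink-features=AutomationControlled"


def apply_stealth_flag(chrome_options):
    """Add the flag to an options object and return the same object.

    `chrome_options` is assumed to expose add_argument(), as
    selenium.webdriver.ChromeOptions does.
    """
    chrome_options.add_argument(AUTOMATION_FLAG)
    return chrome_options


def open_login_page():
    """Hypothetical usage; needs selenium and a matching chromedriver."""
    from selenium import webdriver  # imported lazily so the helper stays standalone

    options = apply_stealth_flag(webdriver.ChromeOptions())
    driver = webdriver.Chrome(options=options)
    driver.get("https://www.blinkist.com/en/nc/login")
    return driver
```

As later comments in this thread show, this flag alone was not enough for everyone.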

Hope this helps.

wywywywy commented 3 years ago

That's weird. I tried all these options and it still won't let me through the hcaptcha.

    chrome_options.add_argument("--disable-dev-shm-usage")
    chrome_options.add_argument("--disable-blink-features=AutomationControlled")
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation", "enable-logging"])
    chrome_options.add_experimental_option('useAutomationExtension', False)

wywywywy commented 3 years ago

It'd be much better to convert this from Selenium to Puppeteer.

I just tried Puppeteer and that works well, especially with the Stealth plugin.

rocketinventor commented 3 years ago

I think that there used to be a chrome extension from Cloudflare that bypasses their captcha page. Perhaps that would help? Has anyone tried it?

wywywywy - Do you think that you could create a new branch with your changes and make a pull-request with the Puppeteer-based code? Thanks!

mikaelaatan commented 3 years ago

Hello, I'm not familiar with how GitHub works, but I'll just share what worked for me. I added chrome_options.add_argument("--disable-blink-features=AutomationControlled") from GermanEngineering's suggestion.

At first it worked, but in the following sessions it started going back to the captcha again. The workaround is, after logging in and when it goes to the Cloudflare site, to redirect the browser back to the Blinkist.com homepage. This is when the log says "waiting for user to solve recaptcha and login". After that, the scraper will proceed as expected.
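The redirect described above was presumably done by hand, but it could be sketched in code roughly as follows. This is an illustration only: the function name `leave_challenge_page`, the marker text (taken from the page wording quoted in the first post) and the homepage URL are assumptions, and the `driver` is any object with `page_source` and `get()`, as a Selenium WebDriver has.

```python
# Assumption: wording of the challenge page as quoted in the first post.
CHALLENGE_MARKER = "One more step"
# Assumption: the homepage to return to.
HOME_URL = "https://www.blinkist.com"


def leave_challenge_page(driver, marker=CHALLENGE_MARKER, home_url=HOME_URL):
    """If the current page looks like the security check, navigate back home.

    Returns True when a redirect was issued, False otherwise.
    """
    if marker.lower() in driver.page_source.lower():
        driver.get(home_url)
        return True
    return False
```

Whether Cloudflare keeps using that exact wording is not something this thread can confirm, so the marker may need adjusting.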

flowni commented 3 years ago

Hello, I encounter the same problem as you guys, getting stuck in the infinite captcha loop...

I think we definitely have to add this line: chrome_options.add_argument("--disable-blink-features=AutomationControlled"). I also added headers and a user-data-dir to always use the same profile every time, but that's not enough, as the loop still appears, as already mentioned.

As a first quick fix, it worked for me to change from seleniumwire webdriver to the "normal" selenium webdriver. Doing this you can at least scrape the texts but to get the audio files you need to have access to the request tab, so audio scraping won't work any longer with this. Does someone have an idea why the website could know it's a bot with seleniumwire webdriver with the exact same settings of the selenium webdriver?

Edit: I think the problem has something to do with the certificate as selenium-wire issues its own certificate (selenium-wire manual). I already added the Selenium Wire CA to Chrome's Authorities section, but the problem remains.

rocketinventor commented 3 years ago

It could be that selenium-wire is adding some variable to the page that the anti-bot script is able to detect (and selenium doesn't use it). A version compiled to use a different variable name could probably fix this.

I don't have my test rig available right now, but has anyone tried to use the official Cloudflare/HCaptcha bypass extension: "Privacy Pass" in their tests? (https://chrome.google.com/webstore/detail/privacy-pass/ajhmfdgkijocedmfjonnpjfojldioehi)

Another option might be to switch over to "Pyppeteer" (an unofficial port of Puppeteer on PyPI). I have not read the documentation on it yet, so it could be that it does not provide enough information for the audio downloads.

There should be a way to do it without a headless Chrome browser (using requests and JSEval, like youtube-dl).

usb4 commented 3 years ago

I also run into the hCaptcha loop but can get around it with the following arguments:

    # prevent Cloudflare from detecting ChromeDriver as bot
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option('useAutomationExtension', False)
    chrome_options.add_argument("--disable-blink-features=AutomationControlled")

Even without these arguments, I find that my first scrape attempt after 12+ hours usually avoids triggering the captcha.

However, audio scraping still doesn't work.

[13:09:39] WARNING Could not find audio url in request, aborting audio scrape...
[13:09:39] ERROR Error processing audio url, aborting audio scrape...

leoncvlt commented 3 years ago

In my tests, I had to override the user agent as well, on top of implementing @usb4's flags. Even then, it still asked for the captcha when making a request for the blink's audio files.

Reading around, I found this discussion - https://stackoverflow.com/questions/32795460/loading-json-object-in-python-using-urllib-request-and-json-modules - and magically, yes, using urllib.request instead of requests doesn't seem to trigger the captcha. I tried implementing the other approach they suggested, where you connect to the IP address instead of the host, but was getting some SSL problems.
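To illustrate the urllib.request approach, here is a minimal sketch, not the code from the commit mentioned below. The function names, the user agent string and the cookie handling are assumptions made for the example.

```python
import json
import urllib.request

# Assumption: any ordinary desktop browser string; the thread does not say
# which user agent was actually used.
BROWSER_USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def build_request(url, cookies=None, user_agent=BROWSER_USER_AGENT):
    """Build a urllib request carrying a browser user agent and optional cookies."""
    headers = {"User-Agent": user_agent}
    if cookies:
        # Join a {name: value} dict into a single Cookie header.
        headers["Cookie"] = "; ".join(f"{k}={v}" for k, v in cookies.items())
    return urllib.request.Request(url, headers=headers)


def fetch_json(url, cookies=None):
    """Fetch `url` and decode the body as JSON (performs a network call)."""
    with urllib.request.urlopen(build_request(url, cookies)) as response:
        return json.loads(response.read().decode("utf-8"))
```

Building the request separately from sending it makes it easy to inspect the headers before any network call is made.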

I pushed my changes in f4cab052ae1fb9789bac8e01c5f77734775c936d, tested (albeit only on the free daily book) and seems to work fine on my end.

rocketinventor commented 3 years ago

Leonardo, which user agent did you use with requests? The default one is a scraper user-agent. That could be why 'urllib.request' "magically" works.
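The default user agent is easy to inspect. The sketch below does this for urllib from the standard library; for requests, `requests.utils.default_user_agent()` returns the corresponding default, but that is left out here to keep the example free of third-party imports.

```python
import urllib.request

# A fresh opener carries the headers urllib sends when none are set explicitly.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# The value starts with "Python-urllib/" followed by the Python version.
print(default_headers["User-agent"])
```

Both libraries identify themselves as scripts by default, so comparing them fairly means setting the same explicit user agent on each.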

GermanEngineering commented 3 years ago

Thank you very much leoncvlt!

leoncvlt commented 3 years ago

> Leonardo, which user agent did you use with requests? The default one is a scraper user-agent. That could be why 'urllib.request' "magically" works.
>
> In my tests (Windows 10), it was enough to switch from 'seleniumwire.webdriver' to 'selenium.webdriver' (Flowni's "quick fix") and maybe also add in the "--disable-blink-features=AutomationControlled" argument (as per Peter's comment). However, it doesn't seem like any of the other arguments/lines, user-agents, data-dirs, etc, are needed at all. Perhaps those arguments could even prevent selenium-wire from accessing the audio URL's/requests properly.
>
> As far as the audio goes, it looks like there is a hard-coded URL now that points to the chapter audio... If so, it might be possible to completely ditch the chrome/selenium web-driver (except maybe to get the cookies). That should really get its own issue / pull-request, so I won't discuss the details much here.

In my case, the user agent was needed to access the actual library / books pages, not specifically for the audio files.

I'm using selenium wire to capture the original audio files request and re-use the cookies / auth information to request the rest of the audio blinks - if anyone can come up with an alternative way of accomplishing this, we could scrap the selenium wire requirements 😃
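For anyone unfamiliar with that mechanism, a rough sketch follows. It is not the scraper's actual code: the function names, the "audio" URL marker and the list of headers worth keeping are assumptions. It relies only on captured requests exposing `url` and `headers`, which the entries in selenium wire's `driver.requests` do.

```python
def find_audio_request(captured_requests, marker="audio"):
    """Return the first captured request whose URL contains `marker`, else None.

    `marker` is an assumed substring; the real audio URLs may differ.
    """
    for request in captured_requests:
        if marker in request.url:
            return request
    return None


def reusable_headers(request, wanted=("Cookie", "Authorization", "User-Agent")):
    """Copy selected headers (matched case-insensitively) from a captured request."""
    wanted_lower = {name.lower() for name in wanted}
    return {k: v for k, v in request.headers.items() if k.lower() in wanted_lower}
```

With a selenium wire driver this would be called as `find_audio_request(driver.requests)`, and the returned headers could then be attached to follow-up requests for the remaining audio blinks.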