leoncvlt / blinkist-scraper

📚 Python tool to download book summaries and audio from Blinkist.com, and generate some pretty output

Cloudflare blocking scraping #46

Open davidelionetti opened 3 years ago

davidelionetti commented 3 years ago

Hello, I have just started using this library and everything seems to be set up correctly. I ran python blinkistscraper email password with my credentials, and Cloudflare unfortunately detects (I assume) automated activity and blocks me from navigating to Blinkist.com in the browser instance opened by the script.

Any ideas?

Riviss commented 3 years ago

I think this issue may be the same as the "Captcha taking longer than expected" one. Take a look to see if your problem is the same; if so, hopefully one of the solutions posted there will work for you.

klochden commented 3 years ago

Yes, Cloudflare definitely detects unusual activity and you land in an endless cycle of captchas. No matter what I try, it doesn't let me through.

vongyver commented 3 years ago

I am experiencing the same thing as klochden. I have tried adjusting and even disabling uBlock, with no success. If there is any way I can assist with debugging or testing, let me know.

rocketinventor commented 3 years ago

@klochden @vongyver

Are you running chrome in headless mode when you face this issue?

vongyver commented 3 years ago

I get the Chrome app popup and complete the captcha, but keep failing. I have even tried disabling uBlock in various ways. I have also signed into Blinkist on regular Chrome, which works, but the script still fails. Happy to test any specifics. Thank you!

johndoe-dev00 commented 3 years ago

FYI: uBlock can be disabled using the --no-ublock switch.

I also got the Cloudflare captcha loop; this seems to be new. Currently this workaround is working for me: in scraper.py, change from seleniumwire import webdriver to from selenium import webdriver. This fixes the Cloudflare issue, but it will not allow you to download the audio files, as that part requires seleniumwire; everything else should work, though. Let me know if this allows you to log in. Will look into a fully functioning fix.
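
For reference, a minimal sketch of what that import swap looks like at the top of scraper.py; the `USE_SELENIUMWIRE` switch below is hypothetical (not an existing option), just a way to keep both paths side by side:

```python
# Sketch only: seleniumwire is what captures the audio request URLs, while plain
# selenium avoids the Cloudflare captcha loop. Pick one at import time.
USE_SELENIUMWIRE = False  # hypothetical flag, not part of the scraper's CLI

if USE_SELENIUMWIRE:
    from seleniumwire import webdriver  # original import; enables driver.wait_for_request
else:
    from selenium import webdriver      # workaround import; no audio capture
```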

klochden commented 3 years ago

Hey, thank you very much! I will try it out tomorrow since it's very late today. But to me, the audio files are the most important target, so I hope you can figure out how to get the script fully working. I will let you know tomorrow! Thanks again! Regards

klochden commented 3 years ago

Are you running chrome in headless mode when you face this issue?

Hello, no, I don't, because it is not recommended in the readme file. Thanks!

klochden commented 3 years ago

FYI: uBlock can be disabled using the --no-ublock switch.

I also got the cloudflare captcha loop. This seems to be new. Currently this workaround seems to be working for me: In scraper.py change from seleniumwire import webdriver to from selenium import webdriver This fixes the cloudflare issue, but this will not allow you to download the audio files, as that part requires seleniumwire, everything else should work, though. Let me know if this allows you to login. Will look into a fully functioning fix.

Yes, it worked now and I was able to log in without an issue. Also, the SSL certificate was active, which I think is important for Cloudflare! But I only got a JSON text file, no audio. Hopefully someone can restore the full functionality so everything can be downloaded. Thanks to all!

klochden commented 3 years ago

By the way, with the chrome addon "Audio Downloader Prime" I could manually download the audio files without an issue. Maybe there is a possibility to implement an automated solution?

rocketinventor commented 3 years ago

I've looked into this issue a little bit...

The project is using an old version of seleniumwire (2.1.2 vs the newest version, 4.2.4). This could be part of why Cloudflare is flagging it so often. If the package is upgraded we can fix a lot of issues and take advantage of new features.

For example:

4.1.1 (2021-02-26) Integration with undetected-chromedriver.

Also, we might be able to remove the seleniumwire/mitmproxy requirement completely by using the Chrome DevTools Protocol directly (a rough sketch of that idea is below).

I will try to look into those two things.
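
As a rough sketch of the DevTools idea (assuming Selenium 4 and that the audio requests can be recognised by "audio" appearing in their URL; untested against Blinkist), the performance log exposes the network events without needing seleniumwire:

```python
import json
from selenium import webdriver

options = webdriver.ChromeOptions()
# ask chromedriver to record DevTools network events in the "performance" log
options.set_capability("goog:loggingPrefs", {"performance": "ALL"})
driver = webdriver.Chrome(options=options)

driver.get("https://www.blinkist.com/")  # in practice: log in and open a blink's player first

audio_urls = []
for entry in driver.get_log("performance"):
    event = json.loads(entry["message"])["message"]
    if event["method"] == "Network.responseReceived":
        url = event["params"]["response"]["url"]
        if "audio" in url:  # assumption about how the audio endpoint looks
            audio_urls.append(url)
print(audio_urls)
```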

jonaschn commented 3 years ago

Manually using the privacy-pass extension makes scraping audio, e.g., of the daily book, possible again because you get 30 passes when solving 1 captcha.

vongyver commented 3 years ago

Manually using the privacy-pass extension makes scraping audio, e.g., of the daily book, possible again because you get 30 passes when solving 1 captcha.

I am able to add privacy-pass to my regular Chrome and collect the 30 passes. When I run the scraper, the extension does not appear in the Chrome instance it opens, and I am still being asked to deal with the captchas, which still loop. How do we add privacy-pass to that instance? Thanks!

jonaschn commented 3 years ago

I did not automate this process, but I increased the time allowed for solving the captcha and then manually installed privacy-pass in the Chrome instance opened when running the scraper. For now, this needs to be done every time the scraper is run, but it could definitely be automated similarly to the uBlock extension; see the sketch below. Maybe @leoncvlt or someone else has some time to automate this process.
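
Automating that the same way the repo bundles uBlock would look roughly like the sketch below; the folder and .crx filename are made up, and you would still have to solve one captcha manually to collect the passes:

```python
import os
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()

# hypothetical location: a locally downloaded Privacy Pass .crx kept next to the uBlock one
privacy_pass_crx = os.path.join(os.getcwd(), "bin", "privacy-pass", "privacy-pass.crx")
if os.path.exists(privacy_pass_crx):
    chrome_options.add_extension(privacy_pass_crx)

driver = webdriver.Chrome(options=chrome_options)
```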

vongyver commented 3 years ago

Thanks for the feedback. I poked around, but I have no idea how to add privacy-pass to the Chrome instance or increase the time. I am not really a developer, more of a hack; I know my limits. All good. I hope that leoncvlt is able to fix it soon.

ilearnio commented 3 years ago

Same issue. Solving one captcha just brings up another, and so on, so I can't get this script to work.

hxh103 commented 3 years ago

@vongyver you can change scraper.py (currently line 180): WebDriverWait(driver, 60) --> WebDriverWait(driver, 360). That changes the timeout from 60 seconds to 360 seconds.
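
In context, only the timeout argument changes; the wait condition shown here is illustrative and the actual one in scraper.py may differ:

```python
from selenium.webdriver.support.ui import WebDriverWait

def wait_for_login(driver, timeout=360):
    # was 60; 360 leaves time to solve the captcha (or install an extension) by hand.
    # The condition below is only an example, not the scraper's real one.
    return WebDriverWait(driver, timeout).until(lambda d: "blinkist.com" in d.current_url)
```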

I did try the privacy-pass method from @jonaschn, but it does not work for me.

I changed to selenium and it works, minus being able to download the audio, which is a huge bummer. Hopefully this gets fixed soon.

vongyver commented 3 years ago

No luck on that change. I didn't think it was the 60-second limit; I am solving two sets of images within 20 seconds. I get the classic bicycle or boat and, after completing it, the session flips to the Blinkist login screen and then back to the "not a robot" checkbox and image sets again, repeatedly. It does not look like it's trying to enter the passed credentials.

Just to clarify, I have confirmed that I have the latest version, having cloned fresh a couple of times. One note: I had a password with an "&" in it and had trouble passing that to blinkistscraper, so I changed the password.

I had no luck with adding privacy-pass either.

Thanks for the recommendation. Happy to test what's offered.

hxh103 commented 3 years ago

It was more to give you enough time to manually add the privacy-pass extension to that instance of Chrome before the timeout, as first suggested. Anyway, that did not work for me, and I assume it would not for you either.

hxh103 commented 3 years ago

So I found a solution that worked for me. It requires a bit of manual work, but it downloads audio now. At least for me, the problem seems to be the user-agent and the old selenium-wire version identified by @rocketinventor.

So this worked for me:

  1. Make a new environment in Anaconda (just to be sure there are no other incompatibility issues).
  2. Install all the required packages manually using pip (chromedriver-autoinstaller colorama EbookLib requests selenium selenium-wire). Don't use requirements.txt, as it will force the pinned versions on you. I tried just changing the user-agent, but it didn't work for me without updating the packages in a fresh environment (I didn't look into it much more). I also tried not changing the user-agent and only using the new package versions; that also did not work for me.
  3. Clone the repo.
  4. Change line 180 in scraper.py to allow time to manually install the extension: WebDriverWait(driver, 60) --> WebDriverWait(driver, 360)
  5. Run the scraper as you normally would from the command line.
  6. Install the user-agent switcher: https://chrome.google.com/webstore/detail/user-agent-switcher-for-c/djflhoibgkdhkhhcedjiklpkjnoahfmg/related
  7. Click on the user-agent extension and change your user agent to something else (like Safari).
  8. Refresh the Blinkist page; it shouldn't force you through Cloudflare anymore, at least until Cloudflare changes something, lol.

If it's only a matter of changing the user-agent, this should be easy to implement in the Chrome options. Alternatively, the annoyance of manually installing the extension could be avoided by bundling it the way uBlock is.

The scraping script works so far with the updated packages, but I haven't done any extensive testing. I didn't need the privacy-pass extension, but if the above doesn't work for you, try installing it manually to check.

kotobuki09 commented 3 years ago

Your method is working perfectly for me as well! Thank you for keeping this working. For some of the audio I got the error below, but the majority seems to work fine.

ERROR Request timed out or other unexpected error: HTTP Error 401: Unauthorized
ERROR Error processing audio url, aborting audio scrape...

hxh103 commented 3 years ago

Due to this error (#58) happening to me all the time, I got annoyed with having to reinstall the user-agent extension every morning. I implemented two options to change the user-agent. Unfortunately, both require manual clicking, but it's less work than the solution above.

  1. change user-agent at start: add the following line in scraper.py (I added it at line 88).
chrome_options.add_argument("user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/7046A194A")

This will change it to a Safari user-agent. If this user-agent gets flagged by Cloudflare, just change it to another one. I think anything other than a Chrome user-agent should work. This option always required me to solve the captcha at least once, so it's a little annoying. I tried option 2 below to see if I could get around solving the captcha.

  2. load extension at start: download the user-agent extension as a .crx file (google it if you don't know how) and place it in the bin folder (like uBlock); for me, it's in bin\useragent\User-Agent-1.1.0.crx. It can be anywhere as long as you point to it correctly in the code below. Then add the line below in scraper.py (I added it after line 88, as I left the first option in).
chrome_options.add_extension(os.path.join(os.getcwd(), "bin", "useragent", "User-Agent-1.1.0.crx"))

I did not have to solve the captcha with this route, but I did have to click on the extension to change the user-agent and then reload the page. I don't know how to set the user-agent from this extension automatically, but maybe that would save the clicking. If someone knows how to do this or has a better solution that doesn't require any manual clicking or captcha, that would be awesome.

vongyver commented 3 years ago

hxh103, thanks for the recommendations; glad to see it's working for you. It's not working for me: I tried both options and switched agents about 6 times with reloads, and I'm still getting the hCaptcha cycle. I expect my issue may be a little different. I am not sure what Cloudflare is using for browser fingerprinting, but I may be blocking that too.

kotobuki09 commented 3 years ago

Due to this error (#58) happening to me all the time, I got annoyed with having to reinstall the user-agent extension every morning. I implemented two options to change the user-agent. […]

Working like a charm in my case, thanks hxh103. Sometimes I still get network blocks or errors while backing up all the files, but it's already good enough!

kotobuki09 commented 3 years ago

hxh103, thanks for the recommendations; glad to see it's working for you. It's not working for me: I tried both options and switched agents about 6 times with reloads, and I'm still getting the hCaptcha cycle. […]

Have you tried downloading from a different network? You've already created a similar environment to mine and hxh103's, so the only problem left might be your network, firewall, and so on.

vongyver commented 3 years ago

Update: I was able to get it working by disabling my Pi-hole DNS for a minute during login. I am using Pi-hole for DNS to block tracking, ads and malware. I expect there is something in one of the lists that is causing an issue. I will see if I can find it and, if so, pass it along.

I also discovered, for me at least, that I could not have a "&" in my password, as the script was not handling that properly, even with quotes.

Great to see this scraping again. Thank you hxh103!!

mandliya commented 3 years ago

None of the above options worked for me! I keep getting thrown to the captcha page and didn't even manage to log in once (I tried the extension as well as changing the user-agent at load). I will keep an eye on this thread in case someone runs into a similar issue and is able to solve it.

Thank you for the amazing tool.

mandliya commented 3 years ago

If anyone is still stuck with this, use undetected-chromedriver. Replace your driver with it, fix a few errors caused by unsupported options, and voilà, it works! 😊
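
A minimal sketch of that swap (assuming the undetected-chromedriver package of that era; the option handling below is illustrative, not the scraper's actual initialize_driver code):

```python
import undetected_chromedriver as uc  # older releases: import undetected_chromedriver.v2 as uc

def initialize_driver_sketch(headless=False):
    # Sketch only: stands in for the webdriver.Chrome(...) call in scraper.py.
    options = uc.ChromeOptions()
    if headless:
        options.add_argument("--headless")
    # pass plain arguments only; experimental options such as excludeSwitches
    # get rejected with undetected-chromedriver, so leave them out
    return uc.Chrome(options=options)

# usage: driver = initialize_driver_sketch()
```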

vongyver commented 3 years ago

Did you need to change scraper.py to import this or anything? Are you using the original scraper.py? Thanks Ravi.

mandliya commented 3 years ago

Yes, I imported it in scraper.py to replace the existing selenium Chrome driver. undetected-chromedriver has examples in its repo.

leoncvlt commented 2 years ago

Can anyone confirm undetected-chromedriver does indeed fix the issue? If so, might be time for a PR 😄

fugohan commented 2 years ago

I have tried to use the undetected-chromedriver but I can't fix this error message. Can somebody help me?

python3 blinkistscraper ********@***m ******** --language de --audio --concat-audio --keep-noncat
[14:22:24] INFO Starting scrape run...
[14:22:25] INFO Initialising chromedriver at /home/user/.local/lib/python3.8/site-packages/chromedriver_autoinstaller/97/chromedriver...
[14:22:26] ERROR Message: invalid argument: cannot parse capability: goog:chromeOptions
from invalid argument: unrecognized chrome option: excludeSwitches
  (Driver info: chromedriver=97.0.4692.20 (6559bb085abcaedffe35d268b3546c43f755151c-refs/branch-heads/4692@{#186}),platform=Linux 5.11.0-40-generic x86_64)
Traceback (most recent call last):
  File "/home/user/Downloads/blinkist-scraper/blinkistscraper/__main__.py", line 412, in <module>
    main()
  File "/home/user/Downloads/blinkist-scraper/blinkistscraper/__main__.py", line 319, in main
    driver = scraper.initialize_driver(
  File "/home/user/Downloads/blinkist-scraper/blinkistscraper/scraper.py", line 102, in initialize_driver
    driver = uc.Chrome(version_main=97,
  File "/home/user/.local/lib/python3.8/site-packages/undetected_chromedriver/v2.py", line 302, in __init__
    super(Chrome, self).__init__(
  File "/home/user/.local/lib/python3.8/site-packages/selenium/webdriver/chrome/webdriver.py", line 70, in __init__
    super(WebDriver, self).__init__(DesiredCapabilities.CHROME['browserName'], "goog",
  File "/home/user/.local/lib/python3.8/site-packages/selenium/webdriver/chromium/webdriver.py", line 93, in __init__
    RemoteWebDriver.__init__(
  File "/home/user/.local/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 268, in __init__
    self.start_session(capabilities, browser_profile)
  File "/home/user/.local/lib/python3.8/site-packages/undetected_chromedriver/v2.py", line 582, in start_session
    super(Chrome, self).start_session(capabilities, browser_profile)
  File "/home/user/.local/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 359, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/home/user/.local/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 424, in execute
    self.error_handler.check_response(response)
  File "/home/user/.local/lib/python3.8/site-packages/selenium/webdriver/remote/errorhandler.py", line 247, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.InvalidArgumentException: Message: invalid argument: cannot parse capability: goog:chromeOptions
from invalid argument: unrecognized chrome option: excludeSwitches
  (Driver info: chromedriver=97.0.4692.20 (6559bb085abcaedffe35d268b3546c43f755151c-refs/branch-heads/4692@{#186}),platform=Linux 5.11.0-40-generic x86_64)

[14:22:26] CRITICAL Uncaught Exception. Exiting...

bl4ckOut commented 2 years ago

Can anyone confirm undetected-chromedriver does indeed fix the issue? If so, might be time for a PR 😄

Yes, I can confirm it. Just like @mandliya mentioned, undetected-chromedriver fixes the infinite captcha loop from Cloudflare.

fugohan commented 2 years ago

I have tried to use the undetected-chromedriver but I can't fix this error message. Can somebody help me?

python3 blinkistscraper --language de --audio --concat-audio --keep-noncat
[…]
selenium.common.exceptions.InvalidArgumentException: Message: invalid argument: cannot parse capability: goog:chromeOptions
from invalid argument: unrecognized chrome option: excludeSwitches
[14:22:26] CRITICAL Uncaught Exception. Exiting...

@fugohan to solve it, see my comment below. Also, FYI, your email and password are exposed.

I will check your fix out now, and thank you for mentioning the email ^^ Can you also edit it out?

fugohan commented 2 years ago

Can anyone confirm undetected-chromedriver does indeed fix the issue? If so, might be time for a PR 😄

It did fix the issue for me, but I had to comment out the undetected code inside scraper.py (lines 69 to 88) for it to work.

I got it to work, but I can't scrape any audio; I get this error message:

[21:27:44] ERROR 'Chrome' object has no attribute 'wait_for_request'
Traceback (most recent call last):
  File "blinkistscraper/__main__.py", line 412, in <module>
    main()
  File "blinkistscraper/__main__.py", line 368, in main
    dump_exists = scrape_book(
  File "blinkistscraper/__main__.py", line 257, in scrape_book
    audio_files = scraper.scrape_book_audio(
  File "blinkistscraper/scraper.py", line 526, in scrape_book_audio
    captured_request = driver.wait_for_request("audio", timeout=30)
AttributeError: 'Chrome' object has no attribute 'wait_for_request'
[21:27:44] CRITICAL Uncaught Exception. Exiting...
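
wait_for_request (and driver.requests, seen in a later traceback) are seleniumwire APIs, so they disappear as soon as the driver is swapped for undetected-chromedriver. A hypothetical guard (not in the repo) that at least fails gracefully instead of crashing:

```python
def scrape_audio_if_supported(driver, scrape_audio_fn):
    # driver.wait_for_request / driver.requests only exist on seleniumwire drivers;
    # with undetected-chromedriver the audio scrape has to be skipped (or replaced,
    # e.g. by reading the DevTools performance log as sketched earlier in the thread).
    if not hasattr(driver, "wait_for_request"):
        print("Driver has no seleniumwire request capture; skipping audio scrape")
        return []
    return scrape_audio_fn(driver)
```
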
fugohan commented 2 years ago

@orenaksakal are you able to download audio with this fix?

fugohan commented 2 years ago

@orenaksakal are you able to download audio with this fix?

Yes, I'm able to download epub, html and audio (concatenation works too). What I do at that point is revert back to the selenium webdriver and use the Chrome plugin mentioned above to switch the user-agent.

You mean after you saved the cookies, am I right?

forhobbie commented 2 years ago

@orenaksakal are you able to download audio with this fix?

Yes, I'm able to download epub, html and audio (concatenation works too). What I do at that point is revert back to the selenium webdriver and use the Chrome plugin mentioned above to switch the user-agent.

Hi @orenaksakal, I scraped all the PDFs already, but I am having the same issue when downloading audio. Apparently, undetected-chromedriver cannot scrape audio, since that is done via a seleniumwire function.

[17:10:37] INFO Getting all books for category Entrepreneurship...
[17:10:44] INFO Found 216 books
[17:10:44] INFO Scraping book at https://www.blinkist.com/en/books/15-secrets-successful-people-know-about-time-management-en-kevin-kruse
C:\Users\luisa\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\selenium\webdriver\remote\webelement.py:446: UserWarning: find_element_by_* commands are deprecated. Please use find_element() instead
  warnings.warn("find_element_by_* commands are deprecated. Please use find_element() instead")
[17:10:48] ERROR requests
Traceback (most recent call last):
  File "C:\Users\luisa\.a python\blinkist\.a test\blinkistscraper\__main__.py", line 412, in <module>
    main()
  File "C:\Users\luisa\.a python\blinkist\.a test\blinkistscraper\__main__.py", line 368, in main
    dump_exists = scrape_book(
  File "C:\Users\luisa\.a python\blinkist\.a test\blinkistscraper\__main__.py", line 257, in scrape_book
    audio_files = scraper.scrape_book_audio(
  File "C:\Users\luisa\.a python\blinkist\.a test\blinkistscraper\scraper.py", line 513, in scrape_book_audio
    del driver.requests
AttributeError: requests
[17:10:48] CRITICAL Uncaught Exception. Exiting...

How do you revert back to selenium? Could you please explain in a little more detail what you do after login, or what needs to change in scraper.py to make it revert automatically after a successful login?

After passing the captcha with undetected-chromedriver, I tried to run the program again with the default driver, but the new window that opens goes back to the captcha loop. I also tried modifying scraper.py and substituting the selenium webdriver with uc, but nothing. I am not very experienced in Python and I ran out of ideas :( I'd appreciate it if you could share your solution! Thank you for your help, guys!

raspgalax commented 1 year ago

I also have the issue where it just says "www.blinkist.com needs to review the security of your connection before proceeding." and I'm being blocked by Cloudflare. None of the solutions above work :(

fanishjain commented 1 year ago

Hello, I'm stuck at the captcha page. I was wondering if you got it solved; if so, maybe I can use your solution.