Consider Using Crawlee Cheerio Crawler For CAPTCHA Issue

StephenMP commented 4 months ago

This library is essentially an HTTP web crawler for YOPMail, and part of the problem with automated web scraping is that websites start implementing anti-bot measures like reCAPTCHA.

I've noticed a sharp increase in YOPMail serving reCAPTCHAs and saw another ticket on here which concluded that something about CAPTCHA needed to be implemented. Since the code base is minified and obfuscated, I can't (very easily) see what you are doing for your web requests, but I imagine it's all hard-coded headers, fingerprints, etc., which is why this library has been easy to target for serving CAPTCHAs.

One consideration could be switching to the Crawlee CheerioCrawler to maintain the raw HTTP requests with better browser header and fingerprint rotation to reduce the instances of CAPTCHAs being served.

https://crawlee.dev/docs/guides/cheerio-crawler-guide

I wouldn't mind contributing if you were able to switch to a two-branch structure (a dev branch you work from that is not obfuscated and minified, and a main branch which holds the obfuscated and minified code that is packaged up for the npm package).

jasp402 commented 4 months ago

Hi @StephenMP, it would be my pleasure to have some help. I've sent you an invitation to the original project so you can analyze it in more detail.

StephenMP commented 4 months ago

@jasp402

So, bad news. I've done quite a bit of testing using different crawling techniques and it would appear that YOPMail is quite unbiased in presenting CAPTCHAs. They simply serve a CAPTCHA once an IP address appears to be sending too much traffic.

Ways around this would be:

- Allow a user to provide a proxy configuration and use their proxy provider for the HTTP requests.
- Allow a user to provide a CAPTCHA solver (e.g. 2captcha) and use it when presented with a CAPTCHA. Once the solver returns the reCAPTCHA token, re-issue the request with that token in the rc parameter.

Personally, I think utilizing both of those options would be the ideal path forward.
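A rough sketch of how the solver path could look, assuming rc is sent as a query-string parameter (I haven't confirmed how YOPMail expects it). `withRcToken`, `fetchWithCaptchaRetry`, and the three injected callbacks are hypothetical names, not anything that exists in the library today. A user-supplied proxy would live inside `fetchBody`.

```javascript
// Hypothetical helper: returns a copy of `url` with the solver's token
// placed in the `rc` query parameter (assumed to be a query parameter).
function withRcToken(url, token) {
  const next = new URL(url);
  next.searchParams.set('rc', token);
  return next.toString();
}

// All three callbacks are supplied by the caller (hypothetical shapes):
//   fetchBody(url)    -> Promise<string>  performs the HTTP request
//   isCaptcha(body)   -> boolean          recognises the CAPTCHA page
//   solveCaptcha(url) -> Promise<string>  asks a solver for a token
async function fetchWithCaptchaRetry(url, { fetchBody, isCaptcha, solveCaptcha }) {
  const body = await fetchBody(url);
  if (!isCaptcha(body)) return body;

  // CAPTCHA served: get a token and re-issue the same request once.
  const token = await solveCaptcha(url);
  return fetchBody(withRcToken(url, token));
}
```

Keeping the solver and the HTTP call as injected functions means the library would not have to depend on any particular provider.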

You can tell that you've received a CAPTCHA if the response body when asking for an inbox contains `<div class="mctn"><script>window.top.showRc();</script>`
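As a minimal sketch, that check could be a plain substring test on the response body. The function and constant names here are made up for illustration, and matching only on the `showRc()` call (rather than the whole snippet) is my own choice so that small markup changes don't break it.

```javascript
// Marker taken from the observed CAPTCHA response: the page calls
// showRc() on the top window.
const CAPTCHA_MARKER = 'window.top.showRc()';

// Returns true when an inbox response body looks like the CAPTCHA page.
// `body` is expected to be the raw HTML string of the response.
function isCaptchaResponse(body) {
  return typeof body === 'string' && body.includes(CAPTCHA_MARKER);
}
```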

jasp402 commented 4 months ago

Yes, I have tried to get around the reCAPTCHA issue. The problem with CAPTCHA solvers is that you must install an extension in the browser for them to work.

In the past, when we used PuppeteerJS, it was easy to add and configure those dependencies. But now we limit ourselves to plain requests, so solving it this way becomes almost impossible.

The viable option is a proxy, but as you have determined, it is an ineffective solution. Since the user will have to change their IP as soon as the reCAPTCHA appears, it adds a manual process that does not really offer a solid solution.

I think what I should do is notify the user when the reCAPTCHA appears, so that they know what is happening rather than simply concluding that the library doesn't work for them.
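One possible shape for that notification, as a sketch only: throw a dedicated error when the body is the CAPTCHA page. The class name, the `code` value, and `assertNotCaptcha` are hypothetical and not part of the current library; the marker string is the one reported earlier in this thread.

```javascript
// Hypothetical error type; the name and `code` value are illustrative.
class CaptchaRequiredError extends Error {
  constructor(url) {
    super(`YOPmail served a reCAPTCHA for ${url}; this IP has probably sent too many requests.`);
    this.name = 'CaptchaRequiredError';
    this.code = 'ECAPTCHA';
    this.url = url;
  }
}

// Returns the body unchanged, or throws when it is the CAPTCHA page,
// so callers get an explicit error instead of an empty result.
function assertNotCaptcha(body, url) {
  if (String(body).includes('window.top.showRc()')) {
    throw new CaptchaRequiredError(url);
  }
  return body;
}
```

Callers could then check `err.code === 'ECAPTCHA'` and decide for themselves whether to wait, switch IP, or give up.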

Thank you very much for all the support. I think there are still many things to improve. :D

jasp402 commented 3 months ago

I'm going to proceed to close this ticket. We've already seen that due to the nature with which YOPmail handles the reCAPTCHA

jasp402 / Easy-YOPmail #20