ulixee / secret-agent

The web scraper that's nearly impossible to block - now called @ulixee/hero
https://secretagent.dev
MIT License
667 stars 44 forks source link

Captcha resolver Support #235

Open ctaity opened 3 years ago

ctaity commented 3 years ago

Do you have plans to support captcha resolvers, for example h2captcha?

like https://www.npmjs.com/package/puppeteer-extra-plugin-recaptcha

blakebyrnes commented 3 years ago

Yeah, we'll have some support. We just haven't had the direct need or time yet to address them. Obviously it's a needed solution though. Are you a fan of the puppeteer-extra approach?

ctaity commented 3 years ago

Not a fan, but the stealth and recaptcha plugins work. I am a scraper; I use anything that works lol

ctaity commented 3 years ago

Can you give me an example of how to use the current captcha resolver support?

blakebyrnes commented 3 years ago

We don't have support at the moment. I meant to answer that yes, we plan to support it. We just haven't done so yet.

magicaltoast commented 3 years ago

I would like to implement it. I found an article that explains how to do it from scratch with puppeteer.

But that solution requires injecting the response from the captcha solver directly into the DOM, and I couldn't figure out how to do that in SecretAgent.

If someone could explain how to do it, I will try to bring captcha solving into SecretAgent ;)

Have a nice day

Here is the snippet that I am struggling to implement in SecretAgent:

    await page.evaluate(`document.getElementById("g-recaptcha-response").innerHTML="${response}";`);

maximseshuk commented 3 years ago

@magicaltoast https://github.com/ulixee/secret-agent/tree/main/plugins/execute-js
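
For anyone following along, here is a minimal sketch of how the injection script from the snippet above could be built before handing it to an evaluate-style call. The helper name `buildTokenInjectionScript` is hypothetical and not part of SecretAgent or puppeteer; the sketch also assumes the response field is a textarea, so it sets `value` in addition to the `innerHTML` used in the original snippet. It only builds the script string; how the execute-js plugin is invoked is not shown in this thread, so that call is left out.

```javascript
// Hypothetical helper (not a SecretAgent or puppeteer API): builds a script
// string that writes a captcha solver token into the response field.
function buildTokenInjectionScript(token, fieldId = 'g-recaptcha-response') {
  // JSON.stringify produces valid JavaScript string literals, so quotes,
  // backslashes, or newlines in the values cannot break out of the script.
  const id = JSON.stringify(fieldId);
  const value = JSON.stringify(token);
  return `(() => {
    const field = document.getElementById(${id});
    if (!field) return false; // field not on the page (yet)
    field.value = ${value};     // assumption: the field is a textarea
    field.innerHTML = ${value}; // what the original snippet set
    return true;
  })()`;
}
```

With puppeteer, the resulting string could be passed as `await page.evaluate(buildTokenInjectionScript(response))`. The escaping matters because the original template string breaks if the token ever contains a double quote.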

calebjclark commented 3 years ago

@magicaltoast, that was a quick find! We only pushed that code last night ;)

@maximseshuk, the plugin support needed for implementing a captcha resolver is in the new version pushed last night. However, the documentation for plugins is not ready yet. I will work on it Thursday and push it to the website by the end of the week.

ctaity commented 3 years ago

If you need a tester, I have an account with the h2captcha service :D

magicaltoast commented 3 years ago

I am trying to code something more 'advanced' than just injecting a token every time a captcha is encountered, and currently I am struggling to wait for either of two elements. My current workaround is to specify a timeout on both waits, so the wait for the element that never appears just throws an error. That is not an optimal solution, because the browser keeps waiting the whole time.

        const result = await Promise.race([
            mark_function(async () => { await challengeFrame.waitForElement(failureSelector, { waitForVisible: true, timeoutMs: 7500 }) }, 0),
            mark_function(async () => { await challengeFrame.waitForElement(successSelector, { waitForVisible: true, timeoutMs: 7500 }) }, 1)
        ])
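
One way to keep the losing wait from surfacing as an error is to race on the first wait that succeeds rather than the first that settles. The sketch below uses the standard `Promise.any` for that. The helper `firstToSucceed` and the two stand-in waiters are hypothetical names used so the example runs on its own; in real use each waiter would wrap a `challengeFrame.waitForElement(...)` call like the ones above. Note that this does not cancel the losing wait: it still runs until its own timeout, it just no longer rejects the overall result.

```javascript
// Resolves with the index of the first waiter that succeeds.
// Rejections (for example, a timeout on the element that never appears)
// are ignored unless every waiter fails.
function firstToSucceed(waiters) {
  return Promise.any(
    waiters.map((wait, index) => Promise.resolve().then(wait).then(() => index)),
  );
}

// Stand-ins for challengeFrame.waitForElement(...) so the sketch is runnable.
const appearsAfter = (ms) => () =>
  new Promise((resolve) => setTimeout(resolve, ms));
const timesOutAfter = (ms) => () =>
  new Promise((_, reject) => setTimeout(() => reject(new Error('timeout')), ms));

async function demo() {
  // Index 0 = failure selector (never appears), index 1 = success selector.
  const winner = await firstToSucceed([timesOutAfter(10), appearsAfter(50)]);
  console.log(winner === 1 ? 'captcha solved' : 'captcha failed');
}

demo();
```

If every waiter rejects, `Promise.any` rejects with an `AggregateError`, which is the case where neither element showed up before the timeouts.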

Currently I am working on recaptcha and this is my progress so far

I am planning to integrate CaptchaHarvester and PocketSphinxJs tomorrow

calebjclark commented 3 years ago

@magicaltoast, yes, we have a feature deficiency in not having an easy/good way to wait for either of two elements. We are aware of this, and we are actively discussing some options.

Can't wait to use your captcha resolver! Thanks for taking the lead on this!

ctaity commented 3 years ago

@magicaltoast do you have a prototype of the captcha resolver? I need to develop one for hCaptcha and reCAPTCHA, so maybe I can contribute to yours?

Thanks

magicaltoast commented 3 years ago

Here is my dirty version of the captcha resolver: https://gist.github.com/magicaltoast/1fe097b92272aad8a972a52fe87968c2 . I got stuck because there is no good speech-to-text model for Node, and I currently don't have enough time to convert a Python model to ONNX and then write C/C++/Rust bindings for Node. I really don't want to run a model in WebAssembly or JavaScript because, in my mind, it's a waste of processing power. Also, I think it would be a good idea to implement a system for reporting captchas that were solved incorrectly. If you want to collaborate, let me know.

ctaity commented 3 years ago

Thanks @magicaltoast, I will take a look. Maybe we can start by resolving image captchas, hCaptcha, and reCAPTCHA v2 and v3 using a captcha resolver service (I use https://2captcha.com/), and in the future implement something cheap or free :D What do you think?

magicaltoast commented 3 years ago

@ctaity That would be cool. I would suggest implementing the captcha resolver as an interface rather than hardcoding calls to 2captcha, so people can add new providers without rewriting all the code.
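
A rough sketch of what such a provider interface could look like follows. Every name here (`CaptchaProvider`, `FakeProvider`, `CaptchaResolver`, the shape of the `challenge` object) is hypothetical and made up for illustration; none of it exists in SecretAgent. A real 2captcha provider would make HTTP calls inside `solve()`, which are deliberately left out. The optional `reportIncorrect()` hook reflects the idea raised earlier in the thread of reporting captchas that were solved incorrectly.

```javascript
// Hypothetical base "interface": each provider implements solve() and may
// override reportIncorrect() if the service accepts bad-solution reports.
class CaptchaProvider {
  async solve(challenge) {
    throw new Error('solve() is not implemented');
  }

  async reportIncorrect(solutionId) {
    // Optional hook; does nothing by default.
  }
}

// In-memory stand-in for a real service such as 2captcha.
class FakeProvider extends CaptchaProvider {
  constructor() {
    super();
    this.reported = [];
  }

  async solve(challenge) {
    return {
      id: `fake-${challenge.type}`,
      token: `token-for-${challenge.siteKey}`,
    };
  }

  async reportIncorrect(solutionId) {
    this.reported.push(solutionId);
  }
}

// The resolver depends only on the interface, so providers can be swapped
// without touching the scraping code.
class CaptchaResolver {
  constructor(provider) {
    if (!(provider instanceof CaptchaProvider)) {
      throw new TypeError('provider must extend CaptchaProvider');
    }
    this.provider = provider;
  }

  resolve(challenge) {
    return this.provider.solve(challenge);
  }

  reportIncorrect(solutionId) {
    return this.provider.reportIncorrect(solutionId);
  }
}
```

Swapping services would then mean passing a different provider instance, for example `new CaptchaResolver(new FakeProvider())` in tests and a real provider in production.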