unitedstates / contact-congress

Sending electronic written messages to members of Congress by reverse engineering their contact forms.
https://theunitedstates.io/contact-congress/
Creative Commons Zero v1.0 Universal

RE-CAPTCHA issues #1434

Closed j-ro closed 8 years ago

j-ro commented 9 years ago

Anyone else having Re-Captcha issues recently? I know this is something we've feared...

I'm not seeing the new style ones on any forms yet, but the old style ones seem to be returning nearly impossible images through phantom-js, like this:

https://can2-dev.s3.amazonaws.com/development/captchas/1b0209b62f9a37d9adc286fde8.png

It's basically gibberish, and I can never get it right. But, when I load the page directly (in this case, http://mcclintock.house.gov/contact/email-me, zip 95746, 8501), the captcha I get is much more human friendly every time, making me think Re-Captcha knows phantom-js isn't human and is serving nearly impossible captchas on purpose.

Anyone else see this?

crdunwel commented 9 years ago

Rand Paul seems to have started to use the new style Re-Captcha ....

http://www.paul.senate.gov/connect/email-rand

Do we have a way of handling these yet?

j-ro commented 9 years ago

Oy...I thought he had taken such a step forward by taking the captcha off his form...

What happens if you click the div to check that box in yaml? Does it think we're a spammer?

crdunwel commented 9 years ago

Figured I'd ask first in case somebody has cooked up something clever. Let me try that now.

j-ro commented 9 years ago

Nope, this is the first. We were kind of fearing this day...

sinak commented 9 years ago

:pray: :pray: :pray: :pray:

crdunwel commented 9 years ago

So .recaptcha-checkbox-checkmark is in an iframe - possible to select elements within iframes?

j-ro commented 9 years ago

hm....no, don't think it is. Sina?

sinak commented 9 years ago

I'm not sure. How about a snippet like this one: https://github.com/unitedstates/contact-congress/blob/7c82181c94d8a4aa11a79bf560bf71c439622bde/support/recaptcha-noscript.yaml

sinak commented 9 years ago

That looks like it's actually visiting the URL, not sure if it'll work.

j-ro commented 9 years ago

Yeah, that's not going to work. Maybe iframes will work, not sure -- I mean, we are running a browser, and browsers can hit iframes themselves, but really not sure how that's all going to work when passed through javascript. I doubt it'll work though...

crdunwel commented 9 years ago

Too bad his site doesn't even work without javascript...

@j-ro we are running a browser so it should be possible to do this somehow. Even if the click doesn't work initially, we can grab the captcha image(s) that come up and send them to the user to be solved. Phantom of the capitol might need to be modified to handle this case though....

drinks commented 9 years ago

Hi folks!

FWIW, it looks like this might be solvable by grabbing the k query param from the iframe source and supplying it in a POST request to https://www.google.com/recaptcha/api2/userverify from the same browser session (so, for example, https://www.google.com/recaptcha/api2/userverify?k=6Ld4GgUTAAAAAKvWFV6QlIupzvQ2_nYZt3WkYHTq after visiting the form).

The response should be unparseable json that you can string manipulate into an array where index 1 is the value you should dump into the hidden textarea immediately after the iframe (you'll probably have to unhide it before you can select/manipulate it).

I don't have a test env configured, but it looks at a glance like that may do it. There's probably some other factor though bc that seems too easy--no images or anything!
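
A minimal sketch of the two string-handling pieces of that idea, with no network calls. It assumes the iframe URL carries the site key in a `k` query parameter (as described above) and that the "unparseable" part of the response is a `)]}'` prefix in front of an otherwise valid JSON array. The prefix, the sample iframe URL, and the sample response body are assumptions for illustration, not captured traffic.

```python
import json
from urllib.parse import parse_qs, urlparse

# Assumed prefix that makes the response body invalid JSON as served.
ASSUMED_PREFIX = ")]}'"


def extract_site_key(iframe_src: str) -> str:
    """Return the `k` query parameter from the reCAPTCHA iframe URL."""
    query = parse_qs(urlparse(iframe_src).query)
    return query["k"][0]


def extract_token(raw_body: str) -> str:
    """Strip the assumed prefix, parse the rest as JSON, return index 1."""
    body = raw_body.strip()
    if body.startswith(ASSUMED_PREFIX):
        body = body[len(ASSUMED_PREFIX):]
    return json.loads(body)[1]


# Hypothetical inputs: the URL path and the response layout are made up.
iframe_src = (
    "https://www.google.com/recaptcha/api2/anchor"
    "?k=6Ld4GgUTAAAAAKvWFV6QlIupzvQ2_nYZt3WkYHTq&hl=en"
)
sample_body = ")]}'\n[\"uvresp\", \"EXAMPLE_TOKEN\", 1, 120]"

site_key = extract_site_key(iframe_src)
token = extract_token(sample_body)
```

Whether the token obtained this way would actually be accepted by the form is exactly the open question raised above; the sketch only covers the parsing.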

j-ro commented 9 years ago

huh...interesting. How would we do that in yaml? Probably can't right now, would require something like https://github.com/EFForg/phantom-of-the-capitol/issues/51

Or https://github.com/unitedstates/contact-congress/issues/1421 so we can select iframes...

j-ro commented 9 years ago

and yes, @crdunwel, if we could click that checkbox, even if we're thought to be a spammer then we can get the captcha image after a wait or something, though it may all still be in an iframe...

drinks commented 9 years ago

probably would need some set of captcha steps specific to this version of recaptcha that just attempts to solve it automatically right before submit, but you may still be SOL if you can't manipulate the contents of the frame and are presented with an image to solve..

j-ro commented 9 years ago

I mean, we could probably even get by with a click command that clicks a certain coordinate relative to the parent -- we could ignore the iframe entirely and just click on some pixel offset relative to the .g-recaptcha div that surrounds the iframe. Then we could take a picture of the iframe parent div to get the captcha image.
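
A rough sketch of the coordinate arithmetic behind that idea, assuming the runner can read the bounding box of the `.g-recaptcha` div. The `Rect` type, the box dimensions, and the offsets are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Bounding box of the container div, in page pixels."""
    left: float
    top: float
    width: float
    height: float


def click_point(container: Rect, offset_x: float, offset_y: float):
    """Page coordinates of a point offset from the container's top-left corner."""
    inside_x = 0 <= offset_x <= container.width
    inside_y = 0 <= offset_y <= container.height
    if not (inside_x and inside_y):
        raise ValueError("offset falls outside the container")
    return (container.left + offset_x, container.top + offset_y)


# Hypothetical numbers: a widget box and an offset aimed at the checkbox.
widget = Rect(left=100, top=400, width=304, height=78)
point = click_point(widget, 28, 39)
```

The same `Rect` could also serve as the crop region when taking a picture of the parent div to capture the challenge image.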

crdunwel commented 9 years ago

@drinks Wow, could circumventing new style google recaptcha really be that easy? Have others written about this strategy?

@j-ro Yeah, we'd need a way to find selectors in iframes. This seems like it should be possible but @Hainish would probably know more about it.

Either way it seems like we'd need captcha specific steps for this version of recaptcha like @drinks suggests.

j-ro commented 9 years ago

@Hainish or @drinks, does anyone have time to work on this? @crdunwel, you too if you want to hack on phantom-dc. My team might, but might not be for a week or so...

kevdev424 commented 9 years ago

Just an FYI, here's an example of how we handled another recaptcha form.

We implemented something similar to this issue brought up a while back. This strategy has been working like a charm, it's just a matter of moving between iframes.

Now, trying to handle the case where the new recaptcha serves up a list of pictures to choose from instead of entering in text is a separate problem...

j-ro commented 9 years ago

Can you contribute your iframe code back to the phantom-dc repo?


kevdev424 commented 9 years ago

We stopped using congress-forms a little while back and ended up writing our own implementation in C# using phantomjs. It's in the process of being released to the public while we use it, but since it's in C#, no one probably wants it anyway...

However, here's a snippet of what's going on to move between iframes and the parent frame; it's supported in any language:

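// Step back out: return to the top-level window.
if (iframe.Back)
{
    driver.SwitchTo().Window(driver.WindowHandles.First());
}
else
{
    // Step in: find the iframe element by its CSS selector and switch to it.
    var element = driver.FindElement(By.CssSelector(iframe.Selector));
    driver.SwitchTo().Frame(element);
}

j-ro commented 9 years ago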

Ah cool. Well this is useful too!


kevdev424 commented 9 years ago

I updated the issue I created a bit if we want to settle on a syntax for describing this action in a yaml, and we're open to other suggestions on syntax or how to handle this problem overall. But weaving in and out of iframes has worked so far.

j-ro commented 9 years ago

generally seems fine to me -- not sure we actually need the iframe1/parent1 thing, as in, since you can only move in and out of iframes, you probably don't need to give them numbers. So, you move in and out of iframes and use the parent command to go up one level. Like:

    - iframe:
      - selector: "[title='recaptcha widget']"
        # now we're in iframe context
    - click_on:
      - value: recaptcha
        selector: ".recaptcha-checkbox"
    - iframe:
      - back: true
        # now we're back to parent
    - wait:
      - value: 2
    - iframe:
      - selector: "[title='recaptcha challenge']"
        # back to iframe context
    - fill_in:
      - name: recaptcha_response_field
        captcha_selector: "body img"
        value: $CAPTCHA_SOLUTION
        selector: "#default-response"
        required: true
    - click_on:
      - value: recaptchaSubmit
        selector: "#recaptcha-verify-button"
    - iframe:
      - back: true
        # and back to parent

Potentially, this could do nested iframes, just call the iframe command with a selector twice to move down two levels, and back twice to move back up.

But maybe I'm missing a use case or implementation detail there.
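
One way to picture the in-and-back model is a stack of frame selectors: a selector pushes, `back` pops, and every other step runs in whatever frame is on top. The sketch below is only an illustration under simplifying assumptions. Each step is flattened to a one-key dict with a plain params dict (the yaml above wraps params in a list), and `run_steps` is a hypothetical name, not part of any existing runner.

```python
def run_steps(steps):
    """Walk the steps and record which frame context each action runs in."""
    frame_stack = []  # selectors of the iframes entered so far, outermost first
    log = []
    for step in steps:
        (action, params), = step.items()  # each step has exactly one key
        if action == "iframe":
            if params.get("back"):
                if not frame_stack:
                    raise ValueError("iframe back with no frame entered")
                frame_stack.pop()  # go up one level
            else:
                frame_stack.append(params["selector"])  # go down one level
        else:
            log.append((action, tuple(frame_stack)))
    return log


steps = [
    {"iframe": {"selector": "[title='recaptcha widget']"}},
    {"click_on": {"selector": ".recaptcha-checkbox"}},
    {"iframe": {"back": True}},
    {"wait": {"value": 2}},
    {"iframe": {"selector": "[title='recaptcha challenge']"}},
    {"click_on": {"selector": "#recaptcha-verify-button"}},
    {"iframe": {"back": True}},
]
log = run_steps(steps)
```

Because the context is a stack, calling `iframe` with a selector twice nests two levels deep and two `back` steps return to the top, without numbering the frames.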

kevdev424 commented 9 years ago

I only added the name property to iframe because it seemed like a helpful identifier, but it isn't needed to do anything. But yeah, there was an intention to be able to go N deep into iframes if necessary.

j-ro commented 9 years ago

Cool, either way seems good.

I'd suggest this is probably the way to go -- iframe functionality is useful outside of the recaptcha context too.

@drinks, @Hainish, @crdunwel, if you agree, let me know if you have time to work on this. As I noted above, my team might, but wouldn't be until some time next week, so if anyone has time to tackle this sooner, would be great...

drinks commented 9 years ago

I might suggest renaming the iframe step to be more consistent with the existing syntax--since the current step names are cribbed from capybara, I'd recommend borrowing the name of their within_frame function, and literally nesting all subsequent steps below it, rather than implementing a back to step back out--just return to the main document once the within_frame steps are done. I'm afraid I've got no knowledge of the existing stack you guys are using to fill out forms, but I'm happy to provide assistance where I can. Leading any effort on this is going to be beyond my capacity though, I'm afraid.

kevdev424 commented 9 years ago

I didn't see a way to nest step types together in the existing syntax. How would you nest fill_ins and click_ons in a step such that when you were finished with them all you'd know to back out of the frame without a breaking change?

drinks commented 9 years ago

My (perhaps naive) first instinct might look something like this (edited to make a little more sense, sorry):

steps:
  - do_something:
    ...
  - within_frame:
    - selector: "#myIframe"
      steps:
        - click_on:
          ...
        - fill_in:
          ...
  - do_something_else:
    ...

There are, of course, a couple of assumptions being made, and that's where my naivete may come in. My formageddon implementation was built to ignore any steps that it didn't recognize... essentially iterating over all steps and calling whichever method matched the step name with the supplied params. If the existing state(s) of the art calls methods based on step names instead, they may need to catch exceptions where methods aren't defined. Adding support for recursion like this would require a code change, but any solution will require a code change, right? In the case where a system was unfamiliar with within_frame I'd expect it to skip to the next top-level step and ignore the nested ones. Up to the current maintainers how to proceed, but this approach feels most flexible to me.
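
A small sketch of the dispatch-by-name idea described above, assuming steps are one-key dicts and handlers are methods named `do_<step>`. The class names, the `do_` prefix, and the recorded log are hypothetical; the sketch does not reflect how formageddon or phantom of the capitol are actually written.

```python
class StepRunner:
    """Dispatch each step to a method named after it; skip unknown steps."""

    def __init__(self):
        self.frames = []  # frames currently entered, outermost first
        self.log = []

    def run(self, steps):
        for step in steps:
            for name, params in step.items():
                handler = getattr(self, "do_" + name, None)
                if handler is None:
                    continue  # unknown step: its nested steps are skipped too
                handler(params)

    def do_click_on(self, params):
        self.log.append(("click_on", params["selector"], tuple(self.frames)))

    def do_within_frame(self, params):
        self.frames.append(params["selector"])
        try:
            self.run(params.get("steps", []))  # recurse into nested steps
        finally:
            self.frames.pop()  # always return to the enclosing document


class LegacyRunner(StepRunner):
    """Stands in for a consumer that predates within_frame."""
    do_within_frame = None


steps = [
    {"click_on": {"selector": "#open"}},
    {"within_frame": {
        "selector": "#myIframe",
        "steps": [{"click_on": {"selector": "#inner"}}],
    }},
    {"click_on": {"selector": "#submit"}},
]

modern = StepRunner()
modern.run(steps)
legacy = LegacyRunner()
legacy.run(steps)
```

The legacy runner silently drops the nested click, which is the skip-to-next-top-level-step behavior described above. Whether silently skipping is acceptable for a form that needs the captcha solved is a separate design question.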

j-ro commented 9 years ago

I agree on the recursion being a bit weird in the schema. We're implementing this now, and I think we're going to do it this way:

    - iframe:
      - selector: "[title='recaptcha widget']"
        # now we're in iframe context
    - click_on:
      - value: recaptcha
        selector: ".recaptcha-checkbox"
    - iframe:
      - back: true
        # now we're back to parent
    - wait:
      - value: 2
    - iframe:
      - selector: "[title='recaptcha challenge']"
        # back to iframe context
    - fill_in:
      - name: recaptcha_response_field
        captcha_selector: "body img"
        value: $CAPTCHA_SOLUTION
        selector: "#default-response"
        required: true
    - click_on:
      - value: recaptchaSubmit
        selector: "#recaptcha-verify-button"
    - iframe:
      - back: true
        # and back to parent

So, commands move you in and out of an iframe's context.

j-ro commented 9 years ago

So, at least for rand paul, simply removing the captcha iframe from the DOM via the new javascript command works. Which is good because even with iframe access, fooling these captchas will be very hard.

We're still implementing iframes and committing back. Syntax will be something like:

    - click_on:
      - selector: "#recaptcha-anchor"
        within_iframe: ".g-recaptcha iframe"

So you can tell any command to execute within a frame defined by a selector. Stepping in and then back from iframes proved too cumbersome with how the steps are executed and stored. This way seems cleaner and just as flexible.
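
As an illustration of why a per-command frame attribute avoids stored state, here is a hedged sketch in which the command enters the frame, acts, and leaves again in one call. `FakeBrowser` is a stand-in that only records events; a real driver's API will differ, and the `click_on` helper is a hypothetical name.

```python
from contextlib import contextmanager


class FakeBrowser:
    """Stand-in for a real driver; it only records what it was asked to do."""

    def __init__(self):
        self.current_frame = None  # None means the top-level document
        self.events = []

    @contextmanager
    def frame(self, selector):
        previous = self.current_frame
        self.current_frame = selector
        try:
            yield
        finally:
            self.current_frame = previous  # leave the frame even on error

    def click(self, selector):
        self.events.append(("click", selector, self.current_frame))


def click_on(browser, params):
    """Run a click, inside the frame named by `within_iframe` if present."""
    frame_selector = params.get("within_iframe")
    if frame_selector is None:
        browser.click(params["selector"])
    else:
        with browser.frame(frame_selector):
            browser.click(params["selector"])


browser = FakeBrowser()
click_on(browser, {"selector": "#recaptcha-anchor",
                   "within_iframe": ".g-recaptcha iframe"})
click_on(browser, {"selector": "#submit"})
```

Each command carries its own context, so nothing about the current frame has to survive between steps.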

kevdev424 commented 9 years ago

@j-ro Not sure if it helps at this point but if you take a look at some of our recaptcha v2 yamls you can see how we are dealing with it. We get a lot of the 3x3 picture captchas now which can be tricky but it's basically the same strategy as the word images.

j-ro commented 9 years ago

Can you link me to an example? Paul's can be defeated by removing it, but others aren't so easy it seems. I wasn't able to find an example in your files.

kevdev424 commented 9 years ago

Yeah https://github.com/NGPVAN/contact-congress/blob/master/notready/state_ne_gov_pete_ricketts.yaml

We haven't deployed our change yet but this is the general idea. Since it's so annoying to deal with, we lumped the audio version, text version, and 3x3 picture version in the same element. Then depending on which one we get from his webpage we use the appropriate selectors.

j-ro commented 9 years ago

hm, gotcha -- can you share the relevant code for your recaptcha action implementation? Are you still taking a screenshot here and showing it to users? Or showing them full html and replaying it back somehow?

kevdev424 commented 9 years ago

We developed a different app to consume the yamls so I can't share the code just yet. But we are downloading the image via its url (didn't need to do the screenshot for these) and passing that back. Then on the front end we wrote javascript on top of that image to simulate the same experience, map those values to an array, and send that forward so our app knows which sections to click.
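
The mapping from clicks on the downloaded image to an array of sections could look roughly like the sketch below. It assumes a 3x3 grid laid evenly over the image, row-major numbering from 0, and that clicking a selected section again deselects it. None of those details are confirmed by the description above, and the function names are hypothetical.

```python
def cell_index(x, y, width, height, rows=3, cols=3):
    """Return the row-major index of the grid cell containing pixel (x, y)."""
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError("click outside the image")
    col = int(x * cols // width)
    row = int(y * rows // height)
    return row * cols + col


def selected_cells(clicks, width, height):
    """Toggle a cell on each click and return the selected indices, sorted."""
    selected = set()
    for x, y in clicks:
        selected ^= {cell_index(x, y, width, height)}  # second click deselects
    return sorted(selected)


# Hypothetical clicks on a 300x300 image; the first cell is clicked twice.
clicks = [(10, 10), (290, 150), (150, 290), (10, 10), (160, 40)]
cells = selected_cells(clicks, 300, 300)
```

The resulting list is what would be sent forward so the backend knows which sections to click inside the real challenge frame.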

j-ro commented 9 years ago

gotcha, makes sense, thanks!

j-ro commented 9 years ago

@kthayer424 question for you on recaptchas. We've made it pretty far into an implementation but we're getting stymied in certain places. Have you found that the captcha will demand multiple correct answers in a row before it lets suspicious users pass?

That seems to be what we're seeing. On the console, from a laptop IP, keeping one session alive, the captcha might demand a few image grids to fill out before it says we successfully passed. But when testing through phantom of the capitol, since it destroys sessions after each fill, we're never able to get it to pass, even after dozens and dozens of attempts.

Have you seen this behavior? We're thinking we might have to keep sessions until it passes, but we're not exactly sure if this is the issue.

kevdev424 commented 9 years ago

Personally, I've seen both. Sometimes I need to fill out several grids by hand before it will let me through, other times the first one will work. It's hard to tell if it's testing me because a lot of the images are terrible or misleading (a roast beef sandwich should not be considered a cheeseburger).

So, it should be possible to pass the captcha in one grid, but maybe it does take a less suspicious user session to make that happen. We don't have enough data yet to back that up but have gotten a captcha to pass this way using a brand new session of phantom. We're testing it more currently, so I can get back with a more concrete conclusion hopefully.

j-ro commented 9 years ago

Yeah, that's basically where we're at too. Will let you know what we find!

j-ro commented 9 years ago

So yep, we've got it working by saving the session -- you usually have to fill it twice correctly to pass. So we look to see if we passed and if we didn't, we give the user a new challenge, rather than submitting the form, killing the session, and starting over. So just FYI, you may hit that too.
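
A toy model of that retry loop, to show why keeping the session matters. `FakeCaptchaSession` is a made-up stand-in that passes after two correct answers within the same session; the threshold of two comes from the observation above, and everything else (names, the "answer is the challenge uppercased" rule) is assumed purely for illustration.

```python
class FakeCaptchaSession:
    """Stand-in for one live browser session; passes after N correct answers."""

    def __init__(self, required_correct=2):
        self.required_correct = required_correct
        self.correct_so_far = 0
        self.challenges_served = 0

    def next_challenge(self):
        self.challenges_served += 1
        return f"challenge-{self.challenges_served}"

    def submit(self, challenge, answer):
        if answer == challenge.upper():  # toy rule for a "correct" answer
            self.correct_so_far += 1

    def passed(self):
        return self.correct_so_far >= self.required_correct


def solve_in_one_session(session, ask_user, max_attempts=5):
    """Keep issuing challenges in the same session until it reports a pass."""
    for attempt in range(1, max_attempts + 1):
        challenge = session.next_challenge()
        session.submit(challenge, ask_user(challenge))
        if session.passed():
            return attempt  # only now would the form be submitted
    return None


# Same session kept alive: the second correct answer gets through.
attempts = solve_in_one_session(FakeCaptchaSession(), ask_user=str.upper)

# Fresh session per attempt: progress is lost every time, so it never passes.
fresh_passed = any(
    solve_in_one_session(FakeCaptchaSession(), str.upper, max_attempts=1)
    for _ in range(5)
)
```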

gcosta commented 9 years ago

We also force a user agent that always loads the image grid without the first click, so you don't have to deal with multiple types of challenges.

To put it into more technical terms, I changed the behavior of the form_fill_handler thread. The existing system took into consideration that in any form, a captcha would only have to be solved once. Which means that after getting a captcha url and running the fill-out-captcha method, it would run the thread to the end and then kill it, which means that multiple $CAPTCHA_SOLUTION steps or my recaptcha step that retries until it's solved would simply hang on the webserver since the thread would not go to a waiting state again.

https://gist.github.com/gcosta/fefd2871991362d829ea

This snippet explains my solution for keeping multiple captcha attempts within the same browser session.

gcosta commented 9 years ago

Oh, and we also had to change the fill-out-captcha action on the main controller. We don't delete the form_hash uid that stores the thread anymore if the response status is captcha_needed. Without that uid we can't go back to that session.

We'll try to clean up and organize the prototype into a pull request soon, so you'll be able to take a look at all these changes. Amazingly enough, the changes to the fill handler and the ones to the recaptcha step haven't broken any existing tests.
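
For readers who can't open the gist, here is a rough sketch of the general pattern described above, written from the description alone and not taken from the actual change. A worker thread goes back to a waiting state after each captcha attempt, and the registry keeps the uid for as long as the status is captcha_needed. The `FormFill` class, the queues, and the "right" answer rule are hypothetical.

```python
import queue
import threading


class FormFill:
    """Worker that blocks for each captcha answer instead of running to the end."""

    def __init__(self, required_correct=2):
        self.answers = queue.Queue()
        self.statuses = queue.Queue()
        self.required_correct = required_correct
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        correct = 0
        while correct < self.required_correct:
            self.statuses.put("captcha_needed")
            answer = self.answers.get()  # waiting state until the user replies
            if answer == "right":  # toy rule for a correct solution
                correct += 1
        self.statuses.put("success")

    def status(self):
        return self.statuses.get(timeout=5)


sessions = {}  # uid -> FormFill, playing the role of the form_hash lookup


def start(uid):
    sessions[uid] = FormFill()
    return sessions[uid].status()


def fill_out_captcha(uid, answer):
    fill = sessions[uid]
    fill.answers.put(answer)
    status = fill.status()
    if status != "captcha_needed":
        del sessions[uid]  # only forget the session once it is finished
    return status


results = [start("abc")]
for answer in ("wrong", "right", "right"):
    results.append(fill_out_captcha("abc", answer))
```

Deleting the uid on the first response would make the later calls fail with a lookup error, which matches the problem described: without the uid there is no way back to that session.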