unitedstates / contact-congress

Sending electronic written messages to members of Congress by reverse engineering their contact forms.
https://theunitedstates.io/contact-congress/
Creative Commons Zero v1.0 Universal

New metadata #1285

Open drinks opened 10 years ago

drinks commented 10 years ago

There's been ongoing discussion about flagging forms that can be successfully submitted by POSTing directly, via cURL or some other means, to get some speed gains, and since it's becoming apparent that some forms are broken in ways all their own, now is as good a time as any to approach them. I could see utility in adding two fields alongside bioguide, before contact_form:

- notes, for listing caveats/comments in unstructured text, and
- environments, where the value could be an array of javascript, curl, dom or similar keywords.

The latter would serve to indicate the circumstances under which a form is able to work: the expectation would be that curl forms can be submitted successfully as a direct request, dom forms can be driven through a non-JS HTML interface such as Mechanize, and javascript forms would need a PhantomJS-like environment. I'm not at all sold on these names, so suggest away.
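To make the proposal concrete, a member file with the two proposed fields might look something like this. This is only a sketch: the bioguide value, the note text, and the exact keyword spellings are invented here, and only the notes/environments field names come from the suggestion above.

```yaml
# Hypothetical sketch of the proposed metadata; nothing here is final.
bioguide: X000000   # placeholder bioguide id
notes: "Free-text caveats about this form would go here."
environments:
  - curl        # form accepts a direct POST request
  - dom         # form works through a non-JS HTML driver such as Mechanize
  - javascript  # form requires a PhantomJS-like environment
contact_form:
  # ...existing instructions unchanged...
```

An implementer could then check environments before choosing a driver, falling back to the heaviest one it supports.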

And since there's a chance this thread could turn toward multiple instruction sets per form, I'd like to preemptively downvote that idea unless it becomes absolutely necessary; my hope is that the current schema can stay as close as possible to simple manipulation of standard form elements, allowing each type of end user to interpret it however works best for them.

So, thoughts?

akosednar commented 10 years ago

Wouldn't the environments field just add more unneeded parts to the system? Something that works via PhantomJS should also work via cURL or another means, since it simulates how a user actually uses the form.

On this topic, though, I think a better solution for broken forms would be to allow injecting JavaScript into the page. I know we're simulating a user clicking an object, but sometimes adding a quick JavaScript snippet to the page (for example, adding a unique id to each form field based on its label) would help get over the humps of any issues while preserving authentic user interaction as much as possible.

drinks commented 10 years ago

Well taken, but perhaps it's important to reinforce that congress-forms is a client for this data, not the client. There should be no assumptions made about the integrating system other than those dictated by the realities of the forms themselves. So it's not really safe to assume PhantomJS is present at all (Formageddon uses Mechanize), and I don't feel that indicating possible environments really muddies the waters on integration, unless you're unable to support one of them and it's needed. It's useful for me to know if a form requires JS, for example, because it means I can save some CPU time and go straight to fax. @Hainish, you've been wanting a flag for POST-able forms; any feelings about this?

akosednar commented 10 years ago

True! Makes sense. I guess you don't want to limit whoever might eventually use this dataset.

Hainish commented 10 years ago

I could imagine a case where a javascript instruction is not strictly necessary to fill out a form, but very preferable. For instance: say a form, to deter spammers, uses JavaScript to measure how long you spend on the page filling out fields. If you're on the page for less than, say, 20 seconds, it marks the submission as spam and refuses to deliver the message. Obviously, from a spam mitigation perspective, throttling can happen on the server side as well, and that would be a more effective measure, but just imagine it's being done in JS.

In that case, on the implementation side we would either have to wait 20 seconds (keeping the PhantomJS process running and allocated in RAM for that much longer) or game the system by using JavaScript to fast-forward the clock 20 seconds and fill in the form normally. I don't think any form is actually doing anything like this, but it's possible we'll come across one, and we shouldn't limit ourselves to just UX interactions simply because we should be acting like end users.

@drinks I think an environments field in the YAML is fine. We could implement something like this in Typhoeus, and I think it would be much quicker than PhantomJS. I can see a big gain from specifying POST request fields for even a relatively small number of forms: at EFF I imagine a large minority of the form fills will go to the senators from California and New York, so by implementing this for just those four senators we could reduce load significantly.
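For a form flagged as curl-capable, the file could in principle carry enough information for a direct POST with no browser in the loop. A strawman of what that might look like, where the URL, the fields key, and every field name are invented for illustration (the real schema's step instructions are not shown):

```yaml
# Invented example of a form that environments marks as directly POST-able.
bioguide: Y000000   # placeholder bioguide id
environments:
  - curl
contact_form:
  method: POST
  action: "https://example.senate.gov/contact/submit"  # hypothetical URL
  fields:                                              # hypothetical key
    first_name: $NAME_FIRST
    last_name: $NAME_LAST
    email: $EMAIL
    message: $MESSAGE
```

A Typhoeus- or cURL-based client could submit such a form with a single HTTP request instead of spinning up a headless browser.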

drinks commented 10 years ago

@Hainish To me, that is a perfect case for a manual notes entry or a similar mechanism. It's a pretty slippery slope from there to releasing the entire grammar of JavaScript into what I hope can remain a small, simple set of instructions, keeping the amount of crazy an end user has to account for just to get started to a minimum.

Once we reach a degree of complexity where a file can't be dropped into an arbitrary existing (compliant) system and be expected to work, there's no longer an advantage in trying to embed those instructions in a structured format versus just plain notes: it has to be touched by hand and accounted for in the implementing tool either way. Does that sound crazy? Every pattern we attempt to encapsulate or automate in script is another layer of complexity, and my inclination is to KISS on this end and override quirks in the submitter. For example, I have a special clause in Formageddon for dealing with Gillibrand's radio-button topic selector, and I imagine you do too.

One thing that just occurred to me as a possible means of communicating edge cases is a set of 'exception codes', maybe in a caveats array, indicating any weird behaviors that occur widely enough to be considered a pattern. If the implementer's tool has a code path for dealing with a given type of quirk, it could be invoked automatically when that code is seen. Non-patternable quirks could remain in plain text, perhaps. Thoughts?
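As a strawman, the caveats array could carry agreed-upon codes alongside the free-text notes field. The codes below are invented for illustration, loosely based on the quirks mentioned in this thread (the radio-button topic selector, the hypothetical JS timer):

```yaml
# Strawman only; these exception codes do not exist anywhere yet.
bioguide: Z000000   # placeholder bioguide id
caveats:
  - radio_topic_selector   # topic chosen via radio buttons, not a select
  - js_timer_check         # page measures time-on-page in JavaScript
notes: "Anything too one-off to deserve a code stays in plain text here."
contact_form:
  # ...existing instructions unchanged...
```

A tool that recognizes radio_topic_selector could invoke its special-case code path automatically; one that doesn't could at least surface the note to a human.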