Web service - Githubissues

sampottinger commented 11 years ago

There is interest in creating a separate web service to serve a language-agnostic API for querying TRACER data, an interface hopefully with higher level filtering and sorting constructs. Ideas?

trinary commented 11 years ago

In general, I strongly advocate for discoverable, hypermedia REST APIs for these sorts of things. So for example the root API endpoint should be a catalogue of targets that one can follow links to, and links between resources are done with full URLs (rather than ids). Example: https://api.github.com/:

{
  "current_user_url": "https://api.github.com/user",
  "authorizations_url": "https://api.github.com/authorizations",
  "emails_url": "https://api.github.com/user/emails",
  "emojis_url": "https://api.github.com/emojis",
  "events_url": "https://api.github.com/events",
  "feeds_url": "https://api.github.com/feeds",
  "following_url": "https://api.github.com/user/following{/target}",
  "gists_url": "https://api.github.com/gists{/gist_id}",
  "hub_url": "https://api.github.com/hub",
  "issue_search_url": "https://api.github.com/legacy/issues/search/{owner}/{repo}/{state}/{keyword}",
  "issues_url": "https://api.github.com/issues"
...etc
}

and in https://api.github.com/public_gists, for example:

  {
    "url": "https://api.github.com/gists/6126390",
    "forks_url": "https://api.github.com/gists/6126390/forks",
    "commits_url": "https://api.github.com/gists/6126390/commits",
    "id": "6126390",
    "git_pull_url": "https://gist.github.com/6126390.git",
...etc

Links can be followed directly by the client, without knowing anything more about the system. The primary identifier for each resource is its uri.

They also use URI Templates (http://www.rfc-editor.org/rfc/rfc6570.txt) to specify fields inside a link, there are a bunch of implementations that help fill these in, very helpful for telling a client how to make a request, as well as showing which requests accept query parameters and what those look like.

Another thing we can consider is adopting a standardized JSON structure to make writing clients easier. http://jsonapi.org/format/ is a recent entry into that field, there are others that I'm not as familiar with (more research required). The idea behind these is to separate metadata and links from properties of a resource, so a client can use links and metadata without having to know very much about the domain. Again, I'd strongly advocate for the URI-based approach, rather than listing IDs.

Thoughts?

sampottinger commented 11 years ago

Hello!

Thanks for getting that discussion started. I would like to add a few things from a user-centered design perspective.

Nature of the data

The API's presentation of the state-produced TRACER dataset is necessarily read-only.
The data consists of three classes (contributions, expenditures, loans reports) with many attributes (29, 28, and 26 respectively).
Each class has more instances than LibreOffice will open at one time (ex. expenditure data 2013).
Every instance has a unique ID (RecordID) and participate in an optional many-to-one relationship (AmendedRecordID).

Nature of the audience

A recent data science meetup discussed TRACER. Informally, the audience consisted of those working in general purpose languages but R and MATLAB came up more than once during discussion that evening.
A upcoming hack-a-thon may use this API and would see a similarly eclectic group.

Case study and use-cases

Possible case studies include http://ccf.balancedcommunity.com/ and http://tracer.sos.colorado.gov/PublicSite/homepage.aspx itself.
Both services manipulated the dataset through geographic information.
Both filter, sort, and group by amount, donor attributes (if applic.), and committee.

Value necessarily added by an API

Fast programmatic access to the dataset.
Service for filtering the dataset.
Exposure in a "friendly" format.
Discoverability of the dataset's features.

Reference datasets of similar nature

Twitter Search API (https://dev.twitter.com/docs/using-search)
Census API (http://www.census.gov/developers/data/)
Flight Aware (http://flightaware.com/commercial/flightxml/)
Starts to Denver Streets API (https://github.com/colorado-code-for-communities/denver_streets)

Thoughts We should strive for simplicity and conformity without sacrificing the added values of friendly programmatic access and filtering. However, we should also remember the simplicity of this high-attribute but shallow hierarchy data. Of course, given the summary-focused case studies, accessing individual records seems secondary to searching.

I agree with @trinary's excellent comments and I second the motion to adopt http://jsonapi.org/format/. Furthermore, given the reference datasets, URI Templates's support for form-style query expansion (RFC 6570 3.2.8) would be expected by a community familiar with the Twitter Search and Census APIs.

sampottinger commented 11 years ago

Ha ha! More thoughts? Seems like we are headed in a good direction. If that hack-a-thon goes through, there may be a push to start development on the API service soon. I will post a slightly more formal proposal in short order to get the ball rolling.

That being said, I don't want to get hung up on technology decisions too much right away but I also had a discussion with @nmcclain a while ago and it sounds like Node's non-blocking IO model fits the short lived IO / DB intensive requests for the API's DB-intensive usage cases (http://nodeguide.com/convincing_the_boss.html). Any dissent to writing the service in Node?

trinary commented 11 years ago

None at all from me. My go-to database is PostgreSQL, which is well supported in Node. Everything from basic query capability (I've used pg https://npmjs.org/package/pg) to full-blown ORM layers are available. Express.js is the only routing/middleware layer I have any familiarity with, but it seems to do the job well.

sampottinger commented 11 years ago

Yea, Postgres is typically my go-to as well. However, there were some design considerations that came up in an earlier external conversation with @nmcclain. I have copy/pasted part of that discussion below.

Thoughts on databases I was reflecting on our databases discussion. While this analysis is probably premature, our small dataset, (likely) consistent shallow hierarchy schema, and (possibly) flexible querying requirements all initially sounded SQL-friendly. However, we also don't need ACID-compliancy and our infrequent writes (without consistency requirements) seem to favor something like Mongo with friendly indexing based on log analysis.

Admittedly, DB admins / architects could probably teach me more than a thing or two. However, I have experience with programming for Postgres and Mongo alike so I am fine with either. :-) What do you think?

GIS While a slight oversimplification, the Internet seems to agree that we ought to stay close to PostGIS or MongoDB’s Geospatial Indexing if we decide to offer something along those lines[1][2]. I think you were already aware of the situation but I thought I would forward links on anyway.

Links [1] http://ralphbarbagallo.com/2011/04/02/an-overview-of-geospatial-databases/ [2] http://goo.gl/o83erJ

@nmcclain effectively responded that either would be fine (and for a project with likely such a small code base, we are at the prototyping stage so we should just choose / go). He also wisely observed that we will probably need to revisit design decisions under heavy usage anyway... if we are lucky to get that someday.

Anyway, I am conflicted on this but, if we cannot find a compelling reason for Postgres (I know, Bob Warfield, but hear me out) I suggest Mongo because:

We are in the realm of storing facts not user data - see transactions and "scientific data".
Eventual consistency fits our update model (aggregator or aggregators on the backend updating a front-end serving DB). We won't benefit from SQL DBs' ACID and, in the fullness of time and blessing of high usage, we would prefer the availability and partition tolerance part of CAP over the consistency.
We have geospatial data and Mongo's built-in geospatial support makes using hosted solutions easier if we decide to go that direction. It works out of the box for Mongolab, Mongohq but Postgis only becomes available with more expensive Heroku Postgres plans for example. The setup process is simplified as well (good friends at denver_streets ran into this https://github.com/colorado-code-for-communities/denver_streets/blob/master/initalize_environment.py).
Thanks to loose coupling, saving "documents" into Mongo could easily become rows into Postgres later if we want to replace that layer. We are in Node after all! Still, the former is available out of the box and the later takes some time.

Anyway, thoughts? Also, thanks again for your input.

nmcclain commented 11 years ago

Thanks @Samnsparky and @trinary for getting things kicked off and for the excellent insights and recommendations!! I wholeheartedly agree with the thoughts around API design, support using Node for the front-end, and lean towards MongoDB (but am also open to Postgres/PostGIS).

I have cloned a copy of this repo and have successfully played with pycotracer.get_report as shown in the README. That is super exciting!! As a next step I am working on a staging/CI server that will update itself when new commits are made.

I would like to go ahead and register a domain for this project, so we can offer SSL at some point - I am thinking of getting pycotracer.org but would love suggestions/feedback on this!

sampottinger commented 11 years ago

Hello! Thanks for trying out the library. I think the domain is a great idea but, from a nomenclature standpoint, I was envisioning pycotracer (Python colorado TRACER micro-library) as the Python library separate from the web service. Speaking of, we should probably start another repo for that after we come up with a name. Anyway, I am certainly open for about anything but something that conveys:

TRACER
Colorado
API
Open (although, that is frequently implied in .org TLD)

Maybe something like:

co-tracer.org or cotracer.org
open-co-tracer.org or opencotracer.org
co-tracer-api.org or cotracerapi.org
open-co-tracer-api.org or opencotracerapi.org
tracer-api.org or tracerapi.org

I kinda like cotracerapi.org / co-tracer-api.org but, again, I am certainly open to suggestion and I am not sure how people feel about dashes. Thanks!

trinary commented 11 years ago

Regarding naming, I like those ideas being communicated, but I worry that most people won't know or care that TRACER means campaign finance in Colorado. An alternative would be to say that colorado's data is at co.opencampaigndata.org or similar to make the purpose of the API front and center. We can be very clear in the documentation what the data source is. Just some food for thought, I'm absolutely fine with the direction Sam is heading if we decide that's how best to name it.

Databases: I've been burned by Mongo a few times in the bad old days of version 1.4 (a few years ago I helped found a startup that auto-provisioned Mongo replica sets on EC2, MongoHQ style and did a lot of DevOps work there). That said, a nearly-no-writes scenario is much easier to deal with than anything I ever had fall over, as long as we can always reconstruct the database from the source material. My judgement may be clouded by my own anecdotal evidence here. :smile:

sampottinger commented 11 years ago

@trinary, thanks for the insight. We can keep that DB discussion going if more thoughts come up but, given the direction of the conversation, would it be OK to assume Mongo for now? Also, thanks for adding the point about being able to reconstruct the database from source material. I suppose that the state offers a certain level of redundancy for us. :)

As for names, I would be happy to step back from TRACER. I second co.opencampaigndata.org.

sampottinger commented 11 years ago

Hello again!

While I like opencampaigndata.org's implied invitation for other states and sources to jump on-board and offer up APIs, I wonder if current and future TRACER-like systems elsewhere offer more, less, or entirely different data? While that ambition might be getting ahead of ourselves, there may be non-technical hinderances to a standardized API.

Of course, the subdomain could delineate different APIs but standard practice and prevailing literature seems to dictate putting the docs and API endpoints off of subdomains like api.opencampaigndata.org and developer.opencampaigndata.org (Mulloy 23). I suppose we could use co-api.opencampaigndata.org and co-docs.opencampaigndata.org but that feels odd in an ecosystem of dev.twitter.com and developer.facebook.com. So, how do you all feel about a domain name that limits the scope to Colorado? Something like cocampaigndata.org => api.cocampaigndata.org and developer.cocampaigndata.org? Not sold on cocampaigndata.org and, of course, open to suggestion.

I dunno... thoughts? Thanks!

nmcclain commented 11 years ago

Another option would be to use subdomains for each state. So co.opencampaigndata.org => api.co.opencampaigndata.org, developer.co.opencampaigndata.org, docs.co.opencampaigndata.org...

If we were to go that route, we could use a redirect or "index" page to help people navigate to the correct state's dataset. It seems like this would be compatible with @trinary's comments about having a discoverable API as well (api.opencampaigndata.org could provide links to the individual state APIs even if they were not consistent between states).

Ultimately I don't have a strong preference (you have had LOTS of great domain ideas @Samnsparky!), but my finger's on the "buy" button and ready to pull the trigger as soon as we settle on something.

sampottinger commented 11 years ago

@nmcclain, I agree. My apologies... I think I was missing the obvious resolution there. api.co.opencampaigndata.org would be great. Thanks!

nmcclain commented 11 years ago

I have ordered opencampaigndata.org.

Perhaps this needs to be formalized at some point, but I want to be super clear that my intentions are to donate this domain purchase to this community project. I will help manage the domain while I'm actively involved in the project, but I'm committed to transferring control of the domain to an active developer if I ever leave the project.

sampottinger commented 11 years ago

Great! Thanks for doing that @nmcclain.

Also, @trinary, I am sorry... I want to make sure I didn't shut down the conversation about DBs prematurely. What were some of the things that went poorly with Mongo? Are there some things you ran into that we could avoid if we go with Mongo for this project? Are there any issues bad enough that you think should push us away from Mongo?

Thank you both again for everything. This is very exciting!

trinary commented 11 years ago

Sorry, I should have posted to settle the issue. I'm ok with using Mongo, everything I've had go wrong dealt with high write volume or sharding. I have lost valuable data in those scenarios, though.

sampottinger commented 11 years ago

@trinary, thanks for meeting IRL. Looking forward to collaboration.

Anyway, @trinary and I looked over an API spec and reached agreement. I will post a formal copy of that API spec soon. However, to keep development going, a new repository has been created at https://github.com/Samnsparky/co_opencampaigndata for this API service. So, I think that repository supersedes this thread. I look forward to continuing the discussion there.

Thanks!

sampottinger / pycotracer

Web service #1