stephenholzman / tidyusafec

An R wrapper for the OpenFEC API that features tidy cleaning.
https://stephenholzman.github.io/tidyusafec/

tidygov? (daring to dream a bit) #11

Open stephenholzman opened 6 years ago

stephenholzman commented 6 years ago

As the FEC API uses the shared data.gov network API key, it might make sense to have a place to air thoughts on a hypothetical organizational structure for a larger ecosystem of data access packages. Since R and the tidy approach are international, thoughts about package naming can be lumped in here too.
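A minimal sketch of what sharing one key across a hypothetical tidygov ecosystem could look like. The `DATAGOV_API_KEY` environment variable name and the `tidygov_api_key()` helper are assumptions for illustration, not an existing API:

```r
# Hypothetical convention: every tidygov package reads the same
# environment variable, so one data.gov key works for FEC, Census, etc.
# Set it once per session (or put it in .Renviron for persistence).
Sys.setenv(DATAGOV_API_KEY = "your-data.gov-key")

# A helper each wrapper could reuse to look up the shared key.
tidygov_api_key <- function() {
  key <- Sys.getenv("DATAGOV_API_KEY")
  if (identical(key, "")) {
    stop("No data.gov API key found. Set DATAGOV_API_KEY with Sys.setenv().")
  }
  key
}
```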

Motivation for this is that every class I ever had was about navigating government data portals instead of going into the documentation and limitations of the data itself. This will not stand, man.

  1. If tidygov were to come into existence, it should be international. tidy[country-abbreviation][org-abbreviation] should be the naming convention. So this would be 'tidyusafec'. tidycensus might become 'tidyusacensus'. cancensus might become 'tidycancensus'.

  2. The burden of navigating between packages in the hypothetical tidygov universe should be minimized. The problems we want to solve are aligned with the general tidyverse: the difficulty of retrieving data is too high, the difficulty of wrangling data is too high, and the difficulty of replicating analysis strategies across time/geography/topics is too high.

This is all way out of scope for tidyfec, except for maybe renaming it tidyusafec. Just the best place to record thoughts for now.

stephenholzman commented 6 years ago

Researching the landscape of other wrappers that intend to return tidy results from gov data. There's a variety of approaches and styles, with a lot of influence coming from the particular API each one works with.

If this is going to work, and partly to distinguish the efforts here, the goal is to make things as consistent as possible for users. Always worried about an xkcd-927 type situation.

As justification: ideally, all data providers would adopt a consistent approach to their APIs or data portals. Totally unrealistic. The next best thing would be consistent wrappers. Wrappers are easier to implement and do not require organizations to agree on a standard API style. Wrappers are nimble. The next decade will undoubtedly see upheaval as orgs modernize, so perhaps wrappers are best suited to quickly bring about a more harmonious experience working with data from different sources.

In parallel with working on tidyusafec, I think developing a well-defined wrapper style guide would be beneficial. Definitely a bit ambitious, but folks can buy in naturally if this truly is the best approach (it would help if I actually succeed at writing software and demonstrate the potential utility).

stephenholzman commented 6 years ago

If source APIs are targeted at a wide range of developers who mostly want hierarchical data, wrappers should give analysts who mostly want relational data a similar ability to scale. My working proposals for a wrapper standard:

  1. The top level goal is to ease the burden of data retrieval, wrangling, and replication for analysts using R given the assumption the future of science is more collaborative, inclusive, cross-disciplinary, and transparent.
  2. Wrapper libraries exist at a 1:1 ratio for organizations. If that organization publishes multiple APIs, the wrapper accesses them all.
  3. Libraries are named tidy[country-abbreviation][organization]. For example, tidyusafec, tidycancensus, tidyusabls, etc.
  4. Data output is tidy by default.
  5. Input arguments for at least topic, geography, and time are included as separate arguments when applicable, and are created even when the API does not support them directly or bundles them together.
  6. Other options an API makes available are included when possible.
  7. Featured functions start with a verb, are all lower case, and use underscores for spaces. get and search are the most frequently used verbs. Results of a search function should be pipe-able into applicable get functions, and results of get functions should be pipe-able into tidyverse functions like filter and mutate (see the first sketch after this list).
  8. Provide original data documentation in the param descriptions when possible. Link, link, link to original documents. Give jargon-free interpretations in comment attributes on tidy data frame variables (see the second sketch after this list).
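To make item 7 concrete, here is a hedged sketch of the pipe-able flow. search_candidates() and get_candidate_totals() are modeled on functions in this package, but the argument names and the column used in filter() should be read as illustrative assumptions rather than a settled API:

```r
library(tidyusafec)
library(dplyr)

# search_* returns a tidy tibble of matching candidates; get_* accepts
# that tibble and adds fundraising totals; the result flows straight
# into ordinary dplyr verbs.
va_senate_receipts <- search_candidates(state = "VA", office = "S") %>%
  get_candidate_totals() %>%
  filter(type_of_funds == "receipts")
```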
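And a sketch of item 8: roxygen @param text that points back to the source documentation, with a jargon-free interpretation stored as a comment attribute on a column of the tidy output. The function and column names here are made up for illustration:

```r
#' Get candidate fundraising totals (illustrative only)
#'
#' @param candidate_id FEC candidate ID. Official field definitions:
#'   https://api.open.fec.gov/developers/
get_candidate_totals_sketch <- function(candidate_id) {
  totals <- data.frame(
    candidate_id = candidate_id,
    receipts = NA_real_
  )
  # Jargon-free interpretation attached to the variable itself,
  # retrievable later with comment(totals$receipts).
  comment(totals$receipts) <- "Total money the campaign has taken in."
  totals
}
```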

Going to focus on development for a while, will revisit.