tyrauber / census_api

A Ruby Gem for querying the US Census Bureau API
MIT License

Allow text specification of states, counties #4

Open beechnut opened 11 years ago

beechnut commented 11 years ago

Let users indicate states and counties by name instead of numerical code, using hash syntax.

@client.find('P0010001', county: 'Suffolk', state: 'MA')
@client.find('P0010001', county: 'Suffolk County', state: 'Massachusetts')
@client.find('P0010001', county: 25, state: 25)

It should also accept a symbol as a wildcard field name, plural field names, and multiple 'level' values as an array:

@client.find('P0010001', :county)
@client.find('P0010001', :states)
@client.find('P0010001', states:[25,26])

This means the keys will be upcased to become API URL parameter names, and the values will be looked up in a hash and converted to numeric IDs for the URL parameter values.
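For illustration, a minimal sketch of that conversion. `GEO_IDS` and `to_url_params` are hypothetical names, not part of the gem, and the table holds only the IDs mentioned in this thread. A real county lookup would also need to be scoped by state, which this sketch ignores.

```ruby
# Hypothetical lookup table; only the IDs mentioned in this thread are filled in.
GEO_IDS = {
  'STATE'  => { 'MA' => 25, 'MASSACHUSETTS' => 25 },
  'COUNTY' => { 'SUFFOLK' => 25, 'SUFFOLK COUNTY' => 25 }
}.freeze

# Upcase each key into a parameter name and resolve each value to a numeric ID.
# Integers pass through untouched; unknown names raise KeyError.
def to_url_params(levels)
  levels.each_with_object({}) do |(key, value), params|
    name = key.to_s.upcase
    params[name] = value.is_a?(Integer) ? value : GEO_IDS.fetch(name).fetch(value.to_s.upcase)
  end
end

to_url_params(county: 'Suffolk', state: 'MA') #=> {"COUNTY"=>25, "STATE"=>25}
to_url_params(county: 25, state: 25)          #=> {"COUNTY"=>25, "STATE"=>25}
```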

When multiple geometry parameters need to be specified for 'in', I imagine the following:

@client.find('P0010001', :submcd, {state: 72, county: 127, cousub: 79693})

tyrauber commented 11 years ago

This is good. I like it.

A rake task would be required to query each of the summary levels on a state-by-state basis and build out the dictionary with the responses. But this feature would also allow the gem to return human-readable results:

[{"P0010001"=>"722023", "P0390001"=>"140412", "county"=>"Suffolk County", "state"=>"Massachusetts",  "state_id"=>"25", "county_id"=>"025"}] 
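A rough sketch of how a raw row could be decorated into that shape. It assumes the API row carries the IDs under 'state' and 'county'; `STATE_NAMES`, `COUNTY_NAMES`, and `humanize` are hypothetical names standing in for the generated dictionary.

```ruby
# Hypothetical name tables keyed by the ID strings the API returns.
STATE_NAMES  = { '25' => 'Massachusetts' }.freeze
COUNTY_NAMES = { %w[25 025] => 'Suffolk County' }.freeze

# Move the raw IDs to *_id keys and put readable names under the original keys.
# IDs with no known name are left as they are.
def humanize(row)
  state_id  = row['state']
  county_id = row['county']
  row.merge(
    'state'     => STATE_NAMES.fetch(state_id, state_id),
    'county'    => COUNTY_NAMES.fetch([state_id, county_id], county_id),
    'state_id'  => state_id,
    'county_id' => county_id
  )
end

raw = { 'P0010001' => '722023', 'state' => '25', 'county' => '025' }
humanize(raw)
```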

Just a couple issues to think through:

beechnut commented 11 years ago

I like how you structured the return object. You're right - gives the user more to work with.

{AIANNH: {short_name: 'American Indian Area', long_name: 'American Indian Area/Alaska Native Area/Hawaiian Home Land'}}

and return the short_name in the results object.

Parent-Independent Querying

I definitely want to be able to query objects independent of nesting, and it would be fantastic if the gem allowed this. However, I think that until the API itself can handle parent-independent querying, we should enable it only for objects nested one level deep.

Those levels would be:

To enable parent-independent queries for fields that are nested one level deep, e.g. ZCTAs (nested in STATE), we only have to know the 52 IDs for the states, running something like:

(1..52).each { |id| @client.find("P0010001", "ZCTA5", "STATE:#{id}") }

However, for multiple nesting levels, we would need to know all of the ids of every level above it, and querying, say, all block groups would return tens of thousands of objects.
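One caveat on the one-level case: the state IDs are FIPS codes, which are not a contiguous 1..52. They run from 1 to 56 with five unused values, plus 72 for Puerto Rico, which is how you get 52. A sketch under that assumption, where `STATE_IDS` and `find_in_all_states` are hypothetical names and `client` is anything that responds to `find` with the three-argument form used above:

```ruby
# FIPS state codes: 1..56 minus the unused 3, 7, 14, 43, 52, plus 72 (Puerto Rico).
STATE_IDS = ((1..56).to_a - [3, 7, 14, 43, 52] + [72]).freeze

# Query one nested level once per state and flatten the rows into a single array.
def find_in_all_states(client, field, level)
  STATE_IDS.flat_map { |id| client.find(field, level, "STATE:#{id}") }
end

# Usage, with a real client:
#   find_in_all_states(@client, 'P0010001', 'ZCTA5')
```

This is still 52 requests per query, so caching the responses would probably be worth it.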

One more question here is how to look up, for example, a single state-independent ZCTA, as in:

@client.find('P0010001', zcta: '02139') # string, since a bare 02139 is an invalid octal literal in Ruby

beechnut commented 11 years ago

FYI I have the hash syntax (not the text lookup) for the level parameter in a good spot. Will jam on 'within' tonight/tomorrow and push those changes soon. Still using numerical IDs.

tyrauber commented 11 years ago

Good stuff.

I created another gem around the same time as this one, census_shapes, which imports the census summary level boundaries into a postgis database.

There are a couple of files in there that might be of use here:

Additionally:

With regard to creating the geography dictionary, I would probably write a script to create YAML, like us_states, for every summary level. The only additional data I would add would be the parent hierarchy.

In fact, if I remember correctly, for the TIGER dataset every state has an SF1 file which serves as an index. That SF1 file contains a list of every geography in the state at every level. It's unfortunately not a CSV, and not easily parsed, but I have some code somewhere that will do it.

With that being said, it might just be easier to write a rake task that queries the census api to build the index with the results.

tyrauber commented 11 years ago

Got the following message via email from github / @beechnut, but it didn't show up even though the email link brought me here. Pasting it in and commenting for posterity.

Thanks for the clarification on summary level vs geography. (Still learning!) Looks like geography is the descending center column on p.16 of the SF1 PDF, and the sumlevs are in the wings as well as the center column -- right?

I spent a month trying to grok that damn SF1 doc. Don't worry about it. When I write 'geography', I mean an actual physical geographic entity. Summary levels are types of geography as determined by the US Census. So California is a 'geography', and the sumlevel is 040, STATE. On page 16, that diagram shows the relationship between all sumlevels from a hierarchical point of view.

@client.locate

Thanks for clarifying. So if I understand, @client.lookup 'Suffolk' would return the ids for geographies that match (or amatch) 'Suffolk'. Ideally we'd see parent info here too, when relevant:

Suffolk County: [id: 25, state: 25, state_name: "Massachusetts"]
Suffolk County: [id: 103, state: 99, state_name: "State"]
Suffolk city: [id: 800, state: 99, state_name: "State"] # didn't know where the other Suffolks were. If only we had a function to return this! :D

Correct. Basically, I was just suggesting a way to do quick geographical look up - especially if we get fuzzy search in there - so you can find the proper geography before querying the census api.
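A bare-bones sketch of such a lookup, using a case-insensitive substring match as a stand-in for real fuzzy matching. `COUNTIES` and `lookup` are hypothetical, and the table is a hand-written sample rather than generated data.

```ruby
# Hand-written sample rows; a real table would be generated from the API or SF1 files.
COUNTIES = [
  { name: 'Plymouth County', id: 23, state: 25, state_name: 'Massachusetts' },
  { name: 'Suffolk County',  id: 25, state: 25, state_name: 'Massachusetts' }
].freeze

# Return every geography whose name contains the query, parent info included.
# Swapping include? for a fuzzy matcher would give the approximate version.
def lookup(query)
  needle = query.downcase
  COUNTIES.select { |geo| geo[:name].downcase.include?(needle) }
end

lookup('suffolk').map { |geo| geo[:name] } #=> ["Suffolk County"]
```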

Parent Relations

"Some of the summary levels can only be looked up in relation to their parent level. Perhaps we return all matches across all parents? This seems like it would be ok, because results return the name, or id, of the resulting area."

I think the gem should work in a manner consistent with the API: raise an error if the parameters are off, and return strictly the correct results. I do not want to force a user to sift through the results afterwards, and would much rather require them to get the parameters right before the request.

Agreed.
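One way the "raise before the request" behaviour might look. `REQUIRED_PARENTS` and `validate_within!` are hypothetical; the table lists only the nestings mentioned in this thread and would need to be filled in from the API docs.

```ruby
# Hypothetical table: the parent levels each level needs in its 'in' clause.
REQUIRED_PARENTS = {
  'STATE'  => [],
  'ZCTA5'  => ['STATE'],
  'SUBMCD' => %w[STATE COUNTY COUSUB]
}.freeze

# Raise ArgumentError before any request is made if the level is unknown
# or one of its required parents is missing; return true otherwise.
def validate_within!(level, within = {})
  required = REQUIRED_PARENTS.fetch(level.to_s.upcase) do
    raise ArgumentError, "unknown level: #{level}"
  end
  missing = required - within.keys.map { |key| key.to_s.upcase }
  raise ArgumentError, "#{level} requires #{missing.join(', ')}" unless missing.empty?

  true
end

validate_within!(:submcd, state: 72, county: 127, cousub: 79693) #=> true
# validate_within!(:submcd, state: 72) raises "submcd requires COUNTY, COUSUB"
```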

I'm thinking of this specifically for states and counties, but

Pseudocode for the workflow if within else end

Perhaps the post didn't come through because of the above code?

ZCTA sidenote

I'm lobbying the Census Bureau to open up 860 so we don't have to work around this limitation.

860 is the sumlevel ID for ZCTA5 - Zip Code Tabulation Area. My understanding is that the way the Post Office assigns Zip Codes to new addresses is fairly organic, and therefore the boundaries for zipcodes are not very well structured and always in flux. ZCTA5s attempt to solve this by determining the majority zipcode for any given block, and then grouping blocks with the same zipcode into larger geographies, the ZCTA5s.
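A toy sketch of that majority-then-group idea as described here, not the Census Bureau's actual procedure. The block IDs and zipcodes are made up, and `group_into_zctas` is a hypothetical name.

```ruby
# Made-up input: block ID => zipcodes of the addresses in that block.
BLOCK_ZIPS = {
  'block_a' => %w[02139 02139 02138],
  'block_b' => %w[02139],
  'block_c' => %w[02138 02138 02139]
}.freeze

# Pick the most common zipcode per block, then group blocks that share it.
def group_into_zctas(block_zips)
  block_zips
    .transform_values { |zips| zips.tally.max_by { |_zip, count| count }.first }
    .group_by { |_block, zip| zip }
    .transform_values { |pairs| pairs.map(&:first) }
end

group_into_zctas(BLOCK_ZIPS)
#=> {"02139"=>["block_a", "block_b"], "02138"=>["block_c"]}
```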

P.S. Thing I wish I'd known a few days ago, so I'm writing it explicitly for any future devs who come through here: census_shapes.yml contains all the queryable summary levels, as defined by the Census Bureau in these two docs: ACS5 SF1

Yeah, sorry, wish I would have remembered sooner. Been a while since I worked on this stuff.

beechnut commented 11 years ago

Just posted the actual comment -- I'd accidentally hit 'Comment' before I was finished. Thank you for the seemingly precognizant feedback!

EDIT: Annnd now it looks like the actual comment didn't get posted. Ugh.

beechnut commented 11 years ago

Anyway, what didn't come through was, with a YAML file containing states and counties, it's not hard to search for nested geographies.

---
- name: Massachusetts
  id: 25
  counties:
  - name: Plymouth County
    id: 23
  - name: Suffolk County
    id: 25
  - name: Worcester County
    id: 27
...

And to get the right ids (pseudocode):

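@client.find('P0010001', county: 'Suffolk County', state: 'MA')
y = YAML.load(File.read('lib/yml/states_test.yml'))

# within = the parent pair (state: 'MA'), level = the requested pair (county: 'Suffolk County')
state = y.find { |e| e['abbr'] == within.value || e['name'] == within.value } #=> Object for Massachusetts
# counties is a list, so match on the name; pluralize comes from ActiveSupport
county = state[level.key.pluralize].find { |c| c['name'] == level.value }['id'] #=> 25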
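A self-contained version of the same idea that can be run as is. It assumes each state entry also carries an 'abbr' field, which the sample document does not show, and `STATES` and `resolve_ids` are hypothetical names.

```ruby
require 'yaml'

# Inline stand-in for the states file, with an assumed 'abbr' field added.
STATES = YAML.safe_load(<<~YML)
  - name: Massachusetts
    abbr: MA
    id: 25
    counties:
    - name: Suffolk County
      id: 25
YML

# Resolve a state by name or abbreviation, then a county by name within it.
# Raises ArgumentError when either lookup finds nothing.
def resolve_ids(county:, state:)
  state_entry = STATES.find { |e| e['abbr'] == state || e['name'] == state }
  raise ArgumentError, "unknown state: #{state}" unless state_entry

  county_entry = state_entry['counties'].find { |c| c['name'] == county }
  raise ArgumentError, "unknown county: #{county}" unless county_entry

  { state: state_entry['id'], county: county_entry['id'] }
end

resolve_ids(county: 'Suffolk County', state: 'MA') #=> {:state=>25, :county=>25}
```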

tyrauber commented 11 years ago

Yeah, I thought of this. My only concern was that not all sumlevels are hierarchical under states. But perhaps those sumlevels are just at the root of the document, like state? If that is the case, then we would want to add a type field,

type: 'STATE'

To differentiate between the different root document sumlevels. But in that way you could have all the geographies in one document, which I am fine with.
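Roughly what that single document could look like, and how the roots would be told apart. The second entry is a placeholder with a made-up name and id, and `GEO_DOC` and `roots_of_type` are hypothetical names.

```ruby
require 'yaml'

# One document, several root sumlevels, each tagged with a 'type'.
# The AIANNH entry is a placeholder, not real data.
GEO_DOC = YAML.safe_load(<<~YML)
  - type: STATE
    name: Massachusetts
    id: 25
  - type: AIANNH
    name: Example Area
    id: 1
YML

# Select only the root entries of one sumlevel type.
def roots_of_type(doc, type)
  doc.select { |entry| entry['type'] == type }
end

roots_of_type(GEO_DOC, 'STATE').map { |entry| entry['name'] } #=> ["Massachusetts"]
```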

beechnut commented 11 years ago

Correct -- right now, all the API's sumlevels nest under state, or are top-level. If you take a look at these SF1 and ACS5 API docs, every field is prefixed with state-, or is top-level.

And as the API gets more complex, we'll be able to add other sumlevel documents for other roots.

beechnut commented 10 years ago

One year later, I'm returning to this, as I'm going to need this gem for a project at work in the near future. In that year I've gotten much better at Ruby, so I'm looking forward to contributing again.