Open beechnut opened 11 years ago
This is good. I like it.
A rake task would be required to query each of the summary levels on a state by state basis and build out the dictionary with the responses. But this feature would also allow the gem to return human readable results:
[{"P0010001"=>"722023", "P0390001"=>"140412", "county"=>"Suffolk County", "state"=>"Massachusetts", "state_id"=>"25", "county_id"=>"025"}]
Just a couple issues to think through:
I like how you structured the return object. You're right - gives the user more to work with.
@client.locate 'Suffolk'
return? (Not quite following yet.)
{AIANNH: {short_name: 'American Indian Area', long_name: 'American Indian Area/Alaska Native Area/Hawaiian Home Land'}}
and return the short_name
in the results object.
I definitely want to be able to query objects independent of nesting, and it would be fantastic if the gem allowed this. However, I think that until the API itself can handle parent-independent querying, we should enable it only for objects nested one level deep.
Those levels would be:
To enable parent-independent queries for fields that are nested one level deep, e.g. ZCTAs (nested in STATE), we only have to know the 52 IDs for the states, and run something like:
(1..52).each { |id| @client.find("P0010001", "ZCTA5", "STATE:#{id}") }
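One caveat with the loop above: state FIPS codes are zero-padded and not contiguous, so a plain 1..52 range would hit unused codes and miss some states. Here is a minimal sketch of building the list, assuming the standard FIPS assignments (01 to 56 with 03, 07, 14, 43, and 52 unused, plus 72 for Puerto Rico). The `within_clauses` helper name is hypothetical and not part of the gem.

```ruby
# State FIPS codes run from 01 to 56 with gaps; Puerto Rico is 72.
# Assumption: the API expects zero-padded two-digit strings.
UNUSED_CODES = [3, 7, 14, 43, 52].freeze

STATE_FIPS = ((1..56).to_a - UNUSED_CODES).map { |n| format('%02d', n) } + ['72']

# Hypothetical helper: one "STATE:xx" clause per state, matching the
# string format used in the loop above.
def within_clauses(codes = STATE_FIPS)
  codes.map { |code| "STATE:#{code}" }
end

# Usage with the client from this thread (not run here):
# within_clauses.each { |w| @client.find('P0010001', 'ZCTA5', w) }
```

This still comes to 52 IDs, which matches the count mentioned above.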
However, for multiple nesting levels, we would need to know all of the ids of every level above it, and querying, say, all block groups would return tens of thousands of objects.
One more question here is how to look up, for example, a single state-independent ZCTA, as in:
@client.find('P0010001', zcta: '02139')
FYI I have the hash syntax (not the text lookup) for the level parameter in a good spot. Will jam on within tonight/tomorrow and push those changes soon. Still using numerical IDs.
Good stuff.
I created another gem around the same time as this one, census_shapes, which imports the census summary level boundaries into a postgis database.
There are a couple of files in there that might be of use here:
Additionally:
The main hierarchy structure is:
STATE > COUNTY > TRACT > BLOCKGROUP > BLOCK
Also worth noting: the ID at each level encodes its parents.
state (2 digits) county (3 digits) census tract (6 digits) block (4 digits, the first of which is the block group)
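To make the nesting concrete, here is a rough sketch of splitting a full block ID into its parents. It assumes the standard 15-digit block GEOID, in which the block code is 4 digits and its first digit is the block group. `split_geoid` is a hypothetical helper, and the sample ID is illustrative only.

```ruby
# Widths of each component in a 15-digit block GEOID.
# The block group is the first digit of the 4-digit block code.
GEOID_PARTS = { state: 2, county: 3, tract: 6, block: 4 }.freeze

# Hypothetical helper: slice the ID left to right by component width.
def split_geoid(geoid)
  raise ArgumentError, 'expected a 15-digit string' unless geoid.match?(/\A\d{15}\z/)

  offset = 0
  parts = GEOID_PARTS.each_with_object({}) do |(name, width), out|
    out[name] = geoid[offset, width]
    offset += width
  end
  parts.merge(block_group: parts[:block][0])
end

# Illustrative ID only; not taken from real data.
p split_geoid('250250001001000')
```

IDs have to stay strings throughout, since leading zeros are significant.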
Additionally, SD, CD, SLDU, SLDL, and PLACE are under STATE, and VD and COUSUB are under COUNTY.
See page 16 of the Census SF1 PDF
With regard to creating the geography dictionary, I would probably write a script to create YAML, like us_states, for every summary level. The only additional data I would add would be the parental hierarchy.
In fact, if I remember correctly, for the TIGER dataset every state has an SF1 file which serves as an index. That SF1 file contains a list of every geography in the state at every level. It's unfortunately not a csv, and not easily parsed, but I have some code somewhere that will do it.
With that being said, it might just be easier to write a rake task that queries the census api to build the index with the results.
Got the following message via email from github / @beechnut, but it didn't show up even though the email link brought me here. Pasting it in and commenting for posterity.
Thanks for the clarification on summary level vs geography. (Still learning!) Looks like geography is the descending center column on p.16 of the SF1 PDF, and the sumlevs are in the wings as well as the center column -- right?
I spent a month trying to grok that damn sf1 doc. Don't worry about it. When I say geography, I mean an actual physical geographic entity. Summary levels are types of geography as determined by the US Census. So California is a 'geography', and its sumlevel is 040, STATE. The diagram on page 16 shows the relationship between all sumlevels from a hierarchical point of view.
@client.locate
Thanks for clarifying. So if I understand, @client.lookup 'Suffolk' would return the ids for geographies that match (or amatch) 'Suffolk'. Ideally we'd see parent info here too, when relevant:
Suffolk County: [id: 25, state: 25, state_name: "Massachusetts"]
Suffolk County: [id: 103, state: 99, state_name: "State"]
Suffolk city: [id: 800, state: 99, state_name: "State"]
# didn't know where the other Suffolks were. If only we had a function to return this! :D
Correct. Basically, I was just suggesting a way to do quick geographical look up - especially if we get fuzzy search in there - so you can find the proper geography before querying the census api.
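A rough sketch of what that quick lookup could look like, assuming a small in-memory dictionary. The `locate` function below is a hypothetical stand-in for @client.locate, and a plain case-insensitive substring match stands in for real fuzzy search. The entries are the Massachusetts examples from this thread.

```ruby
# Toy dictionary; names and IDs are the Massachusetts examples from this thread.
GEOGRAPHIES = [
  { name: 'Massachusetts', type: 'STATE', id: '25' },
  { name: 'Plymouth County', type: 'COUNTY', id: '023', state: '25', state_name: 'Massachusetts' },
  { name: 'Suffolk County', type: 'COUNTY', id: '025', state: '25', state_name: 'Massachusetts' },
  { name: 'Worcester County', type: 'COUNTY', id: '027', state: '25', state_name: 'Massachusetts' }
].freeze

# Hypothetical stand-in for @client.locate: case-insensitive substring match
# that returns each hit together with its parent info.
def locate(query, dictionary = GEOGRAPHIES)
  needle = query.downcase
  dictionary.select { |geo| geo[:name].downcase.include?(needle) }
end

locate('suffolk').each do |geo|
  puts "#{geo[:name]}: id #{geo[:id]}, state #{geo[:state]} (#{geo[:state_name]})"
end
```

Swapping the `include?` check for an approximate matcher would give the fuzzy behavior without changing the return shape.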
Parent Relations
"Some of the summary levels can only be looked up in relation to their parent level. Perhaps we return all matches across all parents? This seems like it would be ok, because results return the name, or id, of the resulting area."
I think the gem should work in a manner consistent with the API: raise an error if the parameters are off, and return strictly the correct results. I do not want to force a user to sift through the results afterwards, and would much rather require them to get the parameters right before the request.
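For illustration, a strict check along those lines might look like the sketch below. The parent table covers only a few of the levels mentioned in this thread, and `validate_geography!` is a hypothetical name, not something the gem is known to provide.

```ruby
# Parent requirements for a few levels, taken from the hierarchy described
# in this thread (STATE > COUNTY > TRACT; PLACE under STATE; COUSUB under COUNTY).
REQUIRED_PARENTS = {
  'STATE'  => [],
  'COUNTY' => ['STATE'],
  'PLACE'  => ['STATE'],
  'COUSUB' => ['STATE', 'COUNTY'],
  'TRACT'  => ['STATE', 'COUNTY']
}.freeze

# Hypothetical check: raise before any request is made if a required
# parent level is missing from the `within` hash.
def validate_geography!(level, within = {})
  required = REQUIRED_PARENTS.fetch(level) do
    raise ArgumentError, "unknown summary level: #{level}"
  end
  missing = required - within.keys.map { |k| k.to_s.upcase }
  raise ArgumentError, "#{level} requires #{missing.join(', ')}" unless missing.empty?

  true
end

validate_geography!('COUNTY', state: '25')   # passes
# validate_geography!('TRACT', state: '25')  # would raise: TRACT requires COUNTY
```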
Agreed.
I'm thinking of this specifically for states and counties, but
Pseudocode for the workflow if within else end
Perhaps the post didn't come through because of the above code?
ZCTA sidenote
I'm lobbying the Census Bureau to open up 860 so we don't have to work around this limitation.
860 is the sumlevel ID for ZCTA5 (ZIP Code Tabulation Area). My understanding is that the way the Post Office assigns ZIP codes to new addresses is fairly organic, so ZIP code boundaries are not well structured and are always in flux. ZCTA5s attempt to solve this by determining the majority ZIP code for any given block, and then grouping blocks with the same ZIP code into larger geographies: ZCTA5s.
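As a toy illustration of that two-step idea (majority ZIP per block, then group blocks by ZIP), here is a small sketch. The block names and address lists are made up, and this only mirrors the description above, not the Census Bureau's actual procedure.

```ruby
# Illustrative input: block ID => ZIP codes of the addresses in that block.
addresses_by_block = {
  'block_a' => %w[02139 02139 02138],
  'block_b' => %w[02139],
  'block_c' => %w[02138 02138 02139]
}

# Step 1: pick the most common ZIP code in each block.
majority_zip = addresses_by_block.transform_values do |zips|
  zips.tally.max_by { |_zip, count| count }.first
end

# Step 2: group blocks that share a majority ZIP code into one area.
zctas = majority_zip.keys.group_by { |block| majority_zip[block] }

zctas.each { |zip, blocks| puts "#{zip}: #{blocks.join(', ')}" }
```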
P.S. Thing I wish I'd known a few days ago, so I'm writing it explicitly for any future devs who come through here: census_shapes.yml contains all the queryable summary levels, as defined by the Census Bureau in these two docs: ACS5 SF1
Yeah, sorry, wish I would have remembered sooner. Been a while since I worked on this stuff.
Just posted the actual comment -- I'd accidentally hit 'Comment' before I was finished. Thank you for the seemingly precognizant feedback!
EDIT: Annnd now it looks like the actual comment didn't get posted. Ugh.
Anyway, what didn't come through was, with a YAML file containing states and counties, it's not hard to search for nested geographies.
---
- name: Massachusetts
  id: 25
  counties:
  - name: Plymouth County
    id: 23
  - name: Suffolk County
    id: 25
  - name: Worcester County
    id: 27
...
And to get the right ids (pseudocode):
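@client.find('P0010001', county: 'Suffolk County', state: 'MA')
require 'yaml'
y = YAML.load_file('lib/yml/states_test.yml')
# Match the state by abbreviation or full name.
state = y.find { |e| e['abbr'] == within.value || e['name'] == within.value } #=> Object for Massachusetts
# Counties are a list of name/id pairs, so search by name and take the id (pluralize comes from ActiveSupport).
county = state[level.key.pluralize].find { |c| c['name'] == level.value }['id'] #=> 25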
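For anyone who wants to try the idea without the gem's internals, here is a self-contained sketch. It inlines the sample YAML from above and adds an `abbr` field, which is an assumption since the sample does not show one. `resolve_ids` is a hypothetical helper name.

```ruby
require 'yaml'

# Inline copy of the sample above; 'abbr' is an assumed extra field.
STATES = YAML.safe_load(<<~YML)
  - name: Massachusetts
    abbr: MA
    id: 25
    counties:
      - name: Plymouth County
        id: 23
      - name: Suffolk County
        id: 25
      - name: Worcester County
        id: 27
YML

# Hypothetical helper: resolve a state (by name or abbreviation) and a
# county name to their numeric IDs. Returns nil if either is not found.
def resolve_ids(state:, county:, states: STATES)
  state_entry = states.find { |s| s['abbr'] == state || s['name'] == state }
  return nil unless state_entry

  county_entry = state_entry['counties'].find { |c| c['name'] == county }
  return nil unless county_entry

  { state: state_entry['id'], county: county_entry['id'] }
end

p resolve_ids(state: 'MA', county: 'Suffolk County')
```

Returning nil keeps the sketch short. Raising an error instead would fit better with the strict behavior discussed earlier in the thread.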
Yeah, I thought of this. My only concern was that not all sumlevels are hierarchical under states. But perhaps those sumlevels are just at the root of the document, like state? If that is the case, then we would want to add a type field,
type: 'STATE'
To differentiate between the different root document sumlevels. But in that way you could have all the geographies in one document, which I am fine with.
One year later, I'm returning to this, as I'm going to need this gem for a project at work in the near future. In that year I've gotten much better at Ruby, so I'm looking forward to contributing again.
Let users indicate states and counties by name instead of numerical code, using hash syntax.
Also should accept symbol as a wildcard field name, plural field names, and multiple 'level' values as an array:
This will mean the keys will be upcased to become API URL parameter names. The values will be looked up in a hash and converted to digits for the URL parameter values.
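A minimal sketch of that conversion, assuming a flat lookup table keyed by level. The `LOOKUP` table and `geo_params` name are hypothetical, and a real table would need to scope county names by state, since county names repeat across states.

```ruby
# Assumed lookup table; a real one would be built from the geography index.
LOOKUP = {
  'STATE'  => { 'Massachusetts' => '25' },
  'COUNTY' => { 'Suffolk County' => '025' }
}.freeze

# Hypothetical helper: upcase each key into a parameter name and convert
# each value to digits. Values that are already numeric pass through.
def geo_params(geo)
  geo.map do |key, value|
    level = key.to_s.upcase
    text = value.to_s
    id = text.match?(/\A\d+\z/) ? text : LOOKUP.fetch(level).fetch(text)
    "#{level}:#{id}"
  end
end

p geo_params(state: 'Massachusetts', county: 'Suffolk County')
p geo_params(state: 25)
```

Using `fetch` means an unknown name raises a KeyError instead of silently producing a bad URL parameter.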
When multiple geometry parameters need to be specified for 'in', I imagine the following: