Closed by jefffriesen 2 years ago
@jefffriesen I realize that the fact that order matters in the `geoHierarchy` object is a bit tricky, which is why I added that caveat. I'm sure it was frustrating to be doing the "right" thing (conventionally) with your argument, just to get an error that didn't point you in the right direction. I'm sorry for that.
I suppose this could be handled by the library, but here's an example list of supported geographies. Checking for every potential match among this variety of patterns (and the patterns themselves sometimes vary between vintages - i.e., not every vintage allows the same wildcards or uses the same geography names/predicates) would not only make the library more brittle (e.g., we just changed the geography names - regretfully - last year), but might significantly slow down the processing of the request.
I hope you understand that this would incur a significant maintenance cost that - when measured against the value it would provide - seems IMHO expensive.
That having been said, I will definitely emphasize the caveat to try to bring more attention to it. If you have any ideas of how we might better document the caveat, please let me know.
@loganpowell Thanks for your response. I see what you're saying about the large number of geographies. Of course, you could capture their inherent hierarchy and check against that, but you're right, it would be time consuming and fragile.
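As a rough illustration of "capture the hierarchy and check against it", here is a minimal sketch. It assumes a single fixed ordering for four levels only; `KNOWN_ORDER` and `checkGeoOrder` are hypothetical names, not part of CitySDK, and the real hierarchies are far more varied, which is exactly the maintenance concern raised above.

```javascript
// Hypothetical subset of the geography hierarchy, broadest first.
// The real list is much longer and differs between datasets and vintages.
const KNOWN_ORDER = ["state", "county", "tract", "block group"];

// Returns null when the keys are in broad-to-narrow order,
// otherwise a message that names the offending pair of keys.
function checkGeoOrder(geoHierarchy) {
  let lastRank = -1;
  let lastKey = null;
  for (const key of Object.keys(geoHierarchy)) {
    const rank = KNOWN_ORDER.indexOf(key);
    if (rank === -1) continue; // unknown level: this sketch cannot judge it
    if (rank < lastRank) {
      return `"${key}" is broader than "${lastKey}" and should come before it`;
    }
    lastRank = rank;
    lastKey = key;
  }
  return null;
}

console.log(checkGeoOrder({ state: "06", county: "*" })); // null
console.log(checkGeoOrder({ county: "*", state: "06" }));
// "state" is broader than "county" and should come before it
```

A message like this would at least point at the ordering problem instead of a generic invalid-geography error.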
I ran into another fairly time-consuming issue where I didn't know what to call `block-group`. I know this could be considered another ticket or a Stack Overflow question, but I'm bringing it up here because it's related to the difficulties of figuring out geoHierarchies. So for this example, I got the right hierarchy (state -> county -> tract -> block-group) but I didn't know how to write it. At first I tried `blockGroup` and `"block group"`. Those didn't work and I was getting the same generic error message of invalid geography. If I had known the problem was just how it was written, I would have kept trying variations. But I assumed I had the wrong combination of dataset (acs1 vs acs5) and vintage, and that the geography wasn't supported.
Now that I know that it is `block-group`, I can search and find it in the repo in the tests and the index.edn file (reminder to myself to always check the tests!).
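To show the kind of spelling variation described here, the sketch below reduces a few spellings to one hyphenated form. `toSausageCase` is a hypothetical helper and not something CitySDK exposes; it only covers simple names and deliberately ignores names containing slashes or parentheses.

```javascript
// Hypothetical helper, not part of CitySDK: reduce common spellings of a
// simple geography name to one lowercase, hyphen-separated form.
function toSausageCase(name) {
  return name
    .replace(/([a-z0-9])([A-Z])/g, "$1-$2") // blockGroup -> block-Group
    .trim()
    .toLowerCase()
    .replace(/\s+/g, "-"); // runs of whitespace -> single hyphen
}

const spellings = ["blockGroup", "block group", "Block Group", "block-group"];
console.log(spellings.map(toSausageCase));
// every entry becomes "block-group"
```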
You asked if I had ideas on how to document better - I'm not sure, but it feels like there's something we could do around documenting hierarchy and naming conventions. For example, I see you're `sausage-casing` variable names. I write Clojure and appreciate that, but that may not be an obvious thing to try. Even if you did guess it, I'm not sure how you would handle certain cases, like `place/remainder (or part)`, or `american indian area/alaska native area/hawaiian home land (or part)`. I could guess from the example in the tests (`school-district-_elementary_`). But even then, you may have guessed the casing correctly but got the dataset or vintage wrong and the error won't tell you.
I don't have any clear wins - I see what you're up against. I wonder if at least one of us could make a gist that shows this document converted (https://api.census.gov/data/2010/dec/sf1/geography.html) into geoHierarchies. Then it would be a bunch of examples (always helpful) that shows proper sausage casing and emphasizes the hierarchy. Just an idea.
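A gist like that could be generated rather than written by hand. The sketch below assumes each hierarchy is available as level names, broadest first, separated by "›" or ">"; that separator, the function name `toGeoHierarchy`, and the sample codes are illustrative assumptions. Whether a wildcard is accepted at a given level depends on the dataset and vintage, so the `"*"` default is only a placeholder.

```javascript
// Assumption: a hierarchy is given as level names, broadest first, separated
// by "›" (or ">"). The separator and the sample values are illustrative.
function toGeoHierarchy(hierarchyText, values = {}) {
  const levels = hierarchyText
    .split(/[›>]/)
    .map((level) => level.trim())
    .filter((level) => level.length > 0);

  const geoHierarchy = {};
  for (const level of levels) {
    // Keys are inserted broadest first, so the object keeps the hierarchy order.
    // "*" is a placeholder; not every level accepts a wildcard.
    geoHierarchy[level] = values[level] ?? "*";
  }
  return geoHierarchy;
}

const example = toGeoHierarchy("state › county › tract › block group", {
  state: "06",
  county: "075",
});
console.log(JSON.stringify(example));
// {"state":"06","county":"075","tract":"*","block group":"*"}
```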
@jefffriesen actually, though you can use sausage-case, the documented geography names are the recommended names for the geographic areas when using CitySDK.
It's important to note that not all geographies that are supported by the API are also supported by CitySDK (due to the cartographic boundary files not containing the missing subset). You can see the supported geographies at the base of the README:
https://github.com/uscensusbureau/citysdk#geographies-available-by-vintage
You mentioned `"block group"` didn't work for you, but `"block-group"` did. Can you post the arguments you're working with?
@loganpowell Thanks for your reply. Well, I just double-checked the query and tried both sausage case ("block-group") and space case ("block group"). They both worked as you said they should. I must have seen the error when I had an incorrect dataset, vintage or ordering of geoHierarchies.
I think fundamentally this is a complicated set of data and APIs. There are lots of opportunities to get confused, and I can't think of any ways right now that citySDK, as a layer on top of them, can simplify it or document it better. I'll close this ticket. If I think of something, I'll create a new ticket, and a PR if you like the suggestion. Thank you for your help.
@jefffriesen thank you for looking into it further. I definitely welcome any PRs that make this easier for users. Just for some background, the last version of CitySDK provided aliases for some popular variables, but only did so for ACS and Decennial and only for a couple of vintages. While this worked OK for those datasets and vintages, it was still pretty brittle because - currently - the variables aren't consistent between vintages and are still subject to change. The `v2` citysdk now covers every endpoint made available via the Census API...
I have been pestering the API team to take the variable inconsistencies between vintages more seriously for usability reasons (also to relieve the cognitive burden on library maintainers, of which citysdk is only one of a variety of libraries currently available for various languages).
The problem is that the ownership of the data is in one set of hands, the variables in another and the API in another. You can imagine this creates some disconnects between the incentives of various internal stakeholders who have their own needs/use cases for the various components.
The primary use case of the citysdk is for helping users create web maps with Census data/geographies, hence the focus on integration between cartography files (which are far smaller/faster to work with than full resolution shapefiles), geocoding (for geolocation based mapping) and the Census statistics API. The way I explain it is that citysdk gives you three APIs for the price of one (the Census statistics API). However, the price of that API is - as you've mentioned - not cheap (it's "complicated"). I typically recommend beginners to Census data start by finding their variables of interest using https://data.census.gov and then - once they know what to look for - go to the API discovery tool at https://api.census.gov/data.html (the "discovery" tool). Also, there's an open issue that would integrate the discovery tool into citysdk:
https://github.com/uscensusbureau/citysdk/issues/339
Also, you mentioned that you know Clojure, that's awesome! I would love some help getting this prepped and launched as a Jar for the Clojure community, but I've never done that. I've tried to make the library friendly to both Clojure and ClojureScript (`.cljc`) with the appropriate `#?` macros, but could definitely use some assistance if you'd be willing. Let me know if so and I'll share the link to the subtree that feeds into this repo and powers the actual NPM package. That's where the Clojure code is isolated.
@loganpowell
Thanks for the background on this. I see what you're up against.
I could see a reason for aliases, but I can see a lot of problems with it, as you've pointed out. I think more useful than aliases would be examples that include the codes that those aliases represent.
This guide, by far, was the most useful to me: https://uscensusbureau.github.io/citysdk/docs/. Second most useful was this inner page of the developer's microsite: https://www.census.gov/data/developers/guidance/api-user-guide.Query_Examples.html. It's much more useful than the main site https://www.census.gov/developers/. The main site is too general - I looked at it several times and left because it didn't look immediately useful.
Back to this site: https://uscensusbureau.github.io/citysdk/. Again - this is very useful. You have lots of examples already (Examples tab and Overview). But what if there was a tab called "Common Queries" or "Aliases" that shows the most common queries? So instead of trying to support aliases, provide examples with the actual codes the aliases represent. It would be nice if there were variations of geoHierarchies, including ones that use spaces and parentheses so there's no confusion. A nice addition would be to have the URL query representation for each citySDK object. That page wouldn't have to have a lot of explanation - the rest of the guide does this well. You could ask the maintainers of the API for the top queries (which probably more or less map to aliases).
I would think this one page I'm describing could serve many users' needs. If I had it, I think I would have had very quick success and quite a bit of ambiguity cleared up. In the README, I would put that page near the top of the document and label it something along the lines of "immediate gratification".
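For the "URL query representation" idea, here is a minimal sketch of how a citySDK-style argument object could be shown next to an equivalent Census API URL. `citySdkArgsToUrl` is a hypothetical helper; it assumes plain codes in `geoHierarchy` (no lat/lng geocoding) and the common `get`/`for`/`in` query layout, and it is not a claim about the request CitySDK builds internally.

```javascript
// Hypothetical helper: render a citySDK-style argument object as a Census API URL.
// Assumes geoHierarchy holds plain codes or "*" and is ordered broadest first.
function citySdkArgsToUrl({ vintage, sourcePath, values, geoHierarchy }) {
  const levels = Object.entries(geoHierarchy).map(
    ([name, value]) => `${name}:${value}`
  );
  const forLevel = levels.pop(); // the last (narrowest) level becomes "for"
  const params = [`get=${values.join(",")}`, `for=${encodeURI(forLevel)}`];
  if (levels.length > 0) {
    // Broader levels go into "in", separated by (encoded) spaces.
    params.push(`in=${encodeURI(levels.join(" "))}`);
  }
  return `https://api.census.gov/data/${vintage}/${sourcePath.join("/")}?${params.join("&")}`;
}

const queryUrl = citySdkArgsToUrl({
  vintage: 2018,
  sourcePath: ["acs", "acs1"],
  values: ["NAME", "B08303_001E"],
  geoHierarchy: { state: "06", county: "*" },
});
console.log(queryUrl);
// https://api.census.gov/data/2018/acs/acs1?get=NAME,B08303_001E&for=county:*&in=state:06
```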
I know creating a lot of examples can create a lot of maintenance overhead. If the citySDK query object has breaking changes, then you may have to update the examples. But the queries - the combination of geoHierarchies, datasets and vintages should not break as long as the Census API doesn't create breaking changes. They may not represent the latest vintage at some point, but they shouldn't break.
> I typically recommend beginners to Census data start by finding their variables of interest using https://data.census.gov and then - once they know what to look for - go to the API discovery tool at https://api.census.gov/data.html (the "discovery" tool). Also, there's an open issue that would integrate the discovery tool into citysdk
I think the approach you outline now is good. https://data.census.gov is so much nicer than American Fact Finder. Going between that site and the discovery tool is clunky though. It's ok once I have it all in my mental model, but in a year, I am going to have to go through your guide again to remember how it all works.
I read ticket #339. I'm not clear how the discovery tool would work through the API but it sounds interesting.
What I think would be really convenient would be for the API URL to be built into the tables. So find "commute time" (https://data.census.gov/cedsci/table?q=commute%20time&g=&hidePreview=false&table=B08303&tid=ACSDT1Y2018.B08303&lastDisplayedRow=12&vintage=2018&mode=). Be able to pick a geography and generate the API URL right on that page. Or even just default to state='*' and figure out geographies later. I realize that work would have to be done by a different team.
With easy access to generated API URLs, you could build a little converter that took a URL and created the citySDK object. That would be kind of fun.
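Such a converter could look roughly like the sketch below. `urlToCitySdkArgs` is a hypothetical name, and the code assumes the common URL layout `/data/<vintage>/<dataset path>?get=...&for=...&in=...`; other parameters (such as an API key or predicates) are ignored.

```javascript
// Sketch: turn a Census API data URL into a citySDK-style argument object.
// Assumes the layout /data/<vintage>/<dataset path>?get=...&for=...&in=...
function urlToCitySdkArgs(apiUrl) {
  const url = new URL(apiUrl);
  // e.g. ["data", "2018", "acs", "acs1"] -> vintage "2018", sourcePath ["acs", "acs1"]
  const [, vintage, ...sourcePath] = url.pathname.split("/").filter(Boolean);

  // "in" holds the broader levels (possibly several, space-separated or repeated);
  // "for" holds the narrowest level, so it is appended last.
  const geoText = [
    ...url.searchParams.getAll("in"),
    ...url.searchParams.getAll("for"),
  ].join(" ");

  const geoHierarchy = {};
  // Level names may contain spaces ("block group"); values may not.
  for (const [, name, value] of geoText.matchAll(/([^:]+):(\S+)/g)) {
    geoHierarchy[name.trim()] = value;
  }

  const get = url.searchParams.get("get") || "";
  return {
    vintage: /^\d+$/.test(vintage) ? Number(vintage) : vintage,
    geoHierarchy,
    sourcePath,
    values: get.split(",").filter(Boolean),
  };
}

const args = urlToCitySdkArgs(
  "https://api.census.gov/data/2018/acs/acs1?get=NAME,B08303_001E&for=county:*&in=state:06"
);
console.log(JSON.stringify(args));
// {"vintage":2018,"geoHierarchy":{"state":"06","county":"*"},"sourcePath":["acs","acs1"],"values":["NAME","B08303_001E"]}
```

Because `in` is read before `for`, the resulting `geoHierarchy` keys come out broadest first, which is the order the library expects.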
It crossed my mind after I said I write Clojure that I should have been more clear. I know ClojureScript, but I have done very little in Clojure and the JVM. I hope to work more in Clojure over the next year, but right now I wouldn't be much help with that. Sorry. It would be nice to have a Clojure API. I think citySDK is useful for more than client-side mapping applications. I'm using this all for back-end data processing. I am fetching and doing post-processing on the variable results and saving them in tables (including geoids) to be looked up later. I know Python has an API for this type of work, but I would prefer to use Node.js. But I would rather use Clojure or ClojureScript : ).
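As a small illustration of that kind of post-processing, the sketch below indexes result rows by tract GEOID. The row shape (one object per geography, with `state`, `county` and `tract` codes as strings next to the requested variable) is an assumption, and the sample numbers are made up for the example.

```javascript
// Assumed row shape: one object per geography, with the geography codes as
// strings next to the requested variables. The numbers below are made up.
const rows = [
  { state: "06", county: "075", tract: "010100", B08303_001E: 100 },
  { state: "06", county: "075", tract: "010200", B08303_001E: 200 },
];

// A tract GEOID is the state (2 digits), county (3) and tract (6) codes joined.
function tractGeoid(row) {
  return `${row.state}${row.county}${row.tract}`;
}

// Index the rows by GEOID so later lookups do not need to scan the array.
const byGeoid = new Map(rows.map((row) => [tractGeoid(row), row]));

console.log(byGeoid.get("06075010100").B08303_001E); // 100
```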
I have a lot of deadlines unfortunately, so I can't help much, but I might be able to test or do some examples for documentation. Thanks for all your work on this.
@jefffriesen thank you for the thoughtful write-up! You can thank @zhik for the great `io` docs 😄. I will confer with him about your recommendations and will come back to this issue when we have news.
Thank you again for taking the time to submit such a detailed issue. We are very much in the weeds of the implementation and getting this kind of "first blush" knowledge is very helpful indeed. In fact, that's probably why @zhik was able to put together such great docs. He only had a month to do it (from first blush).
Ok, great. Thanks @loganpowell @zhik !
I ran into a bug that was somewhat hard to track down, even though you caveated it in the documentation:
First, the code snippet:
I see this (now) on the documentation:
The problem is that we are working in JavaScript, where historically, you could not rely on the order of object keys. Well, I learned something today, after realizing that I had the wrong order of keys: https://www.stefanjudis.com/today-i-learned/property-order-is-predictable-in-javascript-objects-since-es2015/. Apparently, we can now count on the order of keys.
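To make the ordering rule concrete, here is a small sketch in plain JavaScript (no CitySDK involved). String keys come back in insertion order, while integer-like keys are moved to the front in ascending order, which is the one exception worth knowing about.

```javascript
// Two objects with the same keys written in a different order.
const broadToNarrow = { state: "06", county: "075", tract: "*" };
const narrowToBroad = { tract: "*", county: "075", state: "06" };

// In modern engines, own string keys are enumerated in insertion order.
const orderA = Object.keys(broadToNarrow); // ["state", "county", "tract"]
const orderB = Object.keys(narrowToBroad); // ["tract", "county", "state"]

// Exception: integer-like keys are always listed first, in ascending order.
const mixed = { b: 1, 10: "x", a: 2, 2: "y" };
const orderMixed = Object.keys(mixed); // ["2", "10", "b", "a"]

console.log(orderA, orderB, orderMixed);
```

Geography names are never integer-like, so for a `geoHierarchy` object the keys are read back in exactly the order they were written.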
The problem is that I bet most JS devs would not expect the order to matter and the error message doesn't point you in that direction:
Also, maybe not all JS clients are using ES2015, but I have no idea what the stats are on that.
Behind the scenes, you are converting this geoHierarchy object to an ordered set. There is already a known hierarchy of these geographies. It seems like you could enforce an order before making the API request. Or is it somehow not deterministic?
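A minimal sketch of what "enforce an order before making the API request" could mean is shown below. It assumes one fixed ordering for four levels; `BROAD_TO_NARROW` and `sortGeoHierarchy` are hypothetical names, not part of CitySDK, and real hierarchies have many more levels and more than one valid path.

```javascript
// Hypothetical, simplified ordering (broadest first). The real set of
// geographies is larger and varies by dataset and vintage.
const BROAD_TO_NARROW = ["state", "county", "tract", "block group"];

// Returns a copy of geoHierarchy with known levels ordered broadest first.
// Unknown levels are placed last and keep their relative order.
function sortGeoHierarchy(geoHierarchy) {
  const rank = (key) => {
    const i = BROAD_TO_NARROW.indexOf(key);
    return i === -1 ? BROAD_TO_NARROW.length : i;
  };
  const sorted = {};
  Object.keys(geoHierarchy)
    .sort((a, b) => rank(a) - rank(b))
    .forEach((key) => {
      sorted[key] = geoHierarchy[key];
    });
  return sorted;
}

const fixed = sortGeoHierarchy({ tract: "*", state: "06", county: "075" });
console.log(Object.keys(fixed)); // ["state", "county", "tract"]
```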
If it is possible, it would be nice to have. I don't know how often this comes up, but my first real attempt at using this library, after doing the examples, hit this problem.
Thanks