heatseeknyc / landlord-lookup-gateway

PostgreSQL backend + REST gateway for the Landlord Lookup portal.

Wrap multi-building response structs from NYC Geoclient #1

Closed wstlabs closed 8 years ago

wstlabs commented 8 years ago

TL;DR: we did some edge-case checking on the NYC Geoclient API and realized it doesn't just get confused about certain building-taxlot combinations -- it responds in a different record format, which needs to be parsed and re-shaped to keep our backend handlers from blowing up.

Original (longer) writeup follows below.

We discovered recently that the response structs the NYC Geoclient API returns for multi-building result sets aren't presented quite as gracefully as we'd like.

By "multi-building" we mean cases where an address resolves to multiple building entries (perhaps with different BINs and/or BBLs, but not necessarily), which appear to occur in some 1-3 percent of overall searches.

This situation by itself is somewhat disconcerting because, in principle, addresses should resolve uniquely to distinct BINs in all but degenerate or erroneously entered cases. Yet we're seeing multi-building results, for example, in large property developments in Queens where it's pretty clear (e.g. from looking at OpenStreetMap or our own HPD registrations table) that the addresses should resolve into separate buildings.

For example, the following search returns a dict with 326 keys, instead of the usual 141-143 or so, to accommodate the fact that the GIS backend to the Geoclient API seems unable to disambiguate the address from the larger multi-building development that it's a part of:

python tests/test-nycgeo.py --addr="81-18 Langdale St, Queens" > langdale.json

So it needs to return a "list" of property recs, one for each building it thinks might be associated with that address. By itself, this wouldn't be a big problem for us -- if the data were presented smartly, e.g. as an embedded list within the larger dict struct.

Unfortunately it doesn't do that, and instead "layers" the additional property fields into the main response struct by appending an offset to each field name, as if it were some kind of big crosstab (as one in fact often sees out in spreadsheet-land).

That is, it simply stuffs the main response dict with fields like

"giStreetCode6": "45284001", "giStreetCode21": "45284001", "giStreetName14": "LANGDALE STREET", "giStreetName19": "LANGDALE STREET", "giStreetName8": "LANGDALE STREET", "giStreetCode17": "45284001", "giStreetCode5": "45284001",

and so forth, repeated over 9 different field names (all similarly prefixed by "gi"), for all 21 matching buildings (9 x 21 = 189 additional dict keys total).

Needless to say they picked a very unfortunate and frustrating way to present the data -- so our task is to refactor our proxy agent to map these keys more sensibly (i.e. as an embedded list-of-dicts struct).
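As a rough illustration of the re-shaping described above, here is a minimal sketch of such a pivot. It is not the code that ended up in the tree. The function name `pivot_gi_fields` is hypothetical, and it assumes every layered key has the shape "gi" + field name + numeric offset (as in the sample above) and that no un-suffixed field name itself ends in digits.

```python
import re
from collections import defaultdict

# Assumed key shape: "gi" + letters + numeric offset, e.g. "giStreetCode6".
_GI_KEY = re.compile(r"^(gi[A-Za-z]+)(\d+)$")


def pivot_gi_fields(response):
    """Split a flat Geoclient-style dict into (base, buildings).

    base      -- every key that does not look like an offset-suffixed gi field
    buildings -- list of dicts, one per offset, ordered by offset
    """
    base = {}
    by_offset = defaultdict(dict)
    for key, value in response.items():
        match = _GI_KEY.match(key)
        if match:
            name, offset = match.group(1), int(match.group(2))
            by_offset[offset][name] = value
        else:
            base[key] = value
    buildings = [by_offset[i] for i in sorted(by_offset)]
    return base, buildings


# Tiny made-up sample; "houseNumber" is just a placeholder for the
# ordinary, un-suffixed fields in the response.
sample = {
    "houseNumber": "81-18",
    "giStreetCode6": "45284001",
    "giStreetName6": "LANGDALE STREET",
    "giStreetCode21": "45284001",
    "giStreetCode5": "45284001",
}
base, buildings = pivot_gi_fields(sample)
```

Sorting on the integer offset (rather than the key string) keeps offset 21 after offsets 5 and 6, which a plain string sort would not.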

wstlabs commented 8 years ago

Turns out we don't need this -- the Geoclient responses are confusing, but ultimately they do provide a primary BIN in addition to the "related" BINs described above, and as best we can tell, it suffices to look only at the primary BIN (there should be no need to mess with the related BINs).

Nonetheless the code to do this pivoting is being kept in the tree (under nycgeo/utils/pivot.py I believe) in case it might be useful at some point.