socrata / datasync

Desktop / Console application for updating Socrata datasets automatically.
http://socrata.github.io/datasync/
MIT License

LocationColumn limitations; Geocoding via DataSync #131

Open johnrager opened 8 years ago

johnrager commented 8 years ago

We're looking into enhancing our automated refresh process to start taking advantage of DataSync’s SDK rather than the SODA 2 library we currently rely on. We’ve run into an issue that has pretty much stopped us in our tracks, related to how geocoding of address fields is handled in the DataSync SDK. The support issue thread is: https://support.socrata.com/hc/en-us/requests/14390.

From what we understand we have two options:

  1. Use the DataSync SDK synthetic location object to build a location using street address, city, state, zip. This won’t work for many of our refreshes because they don’t necessarily fit into the strict four-field format. For example, we have datasets where street number and street name are split into two fields, or the state column is not present in the dataset and is assumed to be “NY”.
  2. Programmatically build and append a location column to the refresh CSV prior to submitting it to DataSync – this is the way our automated process currently does it. We have a format string assigned to any dataset that requires geocoding; our process combines it with each row's data to produce either a lat/long pair or an address, which is appended to the CSV.
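To make option 2 concrete, here is a minimal sketch of how a per-dataset format string could be applied to each row before the file is handed to DataSync. It is not our production code and does not use any DataSync API; the function names, the column names (`street_number`, `street_name`, `city`, `zip`), the `Location` header, and the `defaults` mechanism for an assumed state are all hypothetical and chosen only for illustration.

```python
import csv
import io


def build_location(row, template, defaults=None):
    """Fill a per-dataset format string from one CSV row (hypothetical helper)."""
    values = dict(defaults or {})  # e.g. {"state": "NY"} when the column is absent
    values.update({k: (v or "").strip() for k, v in row.items()})
    try:
        text = template.format(**values)
    except KeyError:
        # The template names a column that is neither in the row nor in defaults.
        return ""
    return " ".join(text.split())  # collapse runs of whitespace


def append_location_column(src, dst, template, defaults=None, column="Location"):
    """Copy a CSV from src to dst, appending one computed location column."""
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames) + [column]
    writer = csv.DictWriter(dst, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column] = build_location(row, template, defaults)
        writer.writerow(row)


# Usage: street number and name are split, and the state is assumed to be NY.
src = io.StringIO("street_number,street_name,city,zip\n12,Main St,Albany,12207\n")
dst = io.StringIO()
append_location_column(
    src, dst,
    "{street_number} {street_name}, {city}, {state} {zip}",
    defaults={"state": "NY"},
)
print(dst.getvalue())
```

The template approach is what lets us cover datasets that do not fit the four-field street/city/state/zip layout: the same code handles split street fields or a missing state column by changing only the format string and defaults.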

Because of the limitation we ran into with option 1, we’ve been pursuing option 2 but have run into a problem. It appears DataSync is much stricter with its geocoding and we’ve run into addresses that have actually caused the entire refresh process to fail. If we run the same data through either the web interface or through our existing SODA 2 refresh process, the entire refresh runs but some rows just don’t get geocoded. This is expected. If we run the file through DataSync, it fails completely as soon as it hits the first bad address.

We tried testing via DataSync with “Set aside errors” turned on and the process completed, but the problem rows were excluded from the dataset. This isn’t workable from our perspective. We can’t have rows missing just because an address didn’t geocode, and with the number of datasets we have we can’t distribute problem reports to data owners asking them to correct addresses and resubmit. We need DataSync to handle geocoding just like the web interface and SODA 2 do.

We’d really like to make DataSync more of a part of our operation, but we don’t think we can unless we have a more workable way to handle geocoding. We’re pretty much dead in the water on this right now.

johnrager commented 8 years ago

Would like to add a thought on this that might get us and other customers just the flexibility we need: Add another switch "Ignore geocoding failures" to the GUI and SDK governing whether the inability to geocode an address should be considered an "error" or not. If set "on", then just set the Location column for that row to null and continue. If set "off", then treat it as an error and let "Set aside errors" govern what to do next.
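The decision logic being proposed could look roughly like the sketch below. This is only an illustration of the requested behavior, not DataSync code: `process_rows`, `GeocodeError`, the `geocode` callable, and both flag names are hypothetical stand-ins, and the real SDK may structure this very differently.

```python
class GeocodeError(Exception):
    """Raised by the (hypothetical) geocoder when an address cannot be resolved."""


def process_rows(rows, geocode, ignore_geocoding_failures, set_aside_errors):
    """Apply the proposed switch to each row.

    Returns (accepted_rows, set_aside_rows). Raises GeocodeError when a row
    fails to geocode and neither flag allows the job to continue.
    """
    accepted, set_aside = [], []
    for row in rows:
        try:
            location = geocode(row["address"])
        except GeocodeError:
            if ignore_geocoding_failures:
                # Proposed "on" behavior: keep the row, leave Location null.
                location = None
            elif set_aside_errors:
                # Existing behavior with "Set aside errors": drop the row.
                set_aside.append(row)
                continue
            else:
                # Existing behavior otherwise: the whole job fails.
                raise
        accepted.append({**row, "location": location})
    return accepted, set_aside
```

With the new switch on, a bad address never reaches the "Set aside errors" branch, so no rows go missing; with it off, today's behavior is unchanged.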

levyj commented 8 years ago

We do not use Socrata geocoding much so I do not necessarily have too much of a stake in this but I like that suggestion.

Where I do have a stake is to ask that Socrata be careful about any new features breaking existing processes or workflows. Sometimes, when flags have changed before, it has been in ways that were not fully backwards compatible.