openaddresses / openaddresses-ops

Issues-only repo for discussion of operational considerations for OA
Acceptance tests for sources to ensure continuity of valid data feeds #9

Open riordan opened 8 years ago

riordan commented 8 years ago

Sometimes source data feeds break. With ESRI servers, it's pretty common for a layer, identified by a number, to have its number reassigned, after which invalid address data comes back.

It would be great if each source configuration could include a few acceptance tests for its data source: known addresses that should be in the dataset on every run. If any of them are missing or malformed, the source should be investigated.

[Please feel free to rewrite this ticket once a stronger specification is agreed upon]

migurski commented 8 years ago

I like this idea. My first thought is that an implementation could take the form of a set of named expected outputs with low-precision geolocations, e.g.:

{
  …
  "expected": {
    "city hall":
    {
      "NUMBER": "1",
      "STREET": "Dr Carlton B Goodlett Place",
      "POSTCODE": "94102",
      "LAT": 37.7793,
      "LON": -122.4188
    },
    "one market":
    {
      "NUMBER": "1",
      "STREET": "Market Street",
      "POSTCODE": "94105",
      "LAT": 37.7939,
      "LON": -122.3949
    }
  }
}
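
As a rough illustration of how an "expected" block like the one above might be checked, here is a minimal Python sketch. It assumes the processed output is available as a list of dicts keyed by the same column names; the function names `find_missing` and `row_matches` and the 0.001-degree tolerance are hypothetical choices, not part of any existing OpenAddresses code.

```python
def row_matches(row, want, tolerance):
    """Return True if an output row satisfies every field of one expected entry."""
    for key, value in want.items():
        if key in ("LAT", "LON"):
            # Coordinates are compared loosely, since the expectations are low-precision.
            try:
                got = float(row.get(key, ""))
            except (TypeError, ValueError):
                return False
            if abs(got - value) > tolerance:
                return False
        elif str(row.get(key, "")).strip().lower() != str(value).strip().lower():
            # Text fields are compared case-insensitively, ignoring outer whitespace.
            return False
    return True


def find_missing(rows, expected, tolerance=0.001):
    """Return the names of expected addresses that no output row matches."""
    return [
        name
        for name, want in expected.items()
        if not any(row_matches(row, want, tolerance) for row in rows)
    ]
```

A non-empty result from `find_missing` would flag the source for investigation. The tolerance is in degrees, so 0.001 corresponds to roughly a city block; the right value would need to be agreed on.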

NelsonMinar commented 8 years ago

I like the idea too! I was going to suggest something even simpler, like "has more than 1000 rows" and "has an address with Market Street in the name". @migurski's proposal allows for more precise tests but requires more typing.
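
The two simpler checks mentioned here could look something like the following sketch. It again assumes rows are dicts with a `STREET` key; `simple_checks` and its parameters are hypothetical names used only for illustration.

```python
def simple_checks(rows, min_rows=1000, street_contains="Market Street"):
    """Run two coarse sanity checks and return a list of failure messages."""
    failures = []
    # "has more than 1000 rows": the count must strictly exceed min_rows.
    if len(rows) <= min_rows:
        failures.append(f"expected more than {min_rows} rows, got {len(rows)}")
    # "has an address with Market Street in the name": case-insensitive substring match.
    needle = street_contains.lower()
    if not any(needle in str(row.get("STREET", "")).lower() for row in rows):
        failures.append(f"no address with {street_contains!r} in the street name")
    return failures
```

An empty list means both checks passed. Checks like these are cheap to write per source, at the cost of missing subtler breakage such as shifted columns.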

riordan commented 8 years ago

One other possibility (though it wouldn't work for all sources) is checking whether a source has a "Full Address" field and, if so, using it to test whether all the components we've identified are present.
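
A minimal sketch of that full-address check, assuming each row carries the source's combined address under a hypothetical `FULL_ADDRESS` key alongside the parsed components (the real field name would vary by source):

```python
def missing_components(row, full_field="FULL_ADDRESS",
                       components=("NUMBER", "STREET", "POSTCODE")):
    """Return the parsed components that do not appear in the row's full address."""
    full = str(row.get(full_field) or "").lower()
    missing = []
    for key in components:
        value = str(row.get(key) or "").strip().lower()
        # Empty components are skipped here; treating them as failures is another option.
        if value and value not in full:
            missing.append(key)
    return missing
```

Substring matching is deliberately crude: a house number of "1" would also match inside a postcode such as "94105", so a stricter version might compare whole tokens instead.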