openaddresses / machine

Scripts for running OpenAddresses on a complete data set and publishing the results.
http://results.openaddresses.io/
ISC License

There are a lot of addresses with blank or "0" house number #240

Open trescube opened 8 years ago

trescube commented 8 years ago

In the latest US/CA data, there are a lot of addresses with either 0 or blank house numbers. I may be wrong, but in my experience with street geocoding at MapQuest, "0" is not a valid house number in the US/CA (though I've read that "0" is valid in some European countries). Are these supposed to represent streets in general rather than a particular house number, or are these bad data?

| House Number / Country | US | CA |
| --- | --- | --- |
| (blank) | 5,274,294 | 7,042 |
| 0 | 1,457,417 | 240,970 |

geobrando commented 8 years ago

@stephenkhess I noticed this in a lot of the Georgia sources that were recently added. I would say that these are all likely pulled in directly from the source data and represent artifacts introduced by the data owners and not ones introduced by any processing done by machine. They likely represent missing street numbers and should be empty instead of 0. It's another good example of something that could be handled by some future QA.

See https://github.com/openaddresses/openaddresses/pull/1251
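
A QA check of the kind mentioned above could start as a simple tally. The following is a minimal sketch only, not part of machine: the function name `tally_house_numbers` is hypothetical, and it assumes the processed output is a CSV with a `NUMBER` column holding the house number as text.

```python
import csv
import io
from collections import Counter


def tally_house_numbers(rows, column="NUMBER"):
    """Count rows whose house number is blank, all zeros, or anything else.

    `rows` is any iterable of dicts (for example a csv.DictReader).
    The column name is an assumption about the output layout.
    """
    counts = Counter()
    for row in rows:
        # A missing column or None is treated the same as an empty string.
        value = (row.get(column) or "").strip()
        if value == "":
            counts["blank"] += 1
        elif value.strip("0") == "":
            # Only the digit zero appears, e.g. "0" or "00".
            counts["zero"] += 1
        else:
            counts["other"] += 1
    return counts


# Made-up sample rows, for illustration only.
sample = io.StringIO(
    "NUMBER,STREET\n"
    "0,MAIN ST\n"
    ",OAK AVE\n"
    "123,ELM ST\n"
    "00,PINE RD\n"
)
print(tally_house_numbers(csv.DictReader(sample)))
```

Running the same tally per source would show which sources contribute most of the zeros before deciding how to handle them.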

trescube commented 8 years ago

The top 10 states for 0 house numbers are:

| State | Count |
| --- | --- |
| ca | 541730 |
| ga | 281365 |
| tx | 192020 |
| ma | 131522 |
| al | 86555 |
| wa | 55452 |
| ok | 37299 |
| nv | 36700 |
| id | 29204 |
| fl | 25492 |

migurski commented 8 years ago

I think this is the source data. Looking at San Francisco for example, I see 1,661 rows with 0.0 ADDR_NUM values (they are stored as floating point). By way of contrast, Alameda County stores its ST_NUM values as strings, and contains no zeros.

So, I think this might be a mistake on the part of the data publisher. Should we treat numeric zeros as missing values? @stephenkhess is there enough knowledge about what’s valid in a place like California to add information to sources that would treat zeros as missing data?
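
To make the question concrete, here is a minimal sketch of what "treat numeric zeros as missing" could mean for a single value. It is an illustration under stated assumptions, not how machine handles this today: the helper name `zero_as_missing` is hypothetical, and it assumes the house number arrives either as a float (as in the San Francisco `ADDR_NUM` field) or as a string.

```python
def zero_as_missing(value):
    """Return '' for a numeric zero (0, 0.0, '0', '0.0'); otherwise return the text.

    Hypothetical helper for illustration; values that are not plain numbers,
    such as '12A' or '0A', are passed through unchanged.
    """
    if value is None:
        return ""
    text = str(value).strip()
    try:
        number = float(text)
    except ValueError:
        # Not a plain number (including the empty string): leave it as-is.
        return text
    if number == 0:
        return ""
    return text


# Made-up values, for illustration only.
for raw in [0.0, "0", "12", "0A", None]:
    print(repr(raw), "->", repr(zero_as_missing(raw)))
```

A rule like this would only be safe for sources where zero is known not to be a real house number, which is why it would need to be opted into per source rather than applied globally.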

migurski commented 8 years ago

@feomike provided some insight on this after our phone call with @iandees and Mapzen folks earlier in the fall. I’m going to open up a discussion in ops, to see if there’s any consistency of opinions on this.