openaddresses / machine

Scripts for running OpenAddresses on a complete data set and publishing the results.
http://results.openaddresses.io/
ISC License

Handle case where pre- and post-directional are the same #784

Open missinglink opened 3 years ago

missinglink commented 3 years ago

Heya,

I noticed a street name in the San Diego file today, "S 39TH ST S", which has the "South" directional added twice:

cat us_ca_san_diego-addresses-county.geojson \
  | grep 'S 39TH ST S' \
  | jq '.properties.street'

"S 39TH ST S"

It seems that the error is caused by the source data including both pre (addrpdir) and post (addrpostd) directional columns with the value 'S':

ogr2ogr -f CSV /vsistdout/ addrapn_datasd.dbf \
  | xsv search -s 'objectid' '854155' \
  | xsv table

objectid  addrnmbr  addrfrac  addrpdir  addrname  addrpostd  addrsfx  addrunit  addrzip  add_type  roadsegid  apn         asource  plcmt_loc  community  parcelid  usng
854155    1261                S         39TH      S          ST                 92113              0          5512003800  M        C          SAN DIEGO  11648     11S MS 89683 17286

Would it be possible to add a check in machine which only adds one of these values to the street field when both are present?

iandees commented 3 years ago

🤔 Are these one-offs in the data set? Maybe we should ask the county to fix the data?

missinglink commented 3 years ago

It's definitely uncommon in OA, at least I've never noticed it before. Within this one file, though, it happens a lot:

ogr2ogr -f CSV /vsistdout/ addrapn_datasd.dbf \
  | awk -F, '{ if($4 && $4==$6) {print $0}  }' \
  | xsv count

3595
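The awk filter above splits on raw commas, so a quoted field that contains a comma would shift the column positions. A rough Python equivalent using the csv module is sketched below. It assumes the DBF has already been exported to CSV text with the addrpdir and addrpostd headers shown earlier; the function name and all sample rows other than objectid 854155 are made up for illustration.

```python
import csv
import io


def count_matching_directionals(csv_text, pre="addrpdir", post="addrpostd"):
    """Count rows whose pre- and post-directional are non-empty and equal.

    Hypothetical helper for illustration; not part of machine.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    count = 0
    for row in reader:
        pre_val = (row.get(pre) or "").strip()
        post_val = (row.get(post) or "").strip()
        if pre_val and pre_val == post_val:
            count += 1
    return count


# Made-up sample: only the first data row mirrors the record shown above.
sample = (
    "objectid,addrpdir,addrname,addrpostd,addrsfx\n"
    "854155,S,39TH,S,ST\n"
    "1,,MAIN,,ST\n"
    '2,N,"OAK, THE",S,AVE\n'
)
print(count_matching_directionals(sample))
```

Selecting columns by header name rather than by position also keeps the check working if the column order in the export changes.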

Looking at the source, it could also be that addrpdir isn't what we think it is? The post field is named addrpostd, I would expect the pre to be called addrpred but it's called addrpdir 🤷‍♂️.

missinglink commented 3 years ago

It might still be a good idea to add some logic in machine to catch this.

I think whenever the pre and post directional are identical it should always be considered an error? Only one directional string should be added to the street string in this case.

[edit] If I were to choose which one, I'd favour keeping the post since it's much easier for consumers of the data to detect post-directionals than pre-directionals.
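A minimal sketch of that rule, assuming the street is assembled in the order pre-directional, name, suffix, post-directional (which matches the "S 39TH ST S" output). The build_street helper is hypothetical and is not how machine's conform step is actually implemented.

```python
def build_street(pre, name, suffix, post):
    """Join street parts, keeping only the post-directional when the
    pre- and post-directional are identical.

    Hypothetical helper for illustration; not machine's real code.
    """
    pre, name, suffix, post = (
        (part or "").strip() for part in (pre, name, suffix, post)
    )
    # Identical directionals are treated as a data error: drop the pre.
    if pre and pre.upper() == post.upper():
        pre = ""
    # Skip empty parts so no stray spaces are produced.
    return " ".join(part for part in (pre, name, suffix, post) if part)


print(build_street("S", "39TH", "ST", "S"))  # 39TH ST S
print(build_street("N", "MAIN", "ST", ""))   # N MAIN ST
```

Rows where the two directionals differ are left untouched, so a legitimate value with both a pre- and a post-directional would still come through in full.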

missinglink commented 3 years ago

FWIW there are other logical errors in the San Diego geojson file, also because the source file is messy.

One thing I noticed is that machine inserts a space when the field is empty, so in these cases where there is no addrsfx we see a double space.

cat us_ca_san_diego-addresses-county.geojson \
  | jq -r '.properties.street' \
  | grep -E '^[NSEW]\s.{1,3}\s\s[NSEW]$'

W E  W
W E  W
E AVE  E
W E  W
W E  W
E AVE  E
E AVE  E
W E  W
E AVE  E
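One way a double space like this can arise is joining every field with a single space without first dropping the empty ones. The sketch below contrasts the two approaches; both function names are hypothetical, and the naive join is only an assumption about what produces the output above, not a confirmed description of machine's code.

```python
def naive_join(parts):
    """Join every part with a space, including empty ones."""
    return " ".join(parts)


def join_nonempty(parts):
    """Join only the parts that contain non-whitespace text."""
    return " ".join(p.strip() for p in parts if p and p.strip())


# pre-directional, name, (empty) suffix, post-directional
parts = ["E", "AVE", "", "E"]
print(repr(naive_join(parts)))     # 'E AVE  E'
print(repr(join_nonempty(parts)))  # 'E AVE E'
```

For strings that have already been written out, `" ".join(street.split())` collapses any run of whitespace to a single space.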