openaddresses / openaddresses

A global repository of open address, building, and parcel data.
http://openaddresses.io/
BSD 3-Clause "New" or "Revised" License

discussion: housenumber extraction regexen #2075

Open missinglink opened 8 years ago

missinglink commented 8 years ago

hey all, I came across this issue when investigating https://github.com/openaddresses/openaddresses/issues/2070

it seems like most of the regexen (is that the British form of regexes? :)) don't support alphanumeric house numbers (eg. 1a) or address ranges (eg. 1-10).

the most common form of regex seems to be ^([0-9]+) which would probably be better off written as something like ^([0-9]+[a-zA-Z]?|[0-9]+-[0-9]+).

as a result the street regi can end up doing unexpected things such as this:

-74.035221,40.7429476,416-18 GRAND ST,416-18 Grand Street,,,,,,,

I can go ahead and fix it for US/NJ but it seems to also affect a bunch of other files.

to match allthethings we could use the following, which should leave the street name unaffected in the case where a house number match was not possible:

"416-18 GRAND ST".match(/^(([0-9]+[a-zA-Z]?|[0-9]+-[0-9]+)\s+)(.*)$/)
["416-18 GRAND ST", "416-18 ", "416-18", "GRAND ST"]

the matches would then always be $2 for housenumber and $3 for street name
note: the ordering would need to be flipped for Germanic addresses
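For anyone who wants to try the proposal outside a browser console, here is a minimal Python sketch of the same pattern. The function name `split_address` and the sample strings (other than "416-18 GRAND ST") are assumptions made for illustration; this is not code from the machine library.

```python
import re

# Python port of the pattern from the JavaScript example above.
# Group 2 is the house number, group 3 is the street name.
HOUSENUMBER_STREET = re.compile(r"^(([0-9]+[a-zA-Z]?|[0-9]+-[0-9]+)\s+)(.*)$")


def split_address(value):
    """Return (housenumber, street); housenumber is None when nothing matched."""
    match = HOUSENUMBER_STREET.match(value)
    if match is None:
        # No leading house number: leave the street name untouched.
        return None, value
    return match.group(2), match.group(3)


for sample in ["416-18 GRAND ST", "1a High St", "10 Main St", "GRAND ST"]:
    print(sample, "->", split_address(sample))
```

The trailing `\s+` is what forces the range alternative to win for "416-18": the first alternative matches "416" but then fails on the hyphen, so the engine backtracks into the second one. Fractions such as "5 1/2" are not covered by this pattern.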

thoughts?

$ find sources -type f -iname "*.json" | xargs grep \"pattern\" | cut -d\" -f4 | sort | uniq -c | sort -n -r
    376 ^(?:[0-9]+ )(.*)
    311 ^([0-9]+)
     67 ^([0-9]+)( .*)
     49 ^.* ((Unit|Apt) [0-9A-Za-z])$
     16 ^(?:\\S+ )(.*)
     15 ^(\\S+)
      8 ([0-9]*) (.*)
      6 function
      5 ^(?:[0-9]+ )?(.*?),(?: )?(|[^,]+),?(?: )(.*?),(?: )(.*)(?: )([0-9]+)$
      5 ^(?:[0-9]+) (.*)
      3 ([0-9]+)(.*)~(.*), IL ([0-9]+)
... etc

migurski commented 8 years ago

Maybe this is an opportunity for a new function in the core tag set that's a pre-baked, reliable regex for street numbers?

albarrentine commented 8 years ago

Depends on whether you want it to work absolutely everywhere or not. There are many edge cases around the world that differ from those listed above:

My general broken record advice ("just add libpostal!") notwithstanding, a decently inclusive regex for extracting house numbers in countries using the Anglo-American format would be:

^[\s]*((?:[0-9]+[\s]*(?:1\/[234]|2\/3|3\/4|[¼½¾⅓⅔]))|(?:[0-9]+(?:\/?[a-zA-Z])?(?:[\-\/][0-9]+(?:\/?[a-zA-Z]?))*)(?: bis)?|(?:[\d]+(\-[\d]+)*))[^\d](?:.*)$

As written, this regex needs to be compiled with the re.UNICODE flag if using Python, and the input string needs to be Unicode as well. It handles all of the cases listed above as well as numbers with a few common fractional components ("5 1/2" or "5½"), and it can still match in cases where the street is delimited from the number with something other than whitespace, e.g. "5,Main Street" or the like. It also handles the French term "bis", which is sort of like saying "123B" and comes up fairly often in French-speaking countries. There may be a few more phrases like that, but that's the only one that's coming to mind at the moment.

There are still plenty of addresses that a regex, any regex, can get wrong, e.g. in the case of something like "8 street" it would still match the "8" (a statistical tagger like libpostal can handle most of those cases using the surrounding context).
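A rough Python sketch of how the Anglo-American pattern above could be exercised. The pattern is copied verbatim and only split across lines; the wrapper name `extract_housenumber` and the sample inputs are assumptions for illustration. In Python 3, `str` patterns are already Unicode-aware, so the `re.UNICODE` flag is redundant there and kept only to mirror the comment.

```python
import re

# Pattern copied from the comment above; group 1 is the house number.
ANGLO_HOUSENUMBER = re.compile(
    r"^[\s]*((?:[0-9]+[\s]*(?:1\/[234]|2\/3|3\/4|[¼½¾⅓⅔]))"
    r"|(?:[0-9]+(?:\/?[a-zA-Z])?(?:[\-\/][0-9]+(?:\/?[a-zA-Z]?))*)(?: bis)?"
    r"|(?:[\d]+(\-[\d]+)*))[^\d](?:.*)$",
    re.UNICODE,
)


def extract_housenumber(address):
    """Return the leading house number, or None if the pattern does not match."""
    match = ANGLO_HOUSENUMBER.match(address)
    return match.group(1) if match else None


samples = [
    "416-18 GRAND ST",
    "1a Main St",
    "5 1/2 Main St",
    "5½ Main St",
    "5,Main Street",
    "12 bis Rue de la Paix",
    "8 street",       # known false positive: "8" is really the street name
    "Main Street",    # no leading number, so no match
]
for sample in samples:
    print(repr(sample), "->", repr(extract_housenumber(sample)))
```

Note that the pattern contains a second, nested capture group for the range suffix, so callers should read group 1 rather than the last group.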

For most of the rest of the world where house number comes after street name, the reverse should also work reasonably well (same caveats as above i.e. will match the "8" in "Calle 8"):

^(?:.*?)[^\d]((?:[0-9]+[\s]*(?:1\/[234]|2\/3|3\/4|[¼½¾⅓⅔]))|(?:[0-9]+(?:\/?[a-zA-Z])?(?:[\-\/][0-9]+(?:\/?[a-zA-Z]?))*)(?: bis)?|(?:[\d]+(\-[\d]+)*))[\s]*$
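The street-first variant can be tried the same way. This is a sketch under the same assumptions as before: the pattern is copied verbatim from the comment, while `extract_trailing_housenumber` and the sample addresses are made up for illustration.

```python
import re

# Pattern copied from the comment above; group 1 is the trailing house number.
STREET_FIRST_HOUSENUMBER = re.compile(
    r"^(?:.*?)[^\d]((?:[0-9]+[\s]*(?:1\/[234]|2\/3|3\/4|[¼½¾⅓⅔]))"
    r"|(?:[0-9]+(?:\/?[a-zA-Z])?(?:[\-\/][0-9]+(?:\/?[a-zA-Z]?))*)(?: bis)?"
    r"|(?:[\d]+(\-[\d]+)*))[\s]*$",
    re.UNICODE,
)


def extract_trailing_housenumber(address):
    """Return the house number at the end of the string, or None if absent."""
    match = STREET_FIRST_HOUSENUMBER.match(address)
    return match.group(1) if match else None


samples = [
    "Hauptstraße 12a",
    "Musterweg 5-7",
    "Rue de la Paix 12 bis",
    "Calle 8",        # known false positive: "8" is part of the street name
    "Main Street",    # no trailing number, so no match
]
for sample in samples:
    print(repr(sample), "->", repr(extract_trailing_housenumber(sample)))
```

Because the pattern requires one non-digit character before the number, a value that consists only of a number does not match at all.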

trescube commented 8 years ago

@thatdatabaseguy I've been sitting on this for a while and haven't gotten around to writing it up yet, but this seems like as good a place as any to start. Incorporating libpostal into OA is something we've been tossing around for a while, so I wrote up some tests to figure out how well it does at teasing apart house number and street for OA sources. My test data is from the San Francisco source since it has both the concatenated house number and street and those individual fields.
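The kind of comparison described here can be sketched roughly as follows. This is not the actual test code: the record layout, the `evaluate` helper, and the `regex_parser` stand-in are all assumptions for illustration, and a libpostal-backed parser would be passed in place of the stand-in. Duplicates are skipped so that repeated rows do not skew the tally.

```python
import re
from collections import Counter


def regex_parser(address):
    """Stand-in parser (an assumption): leading digits are the house number."""
    match = re.match(r"^([0-9]+)\s+(.*)$", address)
    if match is None:
        return {"road": address}
    return {"house_number": match.group(1), "road": match.group(2)}


def evaluate(parser, records):
    """Compare parser output with the separately published number/street fields."""
    outcomes = Counter()
    seen = set()
    for record in records:
        key = (record["address"], record["number"], record["street"])
        if key in seen:
            continue  # skip duplicate source rows
        seen.add(key)
        parsed = parser(record["address"])
        correct = (
            parsed.get("house_number") == record["number"]
            and parsed.get("road") == record["street"]
        )
        outcomes["correct" if correct else "incorrect"] += 1
    return outcomes


# Hypothetical rows shaped like a source with both combined and split fields.
records = [
    {"address": "2235 north point st", "number": "2235", "street": "north point st"},
    {"address": "2235 north point st", "number": "2235", "street": "north point st"},
    {"address": "875 la playa", "number": "875", "street": "la playa"},
    {"address": "1a main st", "number": "1a", "street": "main st"},
]

print(evaluate(regex_parser, records))
```

With the stand-in parser, the "1a main st" row counts as incorrect because the digits-only pattern cannot handle the letter suffix.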

Overall libpostal does pretty well: it parses ~99% correctly (I don't have actual numbers yet since the source data contains dupes). When it doesn't parse the input correctly, it mainly misidentifies the house number as a postal code or part of the road as a suburb. Here are some examples:

> 2235 north point st

Result:

{
  "house_number": "2235",
  "road": "north",
  "suburb": "point",
  "road": "st"
}

> 5 russian hill pl

Result:

{
  "house_number": "5",
  "suburb": "russian hill",
  "road": "pl"
}

> 875 la playa

Result:

{
  "house_number": "875",
  "road": "la",
  "suburb": "playa"
}

> 1338 kobbe ave

Result:

{
  "postcode": "1338",
  "road": "kobbe ave"
}

> 171 south park

Result:

{
  "house_number": "171",
  "road": "south",
  "suburb": "park"
}

trescube commented 8 years ago

For a first pass, regex-wise, I'd personally be thrilled with just a US/CA one that works 99% of the time. There are ~3,140 counties in the US alone and copying the number/street regexes around gets cumbersome, especially when trying to remember where the ones for 1/2 and letters-appended-house-number are.

albarrentine commented 8 years ago

Ah, heard that might be on the horizon.

Incidentally, OpenAddresses is also being incorporated into the next libpostal release, hopefully to handle those sorts of cases (better modeling of road names that might not be in the OSM road network, valid contexts for postal codes). There's something like 3x as much training data in OpenAddresses as in OSM, it has more coverage in certain areas, many of the addresses have postal codes, it picks up some city names/variants that differ from OSM or are currently stored as points rather than polygons, etc. OSM polygons are still used in the case of blank city names.

Hopefully that doesn't create a feedback loop. The importer has its own checks/validators so it would discard records with a house number like "416-18 GRAND ST".

missinglink commented 8 years ago

@thatdatabaseguy it might be better for you to train on the raw data received from the origin servers (ie. before any regexii have been run against them).

albarrentine commented 8 years ago

That's a good idea. So far the regexes haven't been as much of an issue since many sources are already pre-separated, and the sources that do use regexes are mostly in the US and constrained to a single county/municipality, so said regexes tend to be reasonably accurate.

@migurski is it possible to save CSV versions of the source files with the original columns? Since machine's already handling ESRI sources, shapefile parsing, etc. it might be easier to just save a second pre-conform version.

migurski commented 8 years ago

We do store a cached version of everything, is that what you mean? Check the "cache" column at https://results.openaddresses.io.

albarrentine commented 8 years ago

That's close enough. As long as it converts the ESRI sources to CSV. The rest of the cached files appear to be GeoJSON and shapefiles, which are fine.

migurski commented 8 years ago

Great!

migurski commented 7 years ago

@trescube and I spoke about this today, and we think that a good move forward would be to add a US/Canada function to the core machine library that implements some of the regexes above. He’d like to do the implementation, and I’d help get him started with a working local installation of the machine code.

migurski commented 7 years ago

New prefix_street and prefix_number conform functions are live as of Machine 3.10.0. The next step would be to document them in this repository.

trescube commented 7 years ago

I can handle that

migurski commented 7 years ago

Thanks @trescube! Thanks for your good work on the new functions!