pelias / openaddresses

Pelias import pipeline for OpenAddresses.

add improved algorithm for street name normalization #477

Closed Ā· missinglink closed this 3 years ago

missinglink commented 3 years ago

For a long time we've had issues with OpenAddresses data being provided in a mix of lettercasings and abbreviation styles.

Looking at the NYC file, for instance, there is still a lot of work we need to do on the data to bring it up to a level comparable with other providers:

56 PL                                            |  56 Place
VETERANS AVE                                     |  Veterans Avenue
WASHINGTON SQ  S                                 |  Washington Square South
CLARKSON AVE                                     |  Clarkson Avenue
E  241 ST                                        |  East 241 Street
MATTHEWS AVE                                     |  Matthews Avenue
FORDHAM ST                                       |  Fordham Street
E  139 ST                                        |  East 139 Street

We have an existing algorithm which works fairly well, but notably it doesn't handle abbreviation expansion; as a result, we often have places showing up twice in search, once in the abbreviated form and once in the expanded form:

[screenshot, 2021-01-26: the same place appearing twice in search results, once abbreviated and once expanded]

This is annoying for several reasons:

So I've had a crack at a few different algorithms, in the end settling on one which is powerful yet fairly safe. I'm going to start with English-speaking regions first and hope to provide a framework that can be expanded to other languages later.

The process itself is fraught with dangers, which we will need to test for before we're confident that this isn't going to make things uglier in places.

Some considerations I've made:

From here I'd like to move to a testing phase where I test this against a large US area and see what errors I can catch before we look at merging.

orangejulius commented 3 years ago

Cool, overall this looks great. If the example addresses you gave are representative of the actual changes from this PR, it's probably in good shape.

Generating similar output for a large part, or all, of the US is a perfect next step. šŸ‘

missinglink commented 3 years ago

I've pushed another commit which adds the ability to test the old analysis and the new analysis side-by-side:

[screenshot, 2021-01-27: side-by-side output of the old and new analysis]

Working off the feedback I got from that, I've gone and made the code/dictionaries more conservative, while still covering a good 90%+ of the cases (I think!?).

The main differences from the last version, all in the name of reducing potential error:

  1. The isLikelyAbbreviation() method now returns false by default instead of true. This is closer to the old behaviour and means that tokens like MLK become Mlk; that isn't ideal, but it's what we were doing before šŸ¤·ā€ā™‚ļø (see the sketch after this list).
  2. I've stopped using the libpostal street_types dictionary; it's just too much for what we want and has the potential to cause a bunch of unforeseen issues, so I replaced it with one defined by the USPS, which we use elsewhere.
  3. Pared down the other dictionaries to remove a bunch of things I've never heard of.
  4. Added a bunch more test cases.
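
A minimal sketch of the default-false idea from point 1 (the dictionary contents here are illustrative, not the PR's actual lists):

```js
// Conservative predicate: a token is only treated as an abbreviation when it
// appears in a known, hand-checked list; everything else returns false.
const KNOWN_ABBREVIATIONS = new Set(['AVE', 'BLVD', 'PL', 'RD', 'SQ', 'ST']);

function isLikelyAbbreviation(token) {
  return KNOWN_ABBREVIATIONS.has(token.toUpperCase());
}

console.log(isLikelyAbbreviation('AVE')); // true
console.log(isLikelyAbbreviation('MLK')); // false: unlisted, so it falls through to default casing (Mlk)
```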

The side-by-side diff can be run like so:

tail -n +2 /data/oa/us/ny/city_of_new_york.csv \
  | cut -d',' -f4 \
  | node test/analysis.js
echo "30 w 26 st" | node test/analysis.js
missinglink commented 3 years ago

@orangejulius I would love to have more visibility on this feature. I was going to store the original STREET field in the addendum, but that would probably blow up the index a fair bit, huh, since there are so many OA records?

I also considered logging at index time, but those logs will either be super annoying (on a high log level) or ignored (on a lower log level).

What do ya think: should we store the original in the addendum for quick debugging? I was considering only storing it in certain cases, such as when the character count changed, but even then it would be a lot of data.

orangejulius commented 3 years ago

Storing this data in the addendum would probably not work for indices we want to get good performance out of. There's no reason we can't generate one for testing if it might be useful though.

Logging also isn't super useful; like you said, we'd probably ignore it. Personally I think what you have now, where we can generate the logs manually if needed, is best.

Joxit commented 3 years ago

This works pretty well for English but will be a bit more difficult for other languages šŸ¤”

I know only one tricky example and it's working: St Patrick St

missinglink commented 3 years ago

Yeah, Mount St John Avenue is also difficult. I think there needs to be some logic which first checks how bad the source data is before applying a rule such as "expand max one street suffix abbreviation".

orangejulius commented 3 years ago

@missinglink I think you mentioned it already, but does this code have a concept of "abbreviations only found at the end" and likewise "abbreviations not found at the end" of a street?

For example, st anywhere but the end of a street name is likely Saint. At the end, it's Street. ave, on the other hand, is Avenue everywhere.

Another important exception to consider: the east-west letter streets of Washington, DC. S Street there should not be expanded to South Street, for example.
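
A minimal sketch of that positional idea, assuming hypothetical dictionaries (this is not the PR's code). Note that the single-letter guard protecting DC's letter streets would also stop E 241 St becoming East 241 Street, which is exactly the kind of trade-off at play:

```js
// Expansion depends on token position: the final token uses the suffix
// dictionary, earlier tokens use the prefix dictionary (st -> Saint).
const FINAL_EXPANSIONS = { st: 'Street', ave: 'Avenue' };
const NON_FINAL_EXPANSIONS = { st: 'Saint', mt: 'Mount' };

function expandToken(token, isLast) {
  // Guard: never expand single-letter tokens, so DC's "S Street" is not
  // mangled into "South Street".
  if (token.length === 1) return token;
  const dict = isLast ? FINAL_EXPANSIONS : NON_FINAL_EXPANSIONS;
  return dict[token.toLowerCase()] || token;
}

function expandStreet(name) {
  const tokens = name.split(/\s+/);
  return tokens.map((t, i) => expandToken(t, i === tokens.length - 1)).join(' ');
}

console.log(expandStreet('St Patrick St')); // Saint Patrick Street
console.log(expandStreet('S Street'));      // S Street (left alone)
```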

missinglink commented 3 years ago

Yeah, St Patrick St is currently normalised to Saint Patrick Street, which is correct... although I'd probably prefer St Patrick Street; that's a matter of preference.

I think the first version of this feature we release will only target suffix abbreviations in the last token position; this will cover a huge number of cases and make a huge difference to results while also being fairly safe.

Some examples of where that won't help:

We can likely adopt a similar approach for directionals, where we only touch them in the first position, although I'm also happy to punt on this for the first version, as IMO directionals look better contracted and that seems more in line with what I'm seeing from other data sources/services šŸ¤·ā€ā™‚ļø
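
A rough sketch of that contracted preference (the dictionary and the more-than-two-tokens guard are assumptions, not the PR's code):

```js
// Contract a directional only in the first token position, and only when the
// name has more than two tokens: in "North Street" the directional *is* the
// specific part of the name, so it must be left alone.
const DIRECTIONAL_CONTRACTIONS = { north: 'N', south: 'S', east: 'E', west: 'W' };

function contractLeadingDirectional(name) {
  const tokens = name.split(/\s+/);
  const contracted = DIRECTIONAL_CONTRACTIONS[tokens[0].toLowerCase()];
  if (contracted && tokens.length > 2) tokens[0] = contracted;
  return tokens.join(' ');
}

console.log(contractLeadingDirectional('North Foo Street')); // N Foo Street
console.log(contractLeadingDirectional('North Street'));     // North Street
```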

missinglink commented 3 years ago

My learnings from spending a day or two on this:

It's helpful to understand how the source data is cased and how heavily it's contracted.

The nz/countrywide.csv file can be left as-is, since it's lettercased correctly and expanded/contracted in an aesthetically pleasing, consistent way. The us/ny/city_of_new_york.csv file, on the other hand, is all uppercase and heavily contracted. These files should be treated differently, or at the very least, running the algorithm on nz/countrywide.csv should result in a no-op.
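
One way to express that no-op check, as a sketch (the 50% threshold is an arbitrary assumption):

```js
// Sample the street values and only normalise files that are mostly uppercase;
// well-cased files like nz/countrywide.csv then pass through untouched.
function needsNormalization(streets, threshold = 0.5) {
  const upper = streets.filter((s) => s === s.toUpperCase() && s !== s.toLowerCase());
  return upper.length / streets.length >= threshold;
}

console.log(needsNormalization(['CLARKSON AVE', 'FORDHAM ST']));      // true  (NYC-style)
console.log(needsNormalization(['Queen Street', 'Mount Eden Road'])); // false (NZ-style, no-op)
```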

It's impossible to correctly lettercase data which comes all uppercase

The reality is that MLK is always going to be incorrectly cased as Mlk unless we maintain a dictionary of such abbreviations ('MC' is also an issue here). I experimented with attempting to detect abbreviations, but even something like treating short, vowel-less inputs as abbreviations will leave less common street suffixes such as 'PY' in uppercase, which looks terrible.
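
A tiny sketch of why a curated dictionary beats a heuristic here (the exception list is hypothetical and would need to be maintained by hand):

```js
// Titlecase everything except tokens on a hand-maintained exception list.
// A "short and vowel-less" heuristic would keep MLK uppercase but would also
// keep 'PY' (Parkway) uppercase, so the list has to be explicit.
// ('MC' prefixes, as in MCDONALD -> McDonald, are a further headache not handled here.)
const UPPERCASE_EXCEPTIONS = new Set(['MLK']);

function caseToken(token) {
  if (UPPERCASE_EXCEPTIONS.has(token)) return token;
  return token.charAt(0).toUpperCase() + token.slice(1).toLowerCase();
}

console.log(caseToken('MLK')); // MLK
console.log(caseToken('AVE')); // Ave
console.log(caseToken('PY'));  // Py (still wrong without a dictionary entry)
```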

Abbreviations don't always appear in token positions you expect them to appear in

It would be really interesting to run a classifier against a large corpus (such as the USA) to detect naming patterns; presumably there are very few. I'd guess that the 'standard pattern' we think of for US addresses, {specific} {generic}, ie. Foo St, represents about 50% of the streets in the USA. Other patterns such as {directional} {specific} {generic} and {specific} {generic} {directional}, ie. North Foo Street, likely represent a good percentage. From there it's limited gains: the risk of mangling the name increases with patterns like {specific|prefix} {generic} {specific|prefix}, ie. St Patricks St, all the way down to things like {title} {specific} {specific} {title} {title} {generic}, ie. DR M L KING JR BLVD.

That's all a long way of saying that covering the top 10 most common patterns would make a huge impact, and ambiguous patterns could be skipped completely. This seems to be what the big mapping companies are doing: if you search the difficult patterns you'll see they don't attempt to expand them, whereas the 'easy' patterns are all in a normalised form. IMO this is where this feature should go in subsequent iterations.
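
A hypothetical sketch of that classifier idea, with tiny illustrative dictionaries:

```js
// Map each token to a coarse class, join the classes into a pattern string,
// and count pattern frequencies over a corpus; frequent unambiguous patterns
// are safe to normalise, while rare or ambiguous ones can be skipped.
const DIRECTIONALS = new Set(['n', 's', 'e', 'w', 'north', 'south', 'east', 'west']);
const GENERICS = new Set(['st', 'street', 'ave', 'avenue', 'rd', 'road', 'blvd']);

function classify(token) {
  const t = token.toLowerCase();
  if (DIRECTIONALS.has(t)) return '{directional}';
  if (GENERICS.has(t)) return '{generic}';
  return '{specific}';
}

const counts = new Map();
for (const name of ['FOO ST', 'N FOO ST', 'ST PATRICKS ST']) {
  const pattern = name.split(/\s+/).map(classify).join(' ');
  counts.set(pattern, (counts.get(pattern) || 0) + 1);
}
console.log(counts);
// Map {
//   '{specific} {generic}' => 1,
//   '{directional} {specific} {generic}' => 1,
//   '{generic} {specific} {generic}' => 1   <- the ambiguous "St Patricks St" case
// }
```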

Libpostal dictionaries are very verbose

In many cases the libpostal dictionaries require significant manual pruning, and care must be taken when using them. One example is center|centre, which will incorrectly anglicise/de-anglicise names. Another is Northe Street, which would be incorrectly normalised to North Street; similarly, things like ovl|oval and attempts to pluralise, such as parks|park, cause issues. There are also many ambiguous expansions in the libpostal dictionaries where we either need to choose a preference, such as between line|ln and lane|ln, or just delete them entirely to avoid error, ie. north|no, which would mangle No 1 St.
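
As a sketch, the pruned replacement could be as simple as a small hand-checked map where each abbreviation has exactly one preferred expansion and risky entries are simply absent (contents illustrative, not the PR's actual dictionary):

```js
// One preferred expansion per abbreviation; ambiguous or dangerous entries
// (north|no, ovl|oval, parks|park) are deliberately left out.
const SUFFIXES = new Map([
  ['st', 'Street'],
  ['ave', 'Avenue'],
  ['ln', 'Lane'], // preference chosen over 'Line'
  ['blvd', 'Boulevard'],
]);

console.log(SUFFIXES.get('ln')); // Lane
console.log(SUFFIXES.get('no')); // undefined: "No 1 St" is left alone by design
```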

Testing is key

The only real way of catching errors is to run the code against multiple different OA files from regions with the same general conventions but at different quality levels. Using this approach I caught a lot of unforeseen issues, many of them mentioned above.

and finally...

Have a clear idea of what you'd like the output to look like

Originally I set out to contract all the tokens; this seemed to be the best approach since it would produce fewer errors compared to expanding tokens. What I found very quickly is that the output just looks crap: it's very terse and doesn't look good from a product perspective. A better option is to find a file, like nz/countrywide.csv, which has some aesthetically pleasing conventions, and then decide what the goal is. For instance, I feel like expanding the generic portion of the street (the 'suffix') is better, whereas contracting directionals is more aesthetically pleasing (while also being less error-prone). I'm still not 100% sure how I feel about 'Mt' (mount), 'Ft' (fort), 'St' (saint), 'Dr' (doctor) etc; I'm leaning towards contracting them.

It's actually a very difficult problem to solve, even more so when it comes to other languages/geographies. For now I'm going to try to target only the most common patterns and err on the side of caution.

missinglink commented 3 years ago

OK, this is ready for wider testing. I don't intend to add any more commits other than tests or fixes for reported errors.

missinglink commented 3 years ago

Rebased onto origin/master to get tests passing on GH Actions instead of Travis.

missinglink commented 3 years ago

As a note from code review: we noticed that the "remove leading 0 from housenumber" functionality is still present, although it only applies to the street field, which seems odd.

Turns out it's ok: https://github.com/pelias/openaddresses/pull/26