pelias / schema

elasticsearch schema files and tooling
MIT License

Add filter to remove leading zeros to `peliasPhrase` analyzer #475

Closed orangejulius closed 3 years ago

orangejulius commented 3 years ago

This change adds the removeAllZeroNumericPrefix filter to the peliasPhrase analyzer. The idea is to help ensure postalcodes with a leading zero can show up in autocomplete queries, as reported in https://github.com/pelias/pelias/issues/898.

The change is based on two assumptions:

I recall the original motivation for removing leading zeros is that we sometimes see street names like 05th avenue in various data sources (or possibly queries), and we want to allow that to match on 5th avenue.

The original code to do this seems to have been written pretty long ago so it's hard to say for sure.

Anyway, assuming we do want to handle those cases, it seems like removing leading zeros everywhere will let us handle postalcodes that start with zero. There are downsides, of course: the leading zeros are ignored completely, so we can't distinguish between 01000 and 1000, either of which might be a valid postalcode or housenumber. This can lead to clearly incorrect results, like 1000 main street matching a request for a hypothetical 01000 postalcode. But I think it's the best we can do without a bunch more work.
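To make the tradeoff concrete, here is a rough Python sketch of the behaviour described above. It is not the actual Elasticsearch filter: `strip_leading_zeros` and `analyze` are hypothetical helpers, and the sketch simply assumes the filter strips leading zeros from each token after lowercasing and whitespace splitting.

```python
import re

# Hypothetical stand-in for the removeAllZeroNumericPrefix token filter.
# Assumption: the filter removes any run of leading zeros from a token.
LEADING_ZEROS = re.compile(r"^0+")


def strip_leading_zeros(token):
    """Remove leading zeros from a single token."""
    return LEADING_ZEROS.sub("", token)


def analyze(text):
    """Very simplified analysis chain: lowercase, split on whitespace,
    then strip leading zeros from every token."""
    return [strip_leading_zeros(token) for token in text.lower().split()]


# Street names with a zero-padded ordinal collapse to the same token.
print(analyze("05th Avenue") == analyze("5th Avenue"))  # True

# A postalcode with a leading zero is now indexed without it...
print(analyze("01000"))  # ['1000']

# ...which is also why it cannot be told apart from a housenumber.
print(analyze("01000")[0] == analyze("1000 Main Street")[0])  # True
```

The last comparison is the downside in miniature: once both sides of the match lose their leading zeros, `01000` and `1000` are the same token.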

I tested this code with a global set of postalcodes and it does allow the relevant postalcodes to match.

Assuming no one has a better idea, we can move forward with testing this PR on a full planet build and go from there.

Fixes https://github.com/pelias/pelias/issues/898

orangejulius commented 3 years ago

The full planet build looks good: our testing shows no regressions, and it does allow finding postalcodes that start with a leading zero! 🥳

Look at us go with all these schema changes lately ⏩