pelias / schema

elasticsearch schema files and tooling
MIT License
40 stars 76 forks source link

add normalizer for keyword fields #415

Open missinglink opened 4 years ago

missinglink commented 4 years ago

cherry-picked from https://github.com/pelias/schema/pull/412 and based on https://github.com/pelias/schema/pull/414.

This PR adds a normalizer which is the nearest thing to an analyzer for keyword fields. more info here: https://github.com/elastic/elasticsearch/issues/18064

This allows us to perform some basic normalization to fields such as layer, source and category, forcing them to be lowercased and doing some ICU normalization.

One notable change here is that those fields were previously case-sensitive and will now be case-insensitive, which I think is preferable despite there being a test which was covering this behaviour.

Note that not all keyword fields should have a normalizer specified, for instance, verbatim fields such as bounding_box and addendum are probably best left with the default null normalizer.

Normalizers are applied both at index-time and at query-time.

I would like to add some additional filters such as trim and unique but they are not available until version 6.4 of elasticsearch and so will come in a subsequent PR which can be merged independently of this.

Joxit commented 4 years ago

Is this normalizer necessary? The API is already doing the lowercase transformation for layer and source (category too ?) and there are also some check for ids :thinking:

The code looks good :smile:

missinglink commented 4 years ago

Good point, I was more thinking about trying to prevent bugs by ensuring the tokens were normalized. This normalization needs to be done both at index-time and query-time and I could see bugs easily being introduced, especially in the pelias/api code.

It's not a big deal and I'd be fine with not merging this if it comes with a performance hit.

missinglink commented 4 years ago

I found this old test case super confusing because it's asserting that the keyword field is case-sensitive even though it should never be the case 🤷‍♂

[admission of guilt] it was written by me 😝

Joxit commented 4 years ago

Ha ha ha, it's ok, 4 years ago, the statute of limitations has passed :p