missinglink opened 4 years ago
Is this normalizer necessary? The API is already doing the lowercase transformation for `layer` and `source` (`category` too?) and there are also some checks for ids :thinking:
The code looks good :smile:
Good point, I was thinking more about trying to prevent bugs by ensuring the tokens are normalized. This normalization needs to happen both at index-time and at query-time, and I could see bugs easily being introduced, especially in the pelias/api code.
It's not a big deal and I'd be fine with not merging this if it comes with a performance hit.
I found this old test case super confusing because it asserts that the keyword field is case-sensitive, even though that should never be the case 🤷‍♂️
[admission of guilt] it was written by me 😝
Ha ha ha, it's ok, 4 years ago, the statute of limitations has passed :p
cherry-picked from https://github.com/pelias/schema/pull/412 and based on https://github.com/pelias/schema/pull/414.
This PR adds a `normalizer`, which is the nearest thing to an `analyzer` for `keyword` fields. More info here: https://github.com/elastic/elasticsearch/issues/18064

This allows us to perform some basic normalization on fields such as `layer`, `source` and `category`, forcing them to be lowercased and applying some ICU normalization.

One notable change here is that those fields were previously case-sensitive and will now be case-insensitive, which I think is preferable despite there being a test which was covering this behaviour.

Note that not all `keyword` fields should have a normalizer specified; for instance, verbatim fields such as `bounding_box` and `addendum` are probably best left with the default `null` normalizer.

Normalizers are applied both at index-time and at query-time.
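As a rough sketch of what this looks like, the settings and mapping below define a custom normalizer and attach it to the fields mentioned above (the normalizer name `peliasNormalizer` is hypothetical, the `icu_normalizer` filter assumes the analysis-icu plugin is installed, and the exact mapping shape depends on the Elasticsearch version):

```json
{
  "settings": {
    "analysis": {
      "normalizer": {
        "peliasNormalizer": {
          "type": "custom",
          "filter": ["lowercase", "icu_normalizer"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "layer":        { "type": "keyword", "normalizer": "peliasNormalizer" },
      "source":       { "type": "keyword", "normalizer": "peliasNormalizer" },
      "category":     { "type": "keyword", "normalizer": "peliasNormalizer" },
      "bounding_box": { "type": "keyword" }
    }
  }
}
```

With something like this in place, a document indexed with `layer: "Venue"` would be stored as `venue`, and a term query for either `Venue` or `venue` would match, since the normalizer runs at both index-time and query-time; `bounding_box` is left with no normalizer and stays verbatim.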
I would like to add some additional filters such as `trim` and `unique`, but they are not available until version `6.4` of Elasticsearch, so they will come in a subsequent PR which can be merged independently of this one.
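For illustration only, on Elasticsearch `6.4`+ those filters could simply be appended to the normalizer's filter chain (again a sketch, reusing the hypothetical `peliasNormalizer` name rather than anything defined in this PR):

```json
{
  "settings": {
    "analysis": {
      "normalizer": {
        "peliasNormalizer": {
          "type": "custom",
          "filter": ["lowercase", "icu_normalizer", "trim", "unique"]
        }
      }
    }
  }
}
```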