pelias / schema

elasticsearch schema files and tooling
MIT License

Explicitly set analyzers #414

missinglink closed 4 years ago

missinglink commented 4 years ago

Cherry-picked from https://github.com/pelias/schema/pull/412 and based on https://github.com/pelias/schema/pull/413.

This is something I've been wanting to do for a while: it explicitly sets the analyzer and search_analyzer properties for all text fields.

The default behaviour of elasticsearch is to default all analyzers to standard when not otherwise defined, and to default the search_analyzer property to equal analyzer.
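To make those two defaults concrete, here is a minimal sketch of how the effective analyzers for a single field could be resolved. The `effective_analyzers` helper is hypothetical (it is not part of pelias or elasticsearch), and it deliberately ignores index-level default analyzer settings.

```python
# Simplified sketch: looks only at one field mapping and ignores any
# index-level default analyzers that may also be configured.

def effective_analyzers(field_mapping):
    """Return (index_analyzer, search_analyzer) for one text field mapping."""
    # No analyzer defined: fall back to "standard".
    index_analyzer = field_mapping.get("analyzer", "standard")
    # No search_analyzer defined: fall back to the index-time analyzer.
    search_analyzer = field_mapping.get("search_analyzer", index_analyzer)
    return index_analyzer, search_analyzer


implicit = {"type": "text"}
index_only = {"type": "text", "analyzer": "peliasPhrase"}
explicit = {"type": "text", "analyzer": "peliasPhrase", "search_analyzer": "peliasQuery"}

print(effective_analyzers(implicit))    # ('standard', 'standard')
print(effective_analyzers(index_only))  # ('peliasPhrase', 'peliasPhrase')
print(effective_analyzers(explicit))    # ('peliasPhrase', 'peliasQuery')
```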

So while it's not strictly necessary to define an explicit search_analyzer when it's equal to the analyzer, I have made this mandatory and covered it with tests. This ensures that it is considered when adding new fields or adapting existing ones.
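A check along these lines could be sketched as follows. This is not the test code from the PR: `missing_analyzer_settings` and the example mapping are hypothetical, and the sketch only assumes the usual mapping layout in which object fields nest under `properties` and multi-fields under `fields`.

```python
def missing_analyzer_settings(properties, prefix=""):
    """Yield (field_path, missing_key) for text fields lacking an explicit setting."""
    for name, mapping in properties.items():
        path = prefix + name
        if mapping.get("type") == "text":
            for key in ("analyzer", "search_analyzer"):
                if key not in mapping:
                    yield path, key
        # Recurse into object sub-fields and multi-fields.
        for child_key in ("properties", "fields"):
            if child_key in mapping:
                yield from missing_analyzer_settings(mapping[child_key], path + ".")


# Hypothetical mapping fragment for illustration, not copied from the schema.
example_properties = {
    "parent": {
        "type": "object",
        "properties": {
            "country": {
                "type": "text",
                "analyzer": "peliasPhrase",
                "search_analyzer": "peliasQuery",
                "fields": {
                    # search_analyzer intentionally left out to trigger the check
                    "ngram": {"type": "text", "analyzer": "peliasIndexOneEdgeGram"},
                },
            },
        },
    },
    "source": {"type": "keyword"},
}

problems = list(missing_analyzer_settings(example_properties))
print(problems)  # [('parent.country.ngram', 'search_analyzer')]
```

A test suite would then simply assert that the list of problems is empty for the real mapping.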

This is hopefully a no-op refactor: the search_analyzer settings are based on the defaults in pelias/api. It will benefit any queries where the analyzer was not set on the query for whatever reason.

 field                            type                             analyzer                  search_analyzer           normalizer
< name.*                           text                             peliasIndexOneEdgeGram    peliasIndexOneEdgeGram    n/a
< phrase.*                         text                             peliasPhrase              peliasPhrase              n/a
---
> name.*                           text                             peliasIndexOneEdgeGram    peliasQuery               n/a
> phrase.*                         text                             peliasPhrase              peliasQuery               n/a

missinglink commented 4 years ago

I know there are some queries such as this where the analyzer is not being set.

It's hopefully not a big deal right now because that query doesn't target the name.* and phrase.* fields, which require a search_analyzer different from the index-time analyzer.
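The effect on a query that omits its analyzer can be sketched like this. `query_time_analyzer` is a hypothetical helper and `name.default` is used as an assumed example of a name.* field. The precedence shown (query-level analyzer, then the field's search_analyzer, then its analyzer, then standard) is a simplification that ignores index-level defaults.

```python
def query_time_analyzer(match_clause, field_mapping):
    """Pick the analyzer applied to the query text of a match clause."""
    explicit = match_clause.get("analyzer")
    if explicit is not None:
        return explicit  # set on the query itself, so the mapping is not consulted
    index_analyzer = field_mapping.get("analyzer", "standard")
    return field_mapping.get("search_analyzer", index_analyzer)


# name.* mapping before and after this change (values from the table above).
before = {"type": "text", "analyzer": "peliasIndexOneEdgeGram"}
after = {"type": "text", "analyzer": "peliasIndexOneEdgeGram", "search_analyzer": "peliasQuery"}

# Body of a match clause on a field such as name.default, with no analyzer set.
bare_clause = {"query": "paris"}
# The same clause with the analyzer set explicitly on the query.
explicit_clause = {"query": "paris", "analyzer": "peliasQuery"}

print(query_time_analyzer(bare_clause, before))      # peliasIndexOneEdgeGram
print(query_time_analyzer(bare_clause, after))       # peliasQuery
print(query_time_analyzer(explicit_clause, before))  # peliasQuery
```

Queries that already set the analyzer behave the same before and after, which is why the change is expected to be a no-op for them.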

In the future, I'd like to explore having a new search_analyzer for address_parts.street, since right now it's doing a bunch of synonym substitutions at query time which can likely be avoided.

missinglink commented 4 years ago

Also worth noting: the parent.{placetype}.ngram fields were previously using ngram analysis at query time, which is just plain wrong.
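One likely reason this is a problem can be shown with a toy model. The sketch below assumes edge n-grams with a minimum length of 1 and OR matching between query terms. The real analyzer configuration is not shown in this thread, and `edge_ngrams` and `matching_docs` are hypothetical helpers.

```python
def edge_ngrams(token, min_gram=1):
    """Return all prefixes of token with at least min_gram characters."""
    return [token[:n] for n in range(min_gram, len(token) + 1)]


# Index-time analysis: every document is stored as its set of edge n-grams.
indexed = {
    "paris": set(edge_ngrams("paris")),
    "parma": set(edge_ngrams("parma")),
}


def matching_docs(query_tokens):
    """Docs whose indexed tokens contain any of the query tokens (OR semantics)."""
    return sorted(doc for doc, tokens in indexed.items() if tokens & set(query_tokens))


# n-gram analysis at query time: "parma" expands to p, pa, par, parm, parma.
ngram_query = edge_ngrams("parma")
# Plain analysis at query time: the input stays a single token.
plain_query = ["parma"]

print(len(ngram_query), matching_docs(ngram_query))  # 5 ['paris', 'parma']
print(len(plain_query), matching_docs(plain_query))  # 1 ['parma']
print(matching_docs(["par"]))                        # ['paris', 'parma']
```

In this model the n-grammed query searches five terms instead of one and matches "paris" through the shared prefix "p", while the plain query still supports prefix matching because the prefixes were already generated at index time.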

missinglink commented 4 years ago

This looks good to me. It should provide an immediate performance benefit for queries targeting the admin ngram fields, as well as for any queries which were failing to correctly use the peliasQuery analyzer.