pelias / schema

elasticsearch schema files and tooling
MIT License
40 stars 76 forks source link

require "analyzer", "search_analyzer" and "normalizer" to be set for all stringy fields #412

Closed missinglink closed 4 years ago

missinglink commented 4 years ago

This is something I've been wanting to do for a while, it explicitly sets the analyzer and search_analyzer property for all stringy fields.

I've included a tool with this PR which can be used to list all the fields and their corresponding analyzers.

The default behaviour of elasticsearch is to default all analyzers to standard when not defined and to default search_analyzer to equal analyzer when it's defined.

So while it's not totally necessary to define an explicit search_analyzer when it's equal to the analyzer I have made this mandatory and covered it with tests, this ensures that it is considered when adding new fields or adapting existing ones.

I set out to try and do this as a no-op refactor (basing the search_analyzer settings on what's in the defaults for pelias/api), however, I discovered fairly quickly that the existing config is sub-optimal and so I've made the following changes:

 field                            type                             analyzer                  search_analyzer           normalizer
-source                           keyword                          n/a                       n/a                       {none}
-layer                            keyword                          n/a                       n/a                       {none}
-name.*                           text                             peliasIndexOneEdgeGram    peliasIndexOneEdgeGram    n/a
-phrase.*                         text                             peliasPhrase              peliasPhrase              n/a
+source                           keyword                          n/a                       n/a                       peliasKeyword
+layer                            keyword                          n/a                       n/a                       peliasKeyword
+name.*                           text                             peliasIndexOneEdgeGram    peliasQuery               n/a
+phrase.*                         text                             peliasPhrase              peliasQuery               n/a
 address_parts.name               text                             keyword                   keyword                   n/a
 address_parts.unit               text                             peliasUnit                peliasUnit                n/a
 address_parts.number             text                             peliasHousenumber         peliasHousenumber         n/a
 address_parts.street             text                             peliasStreet              peliasStreet              n/a
 address_parts.cross_street       text                             peliasStreet              peliasStreet              n/a
 address_parts.zip                text                             peliasZip                 peliasZip                 n/a
 parent.continent                 text                             peliasAdmin               peliasAdmin               n/a
 parent.continent_a               text                             peliasAdmin               peliasAdmin               n/a
-parent.continent_id              keyword                          n/a                       n/a                       {none}
+parent.continent_id              keyword                          n/a                       n/a                       peliasKeyword
 parent.ocean                     text                             peliasAdmin               peliasAdmin               n/a
 parent.ocean_a                   text                             peliasAdmin               peliasAdmin               n/a
-parent.ocean_id                  keyword                          n/a                       n/a                       {none}
+parent.ocean_id                  keyword                          n/a                       n/a                       peliasKeyword
 parent.empire                    text                             peliasAdmin               peliasAdmin               n/a
 parent.empire_a                  text                             peliasAdmin               peliasAdmin               n/a
-parent.empire_id                 keyword                          n/a                       n/a                       {none}
+parent.empire_id                 keyword                          n/a                       n/a                       peliasKeyword
 parent.country                   text                             peliasAdmin               peliasAdmin               n/a
 parent.country_a                 text                             peliasAdmin               peliasAdmin               n/a
-parent.country_id                keyword                          n/a                       n/a                       {none}
+parent.country_id                keyword                          n/a                       n/a                       peliasKeyword
 parent.dependency                text                             peliasAdmin               peliasAdmin               n/a
 parent.dependency_a              text                             peliasAdmin               peliasAdmin               n/a
-parent.dependency_id             keyword                          n/a                       n/a                       {none}
+parent.dependency_id             keyword                          n/a                       n/a                       peliasKeyword
 parent.marinearea                text                             peliasAdmin               peliasAdmin               n/a
 parent.marinearea_a              text                             peliasAdmin               peliasAdmin               n/a
-parent.marinearea_id             keyword                          n/a                       n/a                       {none}
+parent.marinearea_id             keyword                          n/a                       n/a                       peliasKeyword
 parent.macroregion               text                             peliasAdmin               peliasAdmin               n/a
 parent.macroregion_a             text                             peliasAdmin               peliasAdmin               n/a
-parent.macroregion_id            keyword                          n/a                       n/a                       {none}
+parent.macroregion_id            keyword                          n/a                       n/a                       peliasKeyword
 parent.region                    text                             peliasAdmin               peliasAdmin               n/a
 parent.region_a                  text                             peliasAdmin               peliasAdmin               n/a
-parent.region_id                 keyword                          n/a                       n/a                       {none}
+parent.region_id                 keyword                          n/a                       n/a                       peliasKeyword
 parent.macrocounty               text                             peliasAdmin               peliasAdmin               n/a
 parent.macrocounty_a             text                             peliasAdmin               peliasAdmin               n/a
-parent.macrocounty_id            keyword                          n/a                       n/a                       {none}
+parent.macrocounty_id            keyword                          n/a                       n/a                       peliasKeyword
 parent.county                    text                             peliasAdmin               peliasAdmin               n/a
 parent.county_a                  text                             peliasAdmin               peliasAdmin               n/a
-parent.county_id                 keyword                          n/a                       n/a                       {none}
+parent.county_id                 keyword                          n/a                       n/a                       peliasKeyword
 parent.locality                  text                             peliasAdmin               peliasAdmin               n/a
 parent.locality_a                text                             peliasAdmin               peliasAdmin               n/a
-parent.locality_id               keyword                          n/a                       n/a                       {none}
+parent.locality_id               keyword                          n/a                       n/a                       peliasKeyword
 parent.borough                   text                             peliasAdmin               peliasAdmin               n/a
 parent.borough_a                 text                             peliasAdmin               peliasAdmin               n/a
-parent.borough_id                keyword                          n/a                       n/a                       {none}
+parent.borough_id                keyword                          n/a                       n/a                       peliasKeyword
 parent.localadmin                text                             peliasAdmin               peliasAdmin               n/a
 parent.localadmin_a              text                             peliasAdmin               peliasAdmin               n/a
-parent.localadmin_id             keyword                          n/a                       n/a                       {none}
+parent.localadmin_id             keyword                          n/a                       n/a                       peliasKeyword
 parent.neighbourhood             text                             peliasAdmin               peliasAdmin               n/a
 parent.neighbourhood_a           text                             peliasAdmin               peliasAdmin               n/a
-parent.neighbourhood_id          keyword                          n/a                       n/a                       {none}
+parent.neighbourhood_id          keyword                          n/a                       n/a                       peliasKeyword
 parent.postalcode                text                             peliasZip                 peliasZip                 n/a
 parent.postalcode_a              text                             peliasZip                 peliasZip                 n/a
-parent.postalcode_id             keyword                          n/a                       n/a                       {none}
+parent.postalcode_id             keyword                          n/a                       n/a                       peliasKeyword
 center_point                     geo_point                        n/a                       n/a                       n/a
 shape                            geo_shape                        n/a                       n/a                       n/a
 bounding_box                     keyword                          n/a                       n/a                       {none}
-source_id                        keyword                          n/a                       n/a                       {none}
-category                         keyword                          n/a                       n/a                       {none}
+source_id                        keyword                          n/a                       n/a                       peliasKeyword
+category                         keyword                          n/a                       n/a                       peliasKeyword
 population                       long                             n/a                       n/a                       n/a
 popularity                       long                             n/a                       n/a                       n/a
 addendum.*                       keyword                          n/a                       n/a                       {none}
missinglink commented 4 years ago

I would like to change the search_analyzer for the address_parts.* and parent.* fields but we can do that in another PR, this one is meant to be as much as a no-op as possible.

I believe the existing search_analyzer for address_parts.street is doing more work at query-time than required and likely returning more hits than we need.

missinglink commented 4 years ago

ugh ok, I force-pushed an update to fix the failing integration tests. the keyword datatype uses a 'normalizer' in place of an 'analyzer' so I've added one of those, as a result, the ID fields and source/layer fields are now case-insensitive when they were previously case-sensitive.

missinglink commented 4 years ago

this PR is partially failing on ES5 due to the trim filter not being supported in a normalizer until Lucene 7.3 (Elastic v6.4)

This PR ended up growing quite large and complex with many different changes all in one go. I'm going to split this up into smaller PRs so that things can be reviewed and merged individually.