pelias / schema

elasticsearch schema files and tooling
MIT License

re-enable support for custom multi-word synonyms #457

Closed missinglink closed 4 years ago

missinglink commented 4 years ago

As discussed in https://github.com/pelias/schema/issues/456, the work in https://github.com/pelias/schema/pull/453 had the unintended consequence of dropping support for multi-word custom synonyms.

My general guidance here is that multi-word synonyms are poorly supported by Lucene/Elasticsearch and so should be avoided where possible. If you do use them, take great care to ensure they are compatible with the match_phrase queries used by Pelias.

This new Elasticsearch doc explains the issues with multi-word synonyms, with accompanying graphics: https://www.elastic.co/guide/en/elasticsearch/reference/7.6/token-graphs.html
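
To make the match_phrase concern concrete, here is a minimal sketch of positional phrase matching. It is a simplified model, not Lucene's implementation, and the example terms, positions and the `phraseMatches` helper are assumptions made up for illustration. It shows what can happen when a multi-word synonym is flattened at index time, so that the single-word form is stacked on the first position and the fact that it spans three positions is lost:

```javascript
// Simplified model of positional phrase matching; NOT Lucene's actual implementation.
// Hypothetical tokens for a text where "dns" was added as a synonym of
// "domain name system": "dns" is stacked at position 0 and its span is lost.
const indexedTokens = [
  { term: 'domain', position: 0 },
  { term: 'dns', position: 0 },
  { term: 'name', position: 1 },
  { term: 'system', position: 2 },
  { term: 'is', position: 3 },
  { term: 'fragile', position: 4 }
];

// True when the terms occur at consecutive positions somewhere in the token list.
function phraseMatches(tokens, terms) {
  const has = (term, position) =>
    tokens.some((t) => t.term === term && t.position === position);
  const starts = tokens
    .filter((t) => t.term === terms[0])
    .map((t) => t.position);
  return starts.some((start) => terms.every((term, i) => has(term, start + i)));
}

// The original wording still matches as a phrase.
console.log(phraseMatches(indexedTokens, ['domain', 'name', 'system', 'is', 'fragile'])); // true

// The synonym form does not: "is" sits at position 3, not position 1.
console.log(phraseMatches(indexedTokens, ['dns', 'is', 'fragile'])); // false
```

In this model the synonym is searchable on its own but breaks as soon as it is part of a longer phrase, which is the kind of incompatibility to watch for.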

Where possible I'd recommend using 'aliases' (i.e. doc.setNameAlias()) instead. This is a reliable method of achieving the same thing, although it's far less convenient because it works on a per-record basis.

So, having said all that, this PR re-enables support for multi-word synonyms (in custom synonyms files only) in order to avoid breaking backwards compatibility.

In order to do this I had to move the custom multi-word synonyms outside the multiplexer. Apparently multiplexers emit tokens one-by-one to each of their branches, which prevents the 'look-ahead' required by any multi-term analysis within a branch.
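
As a rough sketch of the resulting layout, the analysis settings could look something like the fragment below. All filter and analyzer names, the tokenizer and the synonym entries are hypothetical and not copied from the Pelias settings, and placing the multi-word filter before the multiplexer is an assumption. The point is only that the multi-word synonym filter sits in the analyzer's own filter chain rather than inside a multiplexer branch:

```javascript
// Illustrative Elasticsearch analysis settings; every name here is hypothetical.
const analysis = {
  filter: {
    // Multi-word synonyms live in their own filter, outside the multiplexer,
    // so the filter sees the whole token stream and can look ahead.
    custom_multiword_synonyms: {
      type: 'synonym',
      synonyms: ['new york city, nyc']
    },
    // Single-word synonyms can stay inside a multiplexer branch.
    custom_singleword_synonyms: {
      type: 'synonym',
      synonyms: ['street, st']
    },
    name_synonyms_multiplexer: {
      type: 'multiplexer',
      filters: ['custom_singleword_synonyms']
    }
  },
  analyzer: {
    example_name_analyzer: {
      type: 'custom',
      tokenizer: 'whitespace',
      // Assumed ordering: multi-word synonyms run before the multiplexer.
      filter: ['lowercase', 'custom_multiword_synonyms', 'name_synonyms_multiplexer']
    }
  }
};

console.log(JSON.stringify(analysis, null, 2));
```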

resolves https://github.com/pelias/schema/issues/456

missinglink commented 4 years ago

I've added https://github.com/pelias/schema/pull/457/commits/1baeadfe9529e8ede8207c8a688ee5459efe3c48 to address an issue where the linter was using /\s/ instead of [\\s/\\\\-]+ to determine which tokens were multi-word.
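
The difference between the two patterns can be seen with a small sketch. The `isMultiWord` helper and the sample tokens are hypothetical and this is not the linter's actual code; only the two patterns come from the comment above:

```javascript
// The old check: only whitespace marks a token as multi-word.
const whitespaceOnly = /\s/;

// The new check, built from the same escaped string quoted above:
// whitespace, forward slash, backslash and hyphen all count as separators.
const separators = new RegExp('[\\s/\\\\-]+');

// Hypothetical helper: does the token contain a separator under the given pattern?
function isMultiWord(token, pattern) {
  return pattern.test(token.trim());
}

for (const token of ['avenue', 'north west', 'drive-in', 'a/b']) {
  console.log(token, isMultiWord(token, whitespaceOnly), isMultiWord(token, separators));
}
// avenue false false
// north west true true
// drive-in false true
// a/b false true
```

Under the second pattern a hyphenated entry such as 'drive-in' is treated as multi-word, which it was not under /\s/.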

As a result I've had to remove a few hyphenated synonyms from the canonical synonym lists.

orangejulius commented 4 years ago

Nice. Does it make sense to add an integration test for multi-word synonyms?