pelias / schema

elasticsearch schema files and tooling
MIT License
40 stars 76 forks source link

"custom_street" synonyms should be applied to name.* and phrase.* indices #449

Closed missinglink closed 4 years ago

missinglink commented 4 years ago

As mentioned in https://github.com/pelias/schema/pull/446#discussion_r440124764

If developers extend the synonyms lists by editing synonyms/custom_street.txt the synonyms are only being applied to the address_parts.street field and not to name.* & phrase.* indices.

The problem with this is that we use the name.* & phrase.* indices as top-level MUST conditions in most of our queries so adding custom street synonyms has very little effect.

The current workaround is for developer to manually add the synonyms filter to the indices as per: https://github.com/Joxit/pelias-schema/commit/514fb027cb5dd3b2be5ad1f4da62db0f24f20951

I think this was an oversight at the time and worth fixing.

orangejulius commented 4 years ago

Oh nice, yeah, this duplication has always been annoying for us, and has lead to much confusion for Pelias users.

missinglink commented 4 years ago

I'm happy to merge this now, but the thing is that the synonyms/custom_street.txt file is empty by default so it's going to be a no-op for anyone running the canonical synonyms which ship with Pelias.

@Joxit 👍 or 👎 ?

Joxit commented 4 years ago

Yep, I'm ok with this :smile: this will help me for future builds !

orangejulius commented 4 years ago

Do we need this PR after #453? I don't think so, but just wanted to confirm. Feel free to close if it's no longer needed.

missinglink commented 4 years ago

This has been resolved in a cleaner way using the name_synonyms_multiplexer

Screenshot 2020-07-14 at 11 43 59
missinglink commented 4 years ago

oh... actually... the multiplexer still doesn't list custom_street.. I think the issue is still valid and unresolved but the code will require some modernisation to include custom_street within name_synonyms_multiplexer.

missinglink commented 4 years ago

I'd actually have preferred to delete custom_* synonym files as part of the recent refactor, from what I've seen of 3rd-party synonyms files they are usually very messy and error-prone.

I didn't drop support for custom_* synonym files because developers are already using them and it would potentially break functionality.

I'm not going to do any further work on this topic but I'd be happy to review and merge a PR from a developer who is actively using custom synonyms and can test it out on their setup.

missinglink commented 4 years ago

This issue came up via a client today, I've modernized the code as per https://github.com/pelias/schema/pull/449#issuecomment-658084758 and rebased it against master.

In addition to including custom_street in the name.* and phrase.* indices, I'm also now including custom_admin for completeness.

I'd like to solicit some feedback before merging this, I suspect it will not have any issues besides the pervasive 'viral synonyms' issue related to multi-word synonyms that we inherited from Lucene and have always had to deal with.

A good example of where this is relevant is "MLK" vs "Martin Luther King" vs "Martin Luther King Jr" which is a common synonym in the United States but also unique enough in a global context to possibly not include in the core Pelias synonyms files.

missinglink commented 4 years ago

@joxit this got put on hold when I was doing the recent synonyms work but is open for testing and merging now.

It's slightly different what what you tested last time because I've rebased it against master to bring in those changes.

We don't use custom synonyms so any feedback you could provide would be helpful to confirm it's good to go.

I was hoping that it would be able to replace https://github.com/Joxit/pelias-schema/commit/514fb027cb5dd3b2be5ad1f4da62db0f24f20951

Joxit commented 4 years ago

In fact, since #453 includes all the synonyms I need, I am no longer using my fork (I did a full build yesterday).

I might need the custom street again someday :man_shrugging: This PR is still OK and custom_street and custom_admin are definitely useful in peliasPhrase

missinglink commented 4 years ago

I confirmed this works as expected by adding the following line to custom_street.txt:

note: no other custom synonyms are enabled

Martin Luther King,ML King,M L King,MLK

And then running an elyzer query against peliasPhrase (which is used on the name. and phrase. fields).

elyzer --index pelias --analyzer peliasPhrase 'MLK Boulevard'
CHAR_FILTER: punctuation
{0:MLK Boulevard}
CHAR_FILTER: nfkc_normalizer
{0:MLK Boulevard}
TOKENIZER: peliasTokenizer
{0:MLK} {1:Boulevard}
TOKEN_FILTER: lowercase
{0:mlk} {1:boulevard}
TOKEN_FILTER: trim
{0:mlk} {1:boulevard}
TOKEN_FILTER: remove_duplicate_spaces
{0:mlk} {1:boulevard}
TOKEN_FILTER: synonyms/custom_name/multiword
{0:mlk} {1:boulevard}
TOKEN_FILTER: synonyms/custom_street/multiword
{0:mlk,martin,ml,m} {1:boulevard,luther,king,l} {2:king,king}
TOKEN_FILTER: synonyms/custom_admin/multiword
{0:mlk,martin,ml,m} {1:boulevard,luther,king,l} {2:king,king}
TOKEN_FILTER: name_synonyms_multiplexer
{0:mlk,martin,ml,mall,mill,m}   {1:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld,luther,king,kg,l}   {2:king,kg}
TOKEN_FILTER: icu_folding
{0:mlk,martin,ml,mall,mill,m}   {1:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld,luther,king,kg,l}   {2:king,kg}
TOKEN_FILTER: remove_ordinals
{0:mlk,martin,ml,mall,mill,m}   {1:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld,luther,king,kg,l}   {2:king,kg}
TOKEN_FILTER: unique_only_same_position
{0:mlk,martin,ml,mall,mill,m}   {1:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld,luther,king,kg,l}   {2:king,kg}
TOKEN_FILTER: notnull
{0:mlk,martin,ml,mall,mill,m}   {1:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld,luther,king,kg,l}   {2:king,kg}
TOKEN_FILTER: flatten_graph
{0:mlk,martin,ml,mall,mill,m}   {1:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld,luther,king,kg,l}   {2:king,kg}
missinglink commented 4 years ago

The above also highlights the issue with multi-word synonyms where boulevard (and variants) are indexed at position:1 for xxx boulevard and not at position:3 as we would prefer for xxx xxx xxx boulevard.

missinglink commented 4 years ago

The inverse is true for when the data provider gives us Martin Luther King Boulevard:

elyzer --index pelias --analyzer peliasPhrase 'Martin Luther King Boulevard'
CHAR_FILTER: punctuation
{0:Martin Luther King Boulevard}
CHAR_FILTER: nfkc_normalizer
{0:Martin Luther King Boulevard}
TOKENIZER: peliasTokenizer
{0:Martin}  {1:Luther}  {2:King}    {3:Boulevard}
TOKEN_FILTER: lowercase
{0:martin}  {1:luther}  {2:king}    {3:boulevard}
TOKEN_FILTER: trim
{0:martin}  {1:luther}  {2:king}    {3:boulevard}
TOKEN_FILTER: remove_duplicate_spaces
{0:martin}  {1:luther}  {2:king}    {3:boulevard}
TOKEN_FILTER: synonyms/custom_name/multiword
{0:martin}  {1:luther}  {2:king}    {3:boulevard}
TOKEN_FILTER: synonyms/custom_street/multiword
{0:martin,ml,m,mlk} {1:luther,king,l}   {2:king,king}   {3:boulevard}
TOKEN_FILTER: synonyms/custom_admin/multiword
{0:martin,ml,m,mlk} {1:luther,king,l}   {2:king,king}   {3:boulevard}
TOKEN_FILTER: name_synonyms_multiplexer
{0:martin,ml,mall,mill,m,mlk}   {1:luther,king,kg,l}    {2:king,kg} {3:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld}
TOKEN_FILTER: icu_folding
{0:martin,ml,mall,mill,m,mlk}   {1:luther,king,kg,l}    {2:king,kg} {3:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld}
TOKEN_FILTER: remove_ordinals
{0:martin,ml,mall,mill,m,mlk}   {1:luther,king,kg,l}    {2:king,kg} {3:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld}
TOKEN_FILTER: unique_only_same_position
{0:martin,ml,mall,mill,m,mlk}   {1:luther,king,kg,l}    {2:king,kg} {3:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld}
TOKEN_FILTER: notnull
{0:martin,ml,mall,mill,m,mlk}   {1:luther,king,kg,l}    {2:king,kg} {3:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld}
TOKEN_FILTER: flatten_graph
{0:martin,ml,mall,mill,m,mlk}   {1:luther,king,kg,l}    {2:king,kg} {3:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld}

In this case MLK is position:0 and Boulevard is position:3 🤷‍♂️