Closed missinglink closed 4 years ago
Oh nice, yeah, this duplication has always been annoying for us, and has lead to much confusion for Pelias users.
I'm happy to merge this now, but the thing is that the synonyms/custom_street.txt
file is empty by default so it's going to be a no-op for anyone running the canonical synonyms which ship with Pelias.
@Joxit 👍 or 👎 ?
Yep, I'm ok with this :smile: this will help me for future builds !
Do we need this PR after #453? I don't think so, but just wanted to confirm. Feel free to close if it's no longer needed.
This has been resolved in a cleaner way using the name_synonyms_multiplexer
oh... actually... the multiplexer still doesn't list custom_street
.. I think the issue is still valid and unresolved but the code will require some modernisation to include custom_street
within name_synonyms_multiplexer
.
I'd actually have preferred to delete custom_*
synonym files as part of the recent refactor, from what I've seen of 3rd-party synonyms files they are usually very messy and error-prone.
I didn't drop support for custom_*
synonym files because developers are already using them and it would potentially break functionality.
I'm not going to do any further work on this topic but I'd be happy to review and merge a PR from a developer who is actively using custom synonyms and can test it out on their setup.
This issue came up via a client today, I've modernized the code as per https://github.com/pelias/schema/pull/449#issuecomment-658084758 and rebased it against master
.
In addition to including custom_street
in the name.*
and phrase.*
indices, I'm also now including custom_admin
for completeness.
I'd like to solicit some feedback before merging this, I suspect it will not have any issues besides the pervasive 'viral synonyms' issue related to multi-word synonyms that we inherited from Lucene and have always had to deal with.
A good example of where this is relevant is "MLK" vs "Martin Luther King" vs "Martin Luther King Jr" which is a common synonym in the United States but also unique enough in a global context to possibly not include in the core Pelias synonyms files.
@joxit this got put on hold when I was doing the recent synonyms work but is open for testing and merging now.
It's slightly different what what you tested last time because I've rebased it against master to bring in those changes.
We don't use custom synonyms so any feedback you could provide would be helpful to confirm it's good to go.
I was hoping that it would be able to replace https://github.com/Joxit/pelias-schema/commit/514fb027cb5dd3b2be5ad1f4da62db0f24f20951
In fact, since #453 includes all the synonyms I need, I am no longer using my fork (I did a full build yesterday).
I might need the custom street again someday :man_shrugging:
This PR is still OK and custom_street
and custom_admin
are definitely useful in peliasPhrase
I confirmed this works as expected by adding the following line to custom_street.txt
:
note: no other custom synonyms are enabled
Martin Luther King,ML King,M L King,MLK
And then running an elyzer
query against peliasPhrase
(which is used on the name. and phrase. fields).
elyzer --index pelias --analyzer peliasPhrase 'MLK Boulevard'
CHAR_FILTER: punctuation
{0:MLK Boulevard}
CHAR_FILTER: nfkc_normalizer
{0:MLK Boulevard}
TOKENIZER: peliasTokenizer
{0:MLK} {1:Boulevard}
TOKEN_FILTER: lowercase
{0:mlk} {1:boulevard}
TOKEN_FILTER: trim
{0:mlk} {1:boulevard}
TOKEN_FILTER: remove_duplicate_spaces
{0:mlk} {1:boulevard}
TOKEN_FILTER: synonyms/custom_name/multiword
{0:mlk} {1:boulevard}
TOKEN_FILTER: synonyms/custom_street/multiword
{0:mlk,martin,ml,m} {1:boulevard,luther,king,l} {2:king,king}
TOKEN_FILTER: synonyms/custom_admin/multiword
{0:mlk,martin,ml,m} {1:boulevard,luther,king,l} {2:king,king}
TOKEN_FILTER: name_synonyms_multiplexer
{0:mlk,martin,ml,mall,mill,m} {1:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld,luther,king,kg,l} {2:king,kg}
TOKEN_FILTER: icu_folding
{0:mlk,martin,ml,mall,mill,m} {1:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld,luther,king,kg,l} {2:king,kg}
TOKEN_FILTER: remove_ordinals
{0:mlk,martin,ml,mall,mill,m} {1:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld,luther,king,kg,l} {2:king,kg}
TOKEN_FILTER: unique_only_same_position
{0:mlk,martin,ml,mall,mill,m} {1:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld,luther,king,kg,l} {2:king,kg}
TOKEN_FILTER: notnull
{0:mlk,martin,ml,mall,mill,m} {1:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld,luther,king,kg,l} {2:king,kg}
TOKEN_FILTER: flatten_graph
{0:mlk,martin,ml,mall,mill,m} {1:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld,luther,king,kg,l} {2:king,kg}
The above also highlights the issue with multi-word synonyms where boulevard
(and variants) are indexed at position:1
for xxx boulevard
and not at position:3
as we would prefer for xxx xxx xxx boulevard
.
The inverse is true for when the data provider gives us Martin Luther King Boulevard
:
elyzer --index pelias --analyzer peliasPhrase 'Martin Luther King Boulevard'
CHAR_FILTER: punctuation
{0:Martin Luther King Boulevard}
CHAR_FILTER: nfkc_normalizer
{0:Martin Luther King Boulevard}
TOKENIZER: peliasTokenizer
{0:Martin} {1:Luther} {2:King} {3:Boulevard}
TOKEN_FILTER: lowercase
{0:martin} {1:luther} {2:king} {3:boulevard}
TOKEN_FILTER: trim
{0:martin} {1:luther} {2:king} {3:boulevard}
TOKEN_FILTER: remove_duplicate_spaces
{0:martin} {1:luther} {2:king} {3:boulevard}
TOKEN_FILTER: synonyms/custom_name/multiword
{0:martin} {1:luther} {2:king} {3:boulevard}
TOKEN_FILTER: synonyms/custom_street/multiword
{0:martin,ml,m,mlk} {1:luther,king,l} {2:king,king} {3:boulevard}
TOKEN_FILTER: synonyms/custom_admin/multiword
{0:martin,ml,m,mlk} {1:luther,king,l} {2:king,king} {3:boulevard}
TOKEN_FILTER: name_synonyms_multiplexer
{0:martin,ml,mall,mill,m,mlk} {1:luther,king,kg,l} {2:king,kg} {3:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld}
TOKEN_FILTER: icu_folding
{0:martin,ml,mall,mill,m,mlk} {1:luther,king,kg,l} {2:king,kg} {3:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld}
TOKEN_FILTER: remove_ordinals
{0:martin,ml,mall,mill,m,mlk} {1:luther,king,kg,l} {2:king,kg} {3:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld}
TOKEN_FILTER: unique_only_same_position
{0:martin,ml,mall,mill,m,mlk} {1:luther,king,kg,l} {2:king,kg} {3:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld}
TOKEN_FILTER: notnull
{0:martin,ml,mall,mill,m,mlk} {1:luther,king,kg,l} {2:king,kg} {3:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld}
TOKEN_FILTER: flatten_graph
{0:martin,ml,mall,mill,m,mlk} {1:luther,king,kg,l} {2:king,kg} {3:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld}
In this case MLK
is position:0
and Boulevard
is position:3
🤷♂️
As mentioned in https://github.com/pelias/schema/pull/446#discussion_r440124764
If developers extend the synonyms lists by editing
synonyms/custom_street.txt
the synonyms are only being applied to theaddress_parts.street
field and not toname.*
&phrase.*
indices.The problem with this is that we use the
name.*
&phrase.*
indices as top-levelMUST
conditions in most of our queries so adding custom street synonyms has very little effect.The current workaround is for developer to manually add the synonyms filter to the indices as per: https://github.com/Joxit/pelias-schema/commit/514fb027cb5dd3b2be5ad1f4da62db0f24f20951
I think this was an oversight at the time and worth fixing.