Closed missinglink closed 5 years ago
this issue is caused by the way synonym expansions are handled by the peliasIndexOneEdgeGram analyser (the one which produces ngrams for the name.default field).
specifically, the issue only occurs when the source data provides the street suffix in its abbreviated form (ie. ave), it does not affect source data which is provided in its expanded form (ie. avenue).
additionally, the bug only affects a subset of street suffixes we deem as 'unsafe'. The term 'unsafe' refers to the fact that we cannot, with great enough certainty, establish which expansion is the correct substitution for the abbreviation found in the source data.
at time of writing, the current list of 'unsafe' street suffix abbreviations is:
ave, br, byp, cir, cl, con, cor, ct, cres, curv, dr, esp,
ext, gln, is, orch, pr, pl, riv, sq, st, terr, tr, vis
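As a minimal sketch of how this list might be consulted in code, the check below treats a token as 'unsafe' when it appears in the list above. The names UNSAFE_SUFFIXES and is_unsafe_suffix are hypothetical and are not taken from the pelias codebase; only the abbreviations themselves come from the list.

```python
# The 'unsafe' abbreviations, copied verbatim from the list above.
UNSAFE_SUFFIXES = frozenset("""
ave br byp cir cl con cor ct cres curv dr esp
ext gln is orch pr pl riv sq st terr tr vis
""".split())


def is_unsafe_suffix(token: str) -> bool:
    """Return True when the token is an abbreviation with no single safe expansion.

    Hypothetical helper: normalises case and a trailing full stop before the lookup.
    """
    return token.strip().lower().rstrip(".") in UNSAFE_SUFFIXES


if __name__ == "__main__":
    for token in ("St.", "AVE", "avenue", "blvd"):
        print(token, is_unsafe_suffix(token))
```

Note that this only covers the first meaning of 'unsafe' (ambiguous abbreviations in source data), not partially typed autocomplete tokens.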
there is currently also some ambiguity to the term 'unsafe' in that it also refers to partially specified input tokens (from autocomplete) which are not yet safe to expand.
originally the term 'unsafe' was used to cover both cases, but it might be better now to distinguish the two cases.
some examples of why this is difficult are:
This could expand to Main Street or Main Straße etc. depending on the locale.
This most likely refers to Saint Pauls or Sankt Pauls etc. and should not be considered synonymous with the input text Street Pauls and certainly not with Pauls Street.
This could be the start of many different words including Istanbul and Islington. It should not be expanded to Island and also should not be expanded to Island OR Istanbul OR Islington, otherwise a query such as Angel Island would match Angel Islington.
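The last example can be illustrated with a small simulation. This is a sketch only: it models an analyser as a function that turns text into a set of tokens and treats a synonym group as fully equivalent terms. It is not how elasticsearch or the pelias schema is implemented, and the names analyse and matches are hypothetical.

```python
# Hypothetical over-eager rule: 'is' is treated as equivalent to every word it might start.
OVER_EAGER_GROUP = frozenset({"is", "island", "istanbul", "islington"})


def analyse(text: str, synonym_group: frozenset = frozenset()) -> set:
    """Lowercase and split the text, adding the whole synonym group for any member token."""
    tokens = set()
    for token in text.lower().split():
        tokens.add(token)
        if token in synonym_group:
            tokens |= synonym_group
    return tokens


def matches(query: str, document: str, synonym_group: frozenset = frozenset()) -> bool:
    """A query matches when every analysed query token is present in the analysed document."""
    return analyse(query, synonym_group) <= analyse(document, synonym_group)


if __name__ == "__main__":
    # Without the rule the two names are distinct.
    print(matches("Angel Island", "Angel Islington"))                    # False
    # With the rule, Island and Islington collapse into the same token set.
    print(matches("Angel Island", "Angel Islington", OVER_EAGER_GROUP))  # True
```

Because every member of the group expands to the whole group, any two members become indistinguishable at match time, which is the false positive described above.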
In order to make the correct expansion, we need more context. It looks like, at a minimum, the analyzer will need to be aware of:
as it is currently, the elasticsearch analysers are not intelligent enough to perform this sort of substitution. the amount of if-then-else logic involved suggests that this sort of complex analysis would be best handled by a 'proper programming language', which would also make it possible to test the edge cases and provide a level of assurance of each substitution's 'safety' (in context).
the simplest solution, for now, is to request that source data is provided in its expanded form. this should be considered best practice for the reasons mentioned above.
note: there are existing tools such as libpostal which have locale awareness. this sort of tool can be used to expand the source data abbreviations before they are sent to elasticsearch for indexing.
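A rough sketch of what such a pre-indexing step could look like is shown below. It does not use libpostal's API; the per-locale table SUFFIX_BY_LOCALE and the function expand_trailing_suffix are hypothetical, and the rule used (only expand a known abbreviation when it is the final token of a multi-token name) is an illustrative assumption rather than a complete solution.

```python
# Hypothetical per-locale expansions, applied to the trailing street suffix only.
SUFFIX_BY_LOCALE = {
    "en": {"st": "street", "ave": "avenue"},
    "de": {"st": "straße"},
}


def expand_trailing_suffix(name: str, locale: str) -> str:
    """Expand an abbreviated final token using the table for the given locale.

    Leading tokens are never touched, so 'St Pauls' is left alone, and a
    single-token name is returned unchanged because it carries no context.
    """
    tokens = name.split()
    if len(tokens) < 2:
        return name
    table = SUFFIX_BY_LOCALE.get(locale, {})
    last = tokens[-1].lower().rstrip(".")
    if last in table:
        tokens[-1] = table[last].capitalize()
    return " ".join(tokens)


if __name__ == "__main__":
    print(expand_trailing_suffix("Main St", "en"))   # Main Street
    print(expand_trailing_suffix("Main St", "de"))   # Main Straße
    print(expand_trailing_suffix("St Pauls", "en"))  # St Pauls
```

Running this kind of expansion on the source data means the index only ever sees the expanded form, so the 'unsafe' abbreviations never reach the analyser.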
This is fixed and both queries listed above now return the same (correct) results. It's likely been fixed since https://github.com/pelias/schema/pull/310
difference in matching between st and street: