pelias / pelias

Pelias is a modular open-source geocoder using Elasticsearch.
https://pelias.io
MIT License

autocomplete street synonyms #563

Closed missinglink closed 5 years ago

missinglink commented 7 years ago

difference in matching between st and street:

/v1/autocomplete?text=901 hastings st

1)  901 Hastings St, Pittsburgh, PA, USA
2)  901 E Hastings St, Vancouver, Canada

/v1/autocomplete?text=901 hastings street

... no results
missinglink commented 7 years ago

this issue is caused by the way synonym expansions are handled by the peliasIndexOneEdgeGram analyser (the one which produces ngrams for the name.default field).

specifically, the issue only occurs when the source data provides the street suffix in its abbreviated form (ie. ave); it does not affect source data which is provided in its expanded form (ie. avenue).
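
To illustrate why only abbreviated source data is affected, here is a minimal sketch of prefix (edge n-gram) matching. It is a simplification and not the actual peliasIndexOneEdgeGram analyser: the function names and the `min_gram`/`max_gram` values are assumptions made for this example, and synonym handling is left out entirely.

```python
def edge_ngrams(token, min_gram=1, max_gram=18):
    """Return the leading prefixes of a token, as an edge n-gram filter would.

    min_gram and max_gram are illustrative values, not the Pelias settings.
    """
    token = token.lower()
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def index_terms(name):
    """Collect the prefixes of every whitespace-separated token in a name."""
    terms = set()
    for token in name.split():
        terms.update(edge_ngrams(token))
    return terms


def matches(query, indexed_name):
    """A query matches only if every query token is one of the indexed prefixes."""
    terms = index_terms(indexed_name)
    return all(token.lower() in terms for token in query.split())


abbreviated = "901 Hastings St"      # source data with the abbreviated suffix
expanded = "901 Hastings Street"     # source data with the expanded suffix

print(matches("901 hastings st", abbreviated))
print(matches("901 hastings street", abbreviated))
print(matches("901 hastings st", expanded))
print(matches("901 hastings street", expanded))
```

In this model, a record indexed as `St` only yields the prefixes `s` and `st`, so the query token `street` has nothing to match. A record indexed as `Street` yields both `st` and `street`, so either query finds it.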

additionally, the bug only affects a subset of street suffixes we deem 'unsafe'. The term 'unsafe' refers to the fact that we cannot, with great enough certainty, establish which expansion is the correct substitution for the abbreviation found in the source data.

at time of writing, the current list of 'unsafe' street suffix abbreviations is:

ave, br, byp, cir, cl, con, cor, ct, cres, curv, dr, esp, 
ext, gln, is, orch, pr, pl, riv, sq, st, terr, tr, vis
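
As a rough sketch, the list above could be held as a set and used to flag tokens that should not be expanded blindly. The names `UNSAFE_SUFFIXES` and `is_unsafe` are hypothetical and are not taken from the Pelias codebase.

```python
# The 'unsafe' abbreviations listed above, copied verbatim.
UNSAFE_SUFFIXES = frozenset(
    """
    ave br byp cir cl con cor ct cres curv dr esp
    ext gln is orch pr pl riv sq st terr tr vis
    """.split()
)


def is_unsafe(token):
    """True when a token is one of the ambiguous abbreviations (case-insensitive)."""
    return token.lower() in UNSAFE_SUFFIXES


def unsafe_tokens(name):
    """Return the tokens of a name that would need extra context before expanding."""
    return [token for token in name.split() if is_unsafe(token)]


print(unsafe_tokens("901 Hastings St"))
print(unsafe_tokens("901 Hastings Street"))
```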

there is currently also some ambiguity in the term 'unsafe', in that it also refers to partially specified input tokens (from autocomplete) which are not yet safe to expand.

originally the term 'unsafe' was used to cover both cases, but it might be better now to distinguish the two cases.

some examples of why this is difficult are:

This could expand to Main Street or Main Straße etc. depending on the locale.

This most likely refers to Saint Pauls or Sankt Pauls etc. and should not be considered synonymous with the input text Street Pauls and certainly not with Pauls Street

This could be the start of many different words including Istanbul and Islington, it should not be expanded to Island and also should not be expanded to Island OR Istanbul OR Islington, otherwise a query such as Angel Island would match Angel Islington.
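
The last hazard can be sketched with a toy model of synonym groups, where every member of a group is treated as equivalent. This is only an illustration of the failure mode, not how Elasticsearch synonym filters are implemented; `canonical`, `same_name` and `OVER_EAGER` are hypothetical names.

```python
def canonical(token, synonym_groups):
    """Map a token to the first member of its synonym group, if it has one."""
    token = token.lower()
    for group in synonym_groups:
        if token in group:
            return group[0]
    return token


def same_name(a, b, synonym_groups=()):
    """Two names are considered equal when their canonical tokens are equal."""
    tokens_a = [canonical(t, synonym_groups) for t in a.split()]
    tokens_b = [canonical(t, synonym_groups) for t in b.split()]
    return tokens_a == tokens_b


# Expanding the partial token to every word it could become makes
# all of those words synonyms of each other.
OVER_EAGER = [("is", "island", "istanbul", "islington")]

print(same_name("Angel Island", "Angel Islington"))
print(same_name("Angel Island", "Angel Islington", OVER_EAGER))
```

With no synonym groups the two names stay distinct. With the over-eager group they collapse to the same canonical form, which is the Angel Island / Angel Islington collision described above.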

In order to make the correct expansion, we need more context, it looks like at a minimum the analyzer will need to be aware of:

as it is currently, the elasticsearch analysers are not intelligent enough to perform this sort of substitution. the level of logical if-else-then type operations would suggest that this sort of complex analysis would be best handled by a 'proper programming language', which would also take care of testing the edge cases and providing a level of assurance of each substitution's 'safety' (in context).

the simplest solution, for now, is to request that source data is provided in its expanded form, this should be considered best practice due to the reasons mentioned above.

missinglink commented 7 years ago

note: there are existing tools such as libpostal which have locale awareness; tools of this sort can be used to expand the source data abbreviations before they are sent to elasticsearch for indexing.
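
A minimal sketch of expanding abbreviations before indexing is shown below. It does not use libpostal or its API; a small hand-written table stands in for a real locale-aware tool. The table contents, the `expand_street_suffix` name, and the rule of only expanding the final token are all assumptions made for illustration.

```python
# Hypothetical per-locale expansion table; a real tool would be far more complete.
SUFFIX_EXPANSIONS = {
    "en": {"st": "street", "ave": "avenue"},
    "de": {"st": "straße"},
}


def expand_street_suffix(name, locale):
    """Expand an abbreviated street suffix, but only in the final position.

    Restricting expansion to the last token (an assumption of this sketch)
    keeps a leading 'St' as in 'St Pauls' from being turned into 'Street'.
    """
    tokens = name.split()
    if len(tokens) < 2:
        return name
    table = SUFFIX_EXPANSIONS.get(locale, {})
    last = tokens[-1].lower()
    if last in table:
        tokens[-1] = table[last].capitalize()
    return " ".join(tokens)


print(expand_street_suffix("Main St", "en"))
print(expand_street_suffix("Main St", "de"))
print(expand_street_suffix("St Pauls", "en"))
```

Running the expansion in the import pipeline keeps the if-else logic out of the elasticsearch analysers and makes each rule easy to unit test.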

orangejulius commented 5 years ago

This is fixed and both queries listed above now return the same (correct) results. It's likely been fixed since https://github.com/pelias/schema/pull/310