pelias / pelias

Pelias is a modular open-source geocoder using Elasticsearch.
https://pelias.io
MIT License

autocomplete results should weight feature priority over string similarity #143

Closed bdon closed 8 years ago

bdon commented 9 years ago

When I type "London", I feel like it should start appearing first in autocomplete results for [Lon,Lond,Londo] even though there are other results that offer exact name matches.

missinglink commented 9 years ago

hey @bdon, this is really the $1M question and it's a big UX decision either way

we've had this discussion in our team before in different forms, eg @dianashk would like to see her local Chinese restaurant 'didi dumpling' in the results for the text 'didi'.

the $1M question is basically "at what point do important (think local, popular) places with poorer linguistic matching score higher than places with better linguistic matching?"

so, let's have this discussion in the open, as a community and we can decide what we like better. and I'll just add that there is no real 'right' or 'wrong' answer here, just what 'most people prefer'.

What we currently have configured is that:

  1. better linguistic matches come higher in the list (matching words gets a higher boost than characters)
  2. 'important' places get a boost (only when they match at least one word, not just one or two characters)
  3. 'nearer' places get a boost (only when they match at least one word, not just one or two characters)

an example of 1) is text=Bulls:

 1) Bulls, Rangitikei District, New Zealand
 2) Bulles, BULLES, France

an example of 2) is text=London:

 1) London, Greater London, United Kingdom
 2) London, Middlesex, Canada

an example of 3) is focus.point.lat=52.5&focus.point.lon=13.3&text=hard rock cafe:

 1) Hard Rock Cafe Berlin, Berlin, Germany
 2) Hard Rock Cafe, Warszawa, Poland

what you are noticing is correct: for /autocomplete, when you only partially specify a word token, you don't see the effects of 2) or 3). We used to apply those boosts to partial matches too, but the performance suffered as a result of doing all those calculations for all the partially matching places.
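The three rules and the partial-token behaviour described above can be sketched as a toy scoring function. This is only an illustration, not the actual Pelias query: the `Place` fields, the `importance` value and the weights (10, 1, 5) are all assumptions made up for the example, and distance is measured crudely in degrees.

```python
from dataclasses import dataclass
from math import hypot
from typing import Optional, Tuple


@dataclass
class Place:
    name: str
    importance: float = 0.0  # hypothetical 0..1 popularity/population score
    lat: float = 0.0
    lon: float = 0.0


def score(query: str, place: Place,
          focus: Optional[Tuple[float, float]] = None) -> float:
    """Toy score: whole words beat characters; boosts need a whole-word match."""
    q_tokens = query.lower().split()
    p_tokens = place.name.lower().split()

    # Input tokens that equal a whole word of the place name.
    whole = sum(1 for t in q_tokens if t in p_tokens)
    # Input tokens that are only the beginning of a word of the place name.
    partial = sum(
        1 for t in q_tokens
        if t not in p_tokens and any(p.startswith(t) for p in p_tokens)
    )
    if whole == 0 and partial == 0:
        return 0.0

    # Rule 1: a matching word is worth more than matching characters.
    total = 10.0 * whole + 1.0 * partial

    # Rules 2 and 3: importance and proximity boosts are applied only
    # when at least one whole word matches (arbitrary example weights).
    if whole > 0:
        total += 5.0 * place.importance
        if focus is not None:
            distance = hypot(place.lat - focus[0], place.lon - focus[1])
            total += 5.0 / (1.0 + distance)
    return total
```

Under this sketch, text=lond scores a place named Lond above a much more "important" London, because London is only a partial match and so never receives its importance boost.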

based on the 3 configurations I mentioned above you can see that for searches like text=lond, you get:

 1) Lond, Balochistān, Pakistan
 2) Lönd, East, Iceland

this is presumably because Lond, Balochistān, Pakistan is more 'important' (based on population and any popularity data we have, such as checkins, photo counts etc.)


so I actually totally agree with you that it would be great to see London, UK and London, ON in the list, probably at the top, but doing so would have a couple of side effects:

a good example of this is text=londo:

 1) Londo, Équateur, Democratic Republic of the Congo
 2) Londo, Bandundu, Democratic Republic of the Congo
 3) Londo, Bandundu, Democratic Republic of the Congo
 4) Londo, Katanga, Democratic Republic of the Congo
 5) Londo, Eastern Province, Democratic Republic of the Congo
 6) Londo, Nkhata Bay District, Malawi
 7) Londo, Orientale, Democratic Republic of the Congo
 8) Londo, Katanga, Democratic Republic of the Congo
 9) Londo, Shan, Myanmar
10) Londo, Maniema, Democratic Republic of the Congo

.. it seems there are loads of places called Londo in the world ;) and so 9) and 10) would be lost, never to be seen again. This is actually what the "Big G" does, and people seem to prefer it.
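A minimal sketch of that side effect, assuming a fixed page of 10 results (the length of the list above). The function name and the placeholder labels are hypothetical; the point is only that every partial match promoted to the top pushes one exact match off the end.

```python
def first_page(boosted, exact, page_size=10):
    """Put boosted results first, then exact matches, and cut to one page.

    Returns (shown, dropped). page_size=10 is an assumption for the example.
    """
    ranked = list(boosted) + [r for r in exact if r not in boosted]
    return ranked[:page_size], ranked[page_size:]


# Placeholder labels standing in for the ten "Londo" entries listed above.
londos = [f"Londo {i})" for i in range(1, 11)]
shown, dropped = first_page(["London, UK", "London, ON"], londos)
# dropped now holds the last two exact matches, "Londo 9)" and "Londo 10)".
```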

begs the question: "are we building this for metro residents in major cities/countries; or for everyone", and that's the crux of the $1M question.


long post sorry but I'll just add one more thing about local places with a poor linguistic match: if we were to boost local places whose names contain the input phrase higher, it might result in something like this for focus.point.lat=40.7&focus.point.lon=-73.9&text=london:

 1) London NYC, Manhattan, NY
 2) New London Pharmacy, Manhattan, NY
 3) London Planetree Park, Queens, NY
 4) London Planetree Park, Queens, NY
 5) Long Cove, Brookhaven, NY

orangejulius commented 9 years ago

Is it possible to tweak the definition of "good linguistic match" used by Elasticsearch to put less value on whole word matches? If so it would be cool to experiment with this for the /autocomplete endpoint only, and leave /search as it is.

The rationale for this is that with autocomplete, it's a valid assumption that the user might not be done typing yet. In other words:

for /search?text=londo you are saying you want Londo

for /autocomplete?text=londo you might want Londo, you might want London, you might want London Cleaners
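The difference between the two endpoints could be sketched like this. It is a simplified illustration with hypothetical function names, not how Pelias or Elasticsearch actually match text: the literal variant requires every input token to be a whole word of the name, while the typeahead variant treats the last input token as possibly unfinished.

```python
def search_matches(text, names):
    """Literal matching: every input token must equal a whole word of the name."""
    tokens = text.lower().split()
    if not tokens:
        return []
    return [n for n in names if all(t in n.lower().split() for t in tokens)]


def autocomplete_matches(text, names):
    """Typeahead matching: the last input token may be an unfinished prefix."""
    tokens = text.lower().split()
    if not tokens:
        return []
    *done, last = tokens
    matched = []
    for n in names:
        name_tokens = n.lower().split()
        finished_ok = all(t in name_tokens for t in done)
        prefix_ok = any(p.startswith(last) for p in name_tokens)
        if finished_ok and prefix_ok:
            matched.append(n)
    return matched


names = ["Londo", "London", "London Cleaners", "Long Cove"]
literal = search_matches("londo", names)          # only "Londo"
typeahead = autocomplete_matches("londo", names)  # "Londo", "London", "London Cleaners"
```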

missinglink commented 9 years ago

the good news is that all of these settings and queries are configurable when deploying an instance of the pelias API layer, so if other businesses/individuals don't like the 'main' distribution they are able to build their own.

AFAIK this has never been possible in the past and so that is a huge win for the community, you can actually "build your own geocoder" now, which is awesome!

having said that we need to choose our canonical version of the settings/queries, what the core team feels are 'sane defaults'. these settings will benefit people instantly as they will be available via the mapzen search pelias cluster (search.mapzen.com). Organisations wouldn't have to build their own, so long as they are sufficiently happy with the settings we use.

I think @orangejulius is correct in saying that the UX for /search and /autocomplete are actually very different and that most users expect /autocomplete to be intuitive, while /search is expected to be more literal.

I guess the idea here is that if you don't find something in the typeahead, hit <ENTER> and you'll see a different set of responses for the same input; results which are less intuitive but more literal.

The more literal nature of /search is also important for automated scripting, batch geocoding etc. where returning London, UK for ?text=Lond would be a big no-no.

Let's experiment with what you suggested @orangejulius. The change itself is 2 lines of code (need to change view.phrase => view.ngrams) but the effects will be significant. I'm building some other stuff on the dev cluster right now but we could try this nearer the end of the week!
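The thread does not spell out what the two views do, so the following is a sketch under an assumption: that the phrase view matches input against whole name tokens, while the ngrams view matches against prefix n-grams of each token. All function names here are hypothetical and the code only illustrates why such a switch would let "londo" match "London".

```python
def edge_ngrams(token, min_len=1):
    """All prefixes of a token, e.g. 'lon' -> ['l', 'lo', 'lon']."""
    return [token[:i] for i in range(min_len, len(token) + 1)]


def index_name(name):
    """Return (whole tokens, prefix n-grams) for a place name."""
    tokens = name.lower().split()
    grams = {g for t in tokens for g in edge_ngrams(t)}
    return set(tokens), grams


def matches(text, name, use_ngrams):
    """True if every input token is found in the chosen view of the name."""
    tokens, grams = index_name(name)
    haystack = grams if use_ngrams else tokens
    query = text.lower().split()
    return bool(query) and all(t in haystack for t in query)


phrase_hit = matches("londo", "London", use_ngrams=False)  # whole words only
ngram_hit = matches("londo", "London", use_ngrams=True)    # prefixes allowed
```

With whole tokens, "londo" does not match "London"; with prefix n-grams it does, which is the more forgiving behaviour discussed for /autocomplete.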

my general feeling is that this should be easily tunable in the future, I could imagine that when building a service for only openstreetmap or only openaddresses they may prefer a more literal /autocomplete.

tuukka commented 9 years ago

I'm thinking the same as @orangejulius. Regarding the autocomplete UI, I imagine there could be a search button to perform /search, or there could be one entry you could select saying "No, I really meant Londo not London".

macolu commented 8 years ago

Hi

I just tried to play with autocomplete results on search.mapzen.com.

I think the current behavior is not intuitive, because a specific place that you see in the prediction list can disappear when you keep on typing its name.

Let's use an example: Paris, France. Let's assume that I live in Paris, so my focus point is a lat/lng very close to Paris.

I start typing Par: https://search.mapzen.com/v1/autocomplete?api_key=KEY&text=Par&focus.point.lat=48.85&focus.point.lon=2.3

First two results are:

(BTW, Paris should probably be first, since it's closer, but this is not the subject here)

Now, I keep on typing Pari: https://search.mapzen.com/v1/autocomplete?api_key=KEY&text=Pari&focus.point.lat=48.85&focus.point.lon=2.3

Now, Paris has disappeared from the result list. Instead, I can see all the places named Pari all over the Earth.

I think that this is wrong. If I keep on typing a name that is returned, it should not disappear.
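The expectation above can be written down as a small check. This is a sketch with hypothetical helper names: `autocomplete_url` rebuilds the request URLs shown above (the endpoint and parameter names are taken from them, "KEY" is a placeholder), and `vanished` takes result names per typed input, however you collect them, and reports any name that drops out even though the longer input is still a prefix of it.

```python
from urllib.parse import urlencode

AUTOCOMPLETE_URL = "https://search.mapzen.com/v1/autocomplete"


def autocomplete_url(text, lat, lon, api_key="KEY"):
    """Build an /autocomplete request URL with a focus point."""
    params = {
        "api_key": api_key,
        "text": text,
        "focus.point.lat": lat,
        "focus.point.lon": lon,
    }
    return f"{AUTOCOMPLETE_URL}?{urlencode(params)}"


def vanished(results_by_text):
    """Report (input, name) pairs where a result disappears while being typed.

    results_by_text maps each typed input to the list of result names
    returned for it, in typing order (e.g. "Par", then "Pari").
    """
    problems = []
    inputs = list(results_by_text)
    for shorter, longer in zip(inputs, inputs[1:]):
        for name in results_by_text[shorter]:
            still_a_prefix = name.lower().startswith(longer.lower())
            if still_a_prefix and name not in results_by_text[longer]:
                problems.append((longer, name))
    return problems
```

An empty list from `vanished` means no result was lost while the user kept typing its name.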

I haven't set up an instance to play with settings yet, I'll try soon.

riordan commented 8 years ago

@macolu We've been focusing a lot on this particular issue with autocomplete lately and have seen some dramatic improvements in this category. We'd love it if you gave us another try and let us know if you see improvement.

riordan commented 8 years ago

This overall issue is one of the hardest things in building a great general-purpose geocoder.

We'll probably return to this ticket a few times over the next few months as we define the different parts of shifting weighting between feature priority vs string similarity and break it out into a bunch of sub-tickets.

macolu commented 8 years ago

Good news, thank you!

I'll have a look soon.

missinglink commented 8 years ago

Moving to 'in review', some significant work has been done in this area which requires review and regression test cases to be created.

Moving this to 'in review' to be tested with the rest of the Autocomplete Improvements milestone.

missinglink commented 8 years ago

I am much happier with how this behaves now, waiting for the latest bunch of dev changes to make their way to production before merging this.

re: acceptance tests, loads have been added since October and I've just added some more for queries like "Londo" and "Pari", both return the major city as their top result.

dianashk commented 8 years ago

@macolu, looks like it's pretty good in production now. Lond shows London first! :tada: If you have any more concerns about this let us know. Closing this issue.