pelias / api

HTTP API for Pelias Geocoder
http://pelias.io
MIT License

Searching for statue of liberty in Russian should find Statue of Liberty #127

Open missinglink opened 9 years ago

missinglink commented 9 years ago

currently /search only searches against name.default; it would be better if the search was performed against any of the name[*] values.

rel: pelias/pelias#83 re: pelias/schema#49

missinglink commented 9 years ago

hey @dianashk, I was thinking about this again and thought I should mention that although we match on all name.* fields, only name.default is returned to the user and displayed in the front-end.

in effect this means that a record like [1], which has the following tags, will be searchable via the Russian name but will return the English name to the user:

{
  'name': 'Trafalgar Square',
  'name:ru': 'Трафальгарская площадь'
}

the above is not too big a deal as we are providing an English-only service at the moment; there are also examples like the mayor's office in London [2] with the following tags:

{
  'name': '30 St Mary Axe',
  'loc_name': 'The Gherkin'
}

the building itself is called '30 St Mary Axe' but absolutely everyone affectionately refers to it as 'The Gherkin' [3], so if you searched for 'The Gherkin', your result would match but return '30 St Mary Axe'.

/suggest already has the same behaviour here: https://pelias.mapzen.com/suggest?input=The%20Gherkin&lat=51.53177&lon=-0.06672&size=10&zoom=18

so.. yea, not sure if it's a big deal, what are your thoughts?

[1] https://www.openstreetmap.org/relation/3962877
[2] https://www.openstreetmap.org/way/4959489
[3] gherkin

dianashk commented 9 years ago

@missinglink, I agree it's odd from the user's perspective. We should open a separate ticket for that change though and think through the implications and correct expected behavior.

related to #137

dianashk commented 9 years ago

Reverted the changes on prod because tests were failing. Needs further investigation.

vesameskanen commented 8 years ago

Issue #70, 'search against all name[*] properties', was reverted at some point. Do you have any plans to implement multi-language search again? This is of high importance to us, and we would like to contribute here if possible. It would be interesting to know how the developers who know Pelias thoroughly foresee this eventually being handled.

Also, it would be useful to expand the language support to all admin levels, not only names. For example:

Viipurinkatu, Alppila, Helsinki
Viborgsgatan, Alphyddan, Helsingfors

are both valid searches here.

I'd highly appreciate any guidance and thoughts on these issues before we try to address the topic ourselves.

missinglink commented 8 years ago

hi @vesameskanen, there has been a lot of discussion about alternate names and internationalization over the last year or so; the main points of contention remain:

1) data

translations do not exist for all labels; this means that while we could 'prefer' English and support other languages, the resulting labels would be a mixed bag of English and another language, e.g. "Трафальгарская площадь, London, United Kingdom", where the venue name is Russian and the region names are English; this feels wrong to me.

the task of expanding our collection of non-English labels for administrative areas is currently being addressed as we switch from Quattroshapes to "Who's on First" for our geographic polygon data; the latter is better suited to collating translations and collaborating on producing them where they don't yet exist.

we do currently import a range of names from openstreetmap data (more info) but, as you rightly pointed out, the ability to retrieve them has been disabled: originally due to performance, and it remains so for all the reasons I mentioned here (we'd love to solve this problem).

2) behaviour

the system was originally designed with one label per document; we would need to modify the logic in order to dynamically generate the label, most likely depending on the user's browser locale.

when requests come from the browser they send a header such as accept-language: en-GB,en;q=0.8, which could be used to format labels appropriately.
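to make the idea concrete, here is a minimal sketch of parsing such a header into language tags ranked by their q-values (this is an illustration, not the actual Pelias middleware; the function name is made up):

```javascript
// Minimal sketch: parse an Accept-Language header into language tags
// ordered by q-value (missing q defaults to 1.0), which could then be
// used to pick a label language for the response.
function parseAcceptLanguage(header) {
  return header
    .split(',')
    .map((part) => {
      const [tag, qPart] = part.trim().split(';');
      const q = qPart ? parseFloat(qPart.trim().replace('q=', '')) : 1.0;
      return { tag: tag.trim(), q };
    })
    .sort((a, b) => b.q - a.q)
    .map((entry) => entry.tag);
}

// parseAcceptLanguage('en-GB,en;q=0.8,ru;q=0.5')
// → ['en-GB', 'en', 'ru']
```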

the other option would be to provide a 'filter' mechanism: for instance, you could ask for only Thai data. in this case the responses would be limited to content in the Thai language, so the experience would be poor in (for instance) the Western world. it's similar to browsing wikipedia in Swahili: not all articles exist.

3) performance

we originally disabled the feature due to performance reasons, since then we've made significant improvements in performance, especially in the area of autocomplete.

when adding hundreds of millions of new records it's an important consideration; simply searching all the content is not going to work at scale (think 1B+ records per keypress).

we're working on a new language-detection algorithm which should allow us to determine the input language and then target the query against only the documents which match that language.
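a rough sketch of that targeting idea (the function and field layout are assumptions for illustration, not the actual implementation):

```javascript
// Hypothetical sketch: once the input language has been detected, query
// only the matching name field plus name.default, rather than fanning
// out across every name.* field on every document.
function buildNameQuery(text, detectedLang) {
  const fields = ['name.default'];
  if (detectedLang) {
    fields.push(`name.${detectedLang}`);
  }
  return {
    multi_match: {
      query: text,
      fields: fields
    }
  };
}

// buildNameQuery('Трафальгарская площадь', 'ru')
// → queries only name.default and name.ru
```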

4) analysis

language analysis techniques vary greatly between languages. we've tried a bunch of different linguistic analysis techniques in the past, including stemming, and in the end we found that we don't need full language-specific stemmers (such as snowball), which is good news for performance.

what we do need is 'cultural awareness' in terms of postal formats, street suffixes etc., which makes the product feel more personal/local for non-English speakers.


as you can see there is a fair bit of work to be done; we'd be really happy to collaborate on discussing some of these issues in more detail. it's unlikely that the core team will do a development cycle on internationalization in the first half of 2016, but we are more than happy to start discussing it.

dianashk commented 8 years ago

Hey @vesameskanen, we're really excited that you are looking to collaborate with us on this feature. Hopefully @missinglink has laid out some of the highlights from our previous attempts and discussions. These are all points of discussion so please weigh in on anything mentioned above.

With the previous approach we saw a lot of noise results and had to revert the change until it could be examined closer.

We've actually been considering picking up the Alt-Names milestone for Q2 of 2016, along with We're Not Alone, so your interest in this functionality is quite timely. :simple_smile:

vesameskanen commented 8 years ago

Hello @missinglink and @dianashk

Many thanks for the detailed and helpful answers. I agree with the validity of the statements above in a general search situation. However, Pelias is also a great tool for local geocoding applications, where we don't have as many performance or data-quality concerns. So, it would be great if the behavior of Pelias searches could be configured to be optimal in such cases.

Using the language of the browser, or perhaps an explicit language selector in the host app, is an awkward solution. We know that Pelias could find all language versions of names, so why make that an extra concern for the user? At a local scale, where we need to support only a couple of languages, automatic search across all languages seems to be the best solution.

So, perhaps the right direction is to add more configurability to the query layer. @dianashk, I am still new to Pelias, but I am happy to contribute if I can.

vesameskanen commented 8 years ago

Some further thoughts:

vesameskanen commented 8 years ago

I created an api branch which 'jsonifies' query building, and added a little 'multingram' view to the query library. These changes are currently on the master branches of my github repos (api, query).

So, unit tests pass as before, but using my custom pelias config I can set up a new kind of multi-match query.

If you have time, please take a look and let me know if you find such query configurability interesting, or if there are potential problems/areas to improve in this approach.

PS. Our 3500-location fuzzy test with the original default names gave slightly better scores with the multingram search. Of course, searches with translated names improved radically.

dianashk commented 8 years ago

Hey @vesameskanen, thanks for letting us know about your experiments. It's exciting to hear that your test results have improved since making the changes. As mentioned before, we're planning to tackle alt-name search in a few weeks, so we'll definitely check out what you've done and provide feedback soon.

Cheers!

mihneadb commented 5 years ago

@missinglink WDYT about a small improvement: relying on the lang parameter when it is provided? We could extend the queries to do an or between name.$LANG and name.default whenever lang is present. I'm thinking this wouldn't hurt perf as much as going for name.*. Thanks!

missinglink commented 5 years ago

Yes I think this is a good idea.

The name.$LANG fields predated the language detection middleware and that's the only reason it hasn't been done.

There was also an issue regarding performance but I agree that checking two fields should not have a massive performance penalty.

missinglink commented 5 years ago

I have been looking at the elasticsearch docs for multi_match queries again recently for another feature and I found them to be very capable and configurable.

missinglink commented 5 years ago

Also worth mentioning that this thread is pretty old and some of the concerns from 2+ years ago are either solved or less relevant.

missinglink commented 5 years ago

I'm going to put my name on this to track it but I probably won't have time to do a PR in the near future. @mihneadb if you'd like to open a PR then I'd be happy to review it and help you get it merged :)

mihneadb commented 5 years ago

@missinglink sure, I'll take a look. Can you please give me some pointers, though? From what I can understand, I'd probably have to make some changes to https://github.com/pelias/query/blob/e141f2bbcb9989548ba6540f1daa820199a3b9b4/view/phrase.js? I'm not sure about the "api design". Do we want phrase:field to support a list of strings, and to detect that in phrase.js and act accordingly?

Also - according to this, multi_match does not support fuzziness, so we cannot use it. I'm thinking we can do a bool with should and have a list of two identical match queries just with different fields (name.default, name.$LANG). Wdyt?
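A minimal sketch of that bool/should shape (a hypothetical helper for illustration only, not the actual Pelias query builder):

```javascript
// Sketch of the proposal: two identical match clauses that differ only
// in the field queried (name.default vs name.$LANG), combined under a
// bool/should so either field can satisfy the query.
function boolShouldNameQuery(text, lang) {
  const matchOn = (field) => ({
    match: { [field]: { query: text, fuzziness: 1 } }
  });
  return {
    bool: {
      should: [matchOn('name.default'), matchOn(`name.${lang}`)],
      minimum_should_match: 1
    }
  };
}
```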

missinglink commented 5 years ago

I don't think that is correct, multi_match does support fuzziness and also phrase but not both at the same time.

Also, accepts analyzer, boost, operator, minimum_should_match, fuzziness, lenient, prefix_length, max_expansions, rewrite, zero_terms_query, cutoff_frequency, auto_generate_synonyms_phrase_query and fuzzy_transpositions, as explained in match query.

missinglink commented 5 years ago

As for where to start, I would suggest having a look at the queries which are being generated. You can use the ?debug=true flag on the compare app; if you scroll down you'll see the elasticsearch query (or queries, if the first one returned 0 results).

The example above, Трафальгарская площадь, is an interesting case because it's a street; we currently don't support different languages on the street field as we do for name, so this might be more complex.

I'd start with something simpler, like Лондон which is a proper name.

Although this query already seems to work correctly: https://pelias.github.io/compare/#/v1/search%3Fdebug=true&lang=ru&text=%D0%9B%D0%BE%D0%BD%D0%B4%D0%BE%D0%BD

What is the test case you have in mind that you'd like to fix?

missinglink commented 5 years ago

There are actually two ways that alternative names are specified in Pelias. First, there are the name.* fields, which we currently don't query on; these fields contain the proper name of the place.

Then there is a concept of aliases which we introduced last year, any name field or address field can have multiple strings.

We use the first string for each field as the display string and we also use the other variants for improved matching.

To make it even more confusing, we send queries deemed 'admin-only' to placeholder (which you can also see in the debug view), this seems to be the reason the Лондон example is succeeding.

missinglink commented 5 years ago

Can we please start with a list of like 10 failing test cases? From there we can investigate further and map out a plan.

mihneadb commented 5 years ago

I don't think that is correct, multi_match does support fuzziness and also phrase but not both at the same time.

Also, accepts analyzer, boost, operator, minimum_should_match, fuzziness, lenient, prefix_length, max_expansions, rewrite, zero_terms_query, cutoff_frequency, auto_generate_synonyms_phrase_query and fuzzy_transpositions, as explained in match query.

Right, and the autocomplete query uses phrase already. Don't we want to keep that?

test case in mind

Similar example, also Russian - the Menshikov Palace in St Petersburg. The OSM entry has names in several languages for it. Here is the compare link (first result). I'd like to be able to find it while searching in English (name.en is what I'd search).

missinglink commented 5 years ago

We would want to maintain the existing query logic for now (and not change too much in one PR), so if a query type is phrase it should stay as phrase under multi_match.

If there is a query with both phrase and fuzziness specified then I believe that does not actually do what it sounds like. More info here: https://github.com/pelias/api/pull/1268

missinglink commented 5 years ago

That's a great testcase: /v1/search?lang=en&text=Menshikov Palace should return openstreetmap:venue:way/216655164

missinglink commented 5 years ago

So the subqueries would need to be changed from:

{
  "match": {
    "phrase.default": {
      "analyzer": "peliasPhrase",
      "type": "phrase",
      "boost": 1,
      "slop": 2,
      "query": "Menshikov Palace",
      "cutoff_frequency": 0.01
    }
  }
}

... to something like this:

{
  "multi_match": {
    "fields": ["phrase.default", "phrase.en"],
    "analyzer": "peliasPhrase",
    "type": "phrase",
    "boost": 1,
    "slop": 2,
    "query": "Menshikov Palace",
    "cutoff_frequency": 0.01
  }
}

note: I didn't test this, the syntax might be wrong

missinglink commented 5 years ago

We already use this pattern for admin matching:

Oh, actually the boost param is probably not valid for multi_match; you can use ^ instead to set boosts per field.
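For illustration, the caret-suffix form might look roughly like this (field names reused from the snippet above; an untested sketch, as before):

```javascript
// Per-field boosting in multi_match uses a caret suffix on the field
// name rather than a top-level boost param, e.g. weighting
// phrase.default twice as heavily as phrase.en:
const query = {
  multi_match: {
    query: 'Menshikov Palace',
    type: 'phrase',
    fields: ['phrase.default^2', 'phrase.en']
  }
};
```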

mihneadb commented 5 years ago

We would want to maintain the existing query logic for now (and not change too much in one PR) So if a query type is phrase it should stay as phrase under multi_match.

If there is a query with both phrase and fuzziness specified then I believe that does not actually do what it sounds like. More info here: #1268

The reason why I went down that path is because of the code here: https://github.com/pelias/query/blob/e141f2bbcb9989548ba6540f1daa820199a3b9b4/view/phrase.js#L25 Looks like the code already puts phrase & fuzziness together. Or am I missing something? :S

missinglink commented 5 years ago

I believe that if you specify both phrase and fuzziness in the same query then the fuzziness has no effect at all.

It's super confusing and IMHO is not very well documented by elasticsearch that this is the behaviour.

missinglink commented 5 years ago

And yes, we seem to do that, which is wrong.

mihneadb commented 5 years ago

Right.

I have an unanswered question from earlier:

Not sure about the "api design". Do we want phrase:field to support a list of strings? And detect that in phrase.js and act accordingly?

Trying to figure out all the open questions to be able to come up with a precise scope for this PR. Thanks!

bboure commented 5 years ago

Hi,

It's worth mentioning that, apart from the name.[LANG] fields that are currently not being matched, there is also sometimes a name.old field that might be worth taking into consideration. Currently, looking for a place by its former name gives no results.

See this example:

New name: Torre Glories
Old name: Torre Agbar
OSM reference: https://www.openstreetmap.org/way/44213122