Open missinglink opened 9 years ago
hey @dianashk I was thinking about this again and I thought I should mention that although we match on all `name.*` fields, only `name.default` is returned to the user and displayed in the front-end.
in effect this means that a record like [1], which has the following tags, will be searchable via the Russian name but will return the English name to the user:
```
{
  'name': 'Trafalgar Square',
  'name:ru': 'Трафальгарская площадь'
}
```
the above is not too big a deal as we are providing an English-only service at the moment; there are also examples like the Mayor's office in London [2] with the following tags:
```
{
  'name': '30 St Mary Axe',
  'loc_name': 'The Gherkin'
}
```
the building itself is called '30 St Mary Axe' but absolutely everyone affectionately refers to it as 'The Gherkin' [3], so if you searched for 'The Gherkin', your result would match and return '30 St Mary Axe'.
`/suggest` already has the same behaviour here: https://pelias.mapzen.com/suggest?input=The%20Gherkin&lat=51.53177&lon=-0.06672&size=10&zoom=18
so.. yea, not sure if it's a big deal, what are your thoughts?
[1] https://www.openstreetmap.org/relation/3962877 [2] https://www.openstreetmap.org/way/4959489 [3]
@missinglink, I agree it's odd from the user's perspective. We should open a separate ticket for that change though and think through the implications and correct expected behavior.
related to #137
Reverted the changes on prod because tests were failing. Needs further investigation.
Issue #70 'search against all name[*] properties' was reverted at some point. Do you have any plans to implement multi-language searches again? This has high importance for us, and we would like to contribute here if possible. It would be interesting to know how developers, who know Pelias thoroughly, foresee the way this will be eventually handled.
Also, it would be useful to expand the language support to all admin levels, not only names. For example:
Viipurinkatu, Alppila, Helsinki
Viborgsgatan, Alphyddan, Helsingfors
are both valid searches here.
I'd highly appreciate any available guidance and thoughts on these issues before we try to address the topic ourselves.
hi @vesameskanen there has been a lot of discussion about alternate names and internationalization over the last year or so, the main points of contention remain:
1) data
translations do not exist for all labels; this means that we could 'prefer' English while supporting other languages, but the resulting labels would be a mixed bag of English and another language, eg. "Трафальгарская площадь, London, United Kingdom", where the venue name is Russian and the region name English, which feels wrong to me.
the task of expanding our collection of non-English labels for administrative areas is currently being addressed as we switch from Quattroshapes to "Who's on First" for our geographic polygon data; the latter is better suited to collating translations and collaborating on getting them done where they currently don't exist.
we do currently import a range of names from openstreetmap data (more info) but, as you rightly pointed out, the ability to retrieve them has been disabled, originally due to performance, and remains so due to all the reasons I mentioned here (we'd love to solve this problem).
2) behaviour
the system was originally designed with one label per document; we would need to modify the logic in order to dynamically generate the label, most likely depending on the user's browser locale.
when requests come from the browser they send a header such as `accept-language: en-GB,en;q=0.8`, which could be used to format labels appropriately.
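a minimal sketch of how such a header could drive label selection, assuming a hypothetical `parseAcceptLanguage` helper (this is not actual Pelias code):

```javascript
// Parse an Accept-Language header into language tags ordered by
// preference (q-value); tags without an explicit q default to 1.0.
// 'parseAcceptLanguage' is a hypothetical helper for illustration.
function parseAcceptLanguage(header) {
  return header
    .split(',')
    .map((part) => {
      const [tag, q] = part.trim().split(';q=');
      return { tag: tag.toLowerCase(), q: q ? parseFloat(q) : 1.0 };
    })
    .sort((a, b) => b.q - a.q)
    .map((entry) => entry.tag);
}

// 'en-GB' outranks 'en' because it carries the implicit q=1.0
parseAcceptLanguage('en-GB,en;q=0.8'); // => ['en-gb', 'en']
```

the first tag in the result could then be matched against the available `name.*` translations when formatting a label.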
the other option would be to provide a 'filter' mechanism, for instance you could only ask for Thai data. in this case the responses would be limited to only content in the Thai language and so the experience would be poor in the Western world (for instance). it's similar to browsing wikipedia in Swahili, not all articles exist.
3) performance
we originally disabled the feature due to performance reasons, since then we've made significant improvements in performance, especially in the area of autocomplete.
when adding hundreds of millions of new records it's an important consideration; simply searching all the content is not going to work at scale (think 1B+ records per keypress).
we're working on a new language-detection algorithm which should allow us to determine the input language and then target the query against only the documents which match that language.
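the targeting idea could look something like the sketch below; `buildTargetedQuery` is purely hypothetical and stands in for whatever the detection algorithm feeds into query construction:

```javascript
// Sketch: once the input language is detected, query only the matching
// field rather than fanning out across every name.* field.
// 'buildTargetedQuery' is a hypothetical helper, not actual Pelias code.
function buildTargetedQuery(text, detectedLang) {
  const field = detectedLang ? `name.${detectedLang}` : 'name.default';
  return { match: { [field]: { query: text } } };
}

buildTargetedQuery('Лондон', 'ru');
// => { match: { 'name.ru': { query: 'Лондон' } } }
```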
4) analysis
language analysis techniques vary greatly between languages. we've tried a bunch of different linguistic analysis techniques in the past, including stemming, and in the end we found that we don't need full language-specific stemmers (such as snowball), which is good news for performance.
what we do need is 'cultural awareness' in terms of postal formats, street suffixes etc. which make the product more personal/local for non-English speakers.
as you can see there is a fair bit of work to be done, and we'd be really happy to collaborate on discussing some of these issues in more detail. it's unlikely that the core team is going to do a development cycle on internationalization in the first half of 2016 but we are more than happy to start discussing it.
Hey @vesameskanen, we're really excited that you are looking to collaborate with us on this feature. Hopefully @missinglink has laid out some of the highlights from our previous attempts and discussions. These are all points of discussion so please weigh in on anything mentioned above.
With the previous approach we saw a lot of noise results and had to revert the change until it could be examined closer.
We've actually been considering picking up the Alt-Names milestone for Q2 of 2016, along with We're Not Alone, so your interest in this functionality is quite timely. :simple_smile:
Hello @missinglink and @dianashk
Many thanks for the detailed and helpful answers. I agree on the validity of the statements above in a general search situation. However, Pelias is a great tool for local geocoding applications too, where performance and data quality are much less of a concern. So it would be great if the behavior of Pelias searches could be configured to be optimal in such cases.
Using the language of the browser, or perhaps an explicit language selector in the host app, is an awkward solution. We know that Pelias could find all language versions of names, so why make that an extra concern for the user. At a local scale, where we need to support only a couple of languages, automatic search across all languages seems to be the best solution.
So, perhaps the right direction is to add more configurability to the query layer. @dianashk, I am still new to Pelias, but I am happy to contribute if I can.
Some further thoughts:
I created an api branch which 'jsonifies' query building, and added a little 'multingram' view to the query library. These changes are currently on master branches of my github repos (api, query).
So, unit tests work as before, but using my custom pelias config I can set up a new kind of multi-match query.
If you have time, please take a look and let me know if you find such query configurability interesting, or if there are potential problems/areas to improve in this approach.
PS. Our 3500-location fuzzy test with the original default names gave slightly better scores with the multingram search. Of course, searches with translated names improved radically.
Hey @vesameskanen, thanks for letting us know about your experiments. It's exciting to hear that your test results have improved since making the changes. As mentioned before, we're planning to tackle alt-name search in a few weeks, so we'll definitely check out what you've done and provide feedback soon.
Cheers!
@missinglink WDYT about a small improvement of relying on the `lang` parameter when it is provided? Extending the queries to do an `or` between `name.$LANG` and `name.default` when `lang` is provided. I'm thinking this wouldn't hurt perf as much as going for `name.*`.
Thanks!
Yes I think this is a good idea.
The `name.$LANG` fields predated the language detection middleware, and that's the only reason it hasn't been done.
There was also an issue regarding performance but I agree that checking two fields should not have a massive performance penalty.
I have been looking at the elasticsearch docs for `multi_match` queries again recently for another feature and I found them to be very capable and configurable.
Also worth mentioning that this thread is pretty old and some of the concerns from 2+ years ago are either solved or less relevant.
I'm going to put my name on this to track it but I probably won't have time to do a PR in the near future. @mihneadb if you'd like to open a PR then I'd be happy to review it and help you get it merged :)
@missinglink sure, I'll take a look. Can you please give me some pointers though? From what I can understand, I'd probably have to make some changes to https://github.com/pelias/query/blob/e141f2bbcb9989548ba6540f1daa820199a3b9b4/view/phrase.js?
Not sure about the "api design". Do we want `phrase:field` to support a list of strings? And detect that in `phrase.js` and act accordingly?
Also - according to this, `multi_match` does not support fuzziness, so we cannot use it. I'm thinking we can do a `bool` with `should` and have a list of two identical `match` queries just with different fields (`name.default`, `name.$LANG`). Wdyt?
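a sketch of that `bool`/`should` idea, with two identical `match` clauses differing only in field name (parameter values are illustrative, not the exact ones Pelias generates):

```javascript
// Build a bool/should query where a hit on either name.default or the
// language-specific field scores the document. Hypothetical sketch.
function boolShouldNameQuery(text, lang) {
  const clause = (field) => ({ match: { [field]: { query: text, fuzziness: 1 } } });
  return {
    bool: {
      should: [clause('name.default'), clause(`name.${lang}`)],
      minimum_should_match: 1
    }
  };
}
```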
I don't think that is correct; `multi_match` does support `fuzziness` and also `phrase`, but not both at the same time.
> Also, accepts analyzer, boost, operator, minimum_should_match, fuzziness, lenient, prefix_length, max_expansions, rewrite, zero_terms_query, cutoff_frequency, auto_generate_synonyms_phrase_query and fuzzy_transpositions, as explained in match query.
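for instance, `fuzziness` works with the default `best_fields` type (just not together with `type: 'phrase'`); field names and values here are illustrative:

```javascript
// A multi_match query combining several fields with fuzziness.
// This is a hedged sketch; the field list is illustrative.
const fuzzyMultiMatch = {
  multi_match: {
    query: 'The Gherkin',
    type: 'best_fields',
    fields: ['name.default', 'name.en'],
    fuzziness: 1
  }
};
```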
As for where to start, I would suggest having a look at the queries which are being generated. You can use the `?debug=true` flag on the compare app; if you scroll down you'll see the elasticsearch query (or queries, if the first ones returned 0 results).
The example above, `Трафальгарская площадь`, is an interesting case because it's a street; we currently don't support different languages on the street field as we do for name, so this might be more complex.
I'd start with something simpler, like `Лондон`, which is a proper name.
Although this query already seems to work correctly: https://pelias.github.io/compare/#/v1/search%3Fdebug=true&lang=ru&text=%D0%9B%D0%BE%D0%BD%D0%B4%D0%BE%D0%BD
What is the test case you have in mind that you'd like to fix?
There are actually two ways that alternative names are specified in Pelias. There are the `name.*` fields, which we currently don't query on; these fields contain the proper name of the place.
Then there is the concept of `aliases`, which we introduced last year; any name field or address field can have multiple strings.
We use the first string of each field as the display string, and we also use the other variants for improved matching.
To make it even more confusing, we send queries deemed 'admin-only' to placeholder (which you can also see in the debug view); this seems to be the reason the `Лондон` example is succeeding.
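the alias concept could be sketched as a document shape like the one below (hypothetical field contents, reusing the Gherkin example from earlier in the thread):

```javascript
// Hypothetical sketch of the 'aliases' concept: a name field holds an
// array of strings; the first is the display string, and the rest are
// additional variants used only to improve matching.
const doc = {
  name: {
    default: ['30 St Mary Axe', 'The Gherkin']
  }
};

const displayName = doc.name.default[0];  // shown to the user
const matchVariants = doc.name.default;   // all variants are searchable
```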
Can we please start with a list of like 10 failing test cases? From there we can investigate further and map out a plan.
> I don't think that is correct; `multi_match` does support `fuzziness` and also `phrase`, but not both at the same time.
>
> Also, accepts analyzer, boost, operator, minimum_should_match, fuzziness, lenient, prefix_length, max_expansions, rewrite, zero_terms_query, cutoff_frequency, auto_generate_synonyms_phrase_query and fuzzy_transpositions, as explained in match query.
Right, and the autocomplete query uses `phrase` already. Don't we want to keep that?
> test case in mind
Similar example, also Russian - the Menshikov Palace in St Petersburg. The OSM entry has names in several languages for it. Here is the compare link (first result). I'd like to be able to find it while searching in English (name.en is what I'd search).
We would want to maintain the existing query logic for now (and not change too much in one PR)
So if a query type is `phrase` it should stay as `phrase` under `multi_match`.
If there is a query with both `phrase` and `fuzziness` specified then I believe that it does not actually do what it sounds like.
More info here: https://github.com/pelias/api/pull/1268
That's a great testcase: `/v1/search?lang=en&text=Menshikov Palace` should return `openstreetmap:venue:way/216655164`.
So the subqueries would need to be changed from:
```json
{
  "match": {
    "phrase.default": {
      "analyzer": "peliasPhrase",
      "type": "phrase",
      "boost": 1,
      "slop": 2,
      "query": "Menshikov Palace",
      "cutoff_frequency": 0.01
    }
  }
}
```
... to something like this:
```json
{
  "multi_match": {
    "fields": ["phrase.default", "phrase.en"],
    "analyzer": "peliasPhrase",
    "type": "phrase",
    "boost": 1,
    "slop": 2,
    "query": "Menshikov Palace",
    "cutoff_frequency": 0.01
  }
}
```
note: I didn't test this, the syntax might be wrong
We already use this pattern for admin matching:
Oh actually, the `boost` param is probably not valid for `multi_match`; you can use the `^` syntax instead to set boosts per field.
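for example, the per-field `^` syntax would look something like this (weights are illustrative, not tuned values):

```javascript
// Per-field boosting in multi_match via the '^' suffix: here
// phrase.default is weighted twice as heavily as phrase.en.
// Hedged sketch; the weight of 2 is purely illustrative.
const boostedMultiMatch = {
  multi_match: {
    query: 'Menshikov Palace',
    type: 'phrase',
    fields: ['phrase.default^2', 'phrase.en'],
    slop: 2
  }
};
```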
> We would want to maintain the existing query logic for now (and not change too much in one PR)
>
> So if a query type is `phrase` it should stay as `phrase` under `multi_match`.
>
> If there is a query with both `phrase` and `fuzziness` specified then I believe that does not actually do what it sounds like. More info here: #1268
The reason I went down that path is the code here: https://github.com/pelias/query/blob/e141f2bbcb9989548ba6540f1daa820199a3b9b4/view/phrase.js#L25 Looks like the code already puts phrase & fuzziness together. Or am I missing something? :S
I believe that if you specify both `phrase` and `fuzziness` in the same query then the `fuzziness` has no effect at all.
It's super confusing and IMHO is not very well documented by elasticsearch that this is the behaviour.
And yes, we seem to do that, which is wrong.
Right.
I have an unanswered question from earlier:
> Not sure about the "api design". Do we want `phrase:field` to support a list of strings? And detect that in `phrase.js` and act accordingly?
Trying to figure out all the open questions to be able to come up with a precise scope for this PR. Thanks!
Hi,
It would be worth mentioning that apart from the `name.[LANG]` fields that are currently not being matched, there is also sometimes a `name.old` field that might be worth taking into consideration.
Currently, looking for a place by its former name gives no result.
See this example:
- New name: Torre Glories
- Old name: Torre Agbar
- OSM reference: https://www.openstreetmap.org/way/44213122
currently `/search` is only searching against `name.default`; it would be better if the search was performed against any of the `name[*]` values.

rel: pelias/pelias#83
re: pelias/schema#49